LCORE-478: Propagate RAG chunk metadata in the response. #990
base: main
Conversation
Hi @sriroopar. Thanks for your PR. I'm waiting for a lightspeed-core member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Walkthrough

The changes extend the ReferencedDocument model with six additional optional metadata fields (document_id, product_name, product_version, source_path, score, chunk_metadata) and propagate these fields through the document processing pipeline in utils/endpoints.py and query_v2.py. All new fields default to None when not provided by metadata sources.

Changes
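For orientation, a minimal sketch of the extended model described in the walkthrough, assuming Pydantic v2; the existing fields mirror the `Optional[...]` style noted later in the review, and the types and descriptions of the new fields are inferred, not taken from the diff:

```python
from typing import Any, Optional

from pydantic import AnyUrl, BaseModel, Field


class ReferencedDocument(BaseModel):
    """Document referenced in a query response (sketch; new field types are assumed)."""

    # Existing fields
    doc_url: Optional[AnyUrl] = None
    doc_title: Optional[str] = None

    # New optional metadata fields, all defaulting to None
    document_id: str | None = Field(None, description="Identifier of the source document")
    product_name: str | None = Field(None, description="Product the document belongs to")
    product_version: str | None = Field(None, description="Product version the document targets")
    source_path: str | None = Field(None, description="Path of the document at its source")
    score: float | None = Field(None, description="Relevance score from RAG retrieval")
    chunk_metadata: dict[str, Any] | None = Field(None, description="Remaining per-chunk metadata")
```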
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant QueryEndpoint as Query Endpoint
    participant ProcessUtils as Process Utils
    participant RefDoc as ReferencedDocument

    Client->>QueryEndpoint: /query request with RAG chunks
    QueryEndpoint->>ProcessUtils: create_referenced_documents(rag_chunks, metadata_map)
    ProcessUtils->>ProcessUtils: _process_rag_chunks_for_documents()
    Note over ProcessUtils: Extract doc_url, doc_title, metadata_dict, score<br/>from each RAG chunk
    ProcessUtils->>ProcessUtils: Build metadata_dict with:<br/>document_id, product_name,<br/>product_version, source_path,<br/>chunk_metadata
    ProcessUtils->>RefDoc: ReferencedDocument(doc_url, doc_title,<br/>document_id, product_name,<br/>product_version, source_path,<br/>score, chunk_metadata)
    RefDoc-->>ProcessUtils: ReferencedDocument instance
    ProcessUtils-->>QueryEndpoint: list[ReferencedDocument]
    QueryEndpoint-->>Client: Query response with enriched documents
```
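To make the flow above concrete, here is a hedged sketch of how a single RAG chunk plus a metadata-map entry could be folded into a `ReferencedDocument` (reusing the model sketched earlier; the chunk shape, key names, and helper name are illustrative assumptions, not the PR's actual code — the excluded-field set mirrors the one discussed in the review below):

```python
from typing import Any


def build_referenced_document(
    chunk: Any, metadata_map: dict[str, dict[str, Any]]
) -> ReferencedDocument:
    """Illustrative only: merge chunk-level and map-level metadata into one document."""
    meta = dict(getattr(chunk, "metadata", None) or {})
    doc_url = meta.get("docs_url")
    # Enrich with the per-document metadata map when an entry exists
    if doc_url and doc_url in metadata_map:
        meta.update(metadata_map[doc_url])

    # Everything not mapped to a top-level field lands in chunk_metadata
    excluded = {"docs_url", "title", "document_id", "product_name",
                "product_version", "source_path", "source"}
    leftover = {k: v for k, v in meta.items() if k not in excluded}

    return ReferencedDocument(
        doc_url=doc_url,
        doc_title=meta.get("title"),
        document_id=meta.get("document_id"),
        product_name=meta.get("product_name"),
        product_version=meta.get("product_version"),
        source_path=meta.get("source_path"),
        score=getattr(chunk, "score", None),
        chunk_metadata=leftover or None,
    )
```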
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/models/responses.py (1)
342-365: Consider consistent type annotation style. The new fields use `str | None` union syntax while existing fields use `Optional[AnyUrl]` and `Optional[str]`. While both are valid and functionally equivalent, consistency within the same model improves readability.

Also, the docstring mentions `score` should be "0.0 to 1.0" but no validation enforces this. Consider adding a `@field_validator` if out-of-range scores should be rejected.

🔧 Optional: Add score validation
```diff
+from pydantic import field_validator
+
 class ReferencedDocument(BaseModel):
     # ... existing fields ...
     score: float | None = Field(None, description="Relevance score from RAG retrieval")
+
+    @field_validator("score")
+    @classmethod
+    def validate_score_range(cls, v: float | None) -> float | None:
+        """Validate score is within expected range."""
+        if v is not None and not (0.0 <= v <= 1.0):
+            raise ValueError("score must be between 0.0 and 1.0")
+        return v
```
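If the validator were adopted, out-of-range scores would fail at model construction time; a quick hedged check, assuming the model sketched earlier plus the proposed validator:

```python
import pytest
from pydantic import ValidationError

# Boundary values construct cleanly (the range check is inclusive)
ReferencedDocument(doc_title="Doc", score=0.0)
ReferencedDocument(doc_title="Doc", score=1.0)

# Out-of-range values raise once the validator is in place
with pytest.raises(ValidationError):
    ReferencedDocument(doc_title="Doc", score=1.5)
```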
src/app/endpoints/query_v2.py (1)

541-553: Consider extracting available metadata from file_search_call results. The `result` object contains `score` (as seen in `parse_rag_chunks_from_responses_api` at line 487) and an `attributes` dict. This data could populate the new `ReferencedDocument` fields instead of hardcoding `None`:

- `score` is available on `result.score`
- `attributes` may contain `product_name`, `product_version`, etc.

♻️ Proposed enhancement to extract metadata
```diff
 if filename or doc_url:
     final_url = doc_url if doc_url else None
     if (final_url, filename) not in seen_docs:
+        # Extract score if available
+        result_score = (
+            result.get("score") if isinstance(result, dict)
+            else getattr(result, "score", None)
+        )
         documents.append(
             ReferencedDocument(
                 doc_url=final_url,
                 doc_title=filename,
-                document_id=None,
-                product_name=None,
-                product_version=None,
-                source_path=None,
-                score=None,
-                chunk_metadata=None,
+                document_id=attributes.get("document_id"),
+                product_name=attributes.get("product_name"),
+                product_version=attributes.get("product_version"),
+                source_path=attributes.get("source_path"),
+                score=result_score,
+                chunk_metadata={k: v for k, v in attributes.items()
+                                if k not in {"link", "url", "doc_url", "document_id",
+                                             "product_name", "product_version", "source_path"}}
+                or None,
             )
         )
```

src/utils/endpoints.py (1)
549-570: Consider extracting the duplicate `excluded_fields` set to a module constant. The same set of excluded field names is defined in both `_process_document_id` (lines 551-558) and `_add_additional_metadata_docs` (lines 607-614). Extract it to a module-level constant to reduce duplication and ensure consistency.

♻️ Proposed refactor
Add near the top of the file (after imports):
```python
# Fields excluded when building chunk_metadata dict
_METADATA_EXCLUDED_FIELDS = frozenset({
    "docs_url",
    "title",
    "document_id",
    "product_name",
    "product_version",
    "source_path",
    "source",
})
```

Then replace both inline sets:
```diff
-    excluded_fields = {
-        "docs_url",
-        "title",
-        "document_id",
-        "product_name",
-        "product_version",
-        "source_path",
-        "source",
-    }
     additional_metadata = (
-        {k: v for k, v in meta.items() if k not in excluded_fields} if meta else {}
+        {k: v for k, v in meta.items() if k not in _METADATA_EXCLUDED_FIELDS} if meta else {}
     )
```

tests/unit/cache/test_postgres_cache.py (1)
604-637: Inconsistent mock data between insert assertion and retrieval mock. The insert assertion (lines 604-614) validates that all new metadata fields (`document_id`, `product_name`, `product_version`, `source_path`, `score`, `chunk_metadata`) are serialized with `None` values. However, the `db_return_value` (lines 618-628) only includes `doc_url` and `doc_title`, missing these new fields.

While this may still pass due to Pydantic defaults, it creates inconsistent test data and doesn't properly validate the round-trip behavior for the new fields. The retrieval mock should mirror what was asserted during insertion.
♻️ Suggested fix to align mock data
```diff
     # Simulate the database returning that data
     db_return_value = (
         "user message",
         "AI message",
         "foo",
         "bar",
         "start_time",
         "end_time",
-        [{"doc_url": "http://example.com/", "doc_title": "Test Doc"}],
+        [
+            {
+                "doc_url": "http://example.com/",
+                "doc_title": "Test Doc",
+                "document_id": None,
+                "product_name": None,
+                "product_version": None,
+                "source_path": None,
+                "score": None,
+                "chunk_metadata": None,
+            }
+        ],
         None,  # tool_calls
         None,  # tool_results
     )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/app/endpoints/query_v2.py
- src/models/responses.py
- src/utils/endpoints.py
- tests/unit/cache/test_postgres_cache.py
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`:

- Use absolute imports for internal modules: `from authentication import get_auth_dependency`
- Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
- Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`
- Check `constants.py` for shared constants before defining new ones
- All modules start with descriptive docstrings explaining purpose
- Use `logger = logging.getLogger(__name__)` pattern for module logging
- Type aliases defined at module level for clarity
- All functions require docstrings with brief descriptions
- Complete type annotations for function parameters and return types, using `typing_extensions.Self` for model validators
- Use union types with modern syntax: `str | int` instead of `Union[str, int]`
- Use `Optional[Type]` for optional parameters
- Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
- Avoid in-place parameter modification anti-patterns; return new data structures instead
- Use `async def` for I/O operations and external API calls
- Handle `APIConnectionError` from Llama Stack in error handling
- Use `logger.debug()` for detailed diagnostic information
- Use `logger.info()` for general information about program execution
- Use `logger.warning()` for unexpected events or potential problems
- Use `logger.error()` for serious problems that prevented function execution
- All classes require descriptive docstrings explaining purpose
- Use PascalCase for class names with descriptive names and standard suffixes: `Configuration`, `Error`/`Exception`, `Resolver`, `Interface`
- Use ABC for abstract base classes with `@abstractmethod` decorators
- Complete type annotations for all class attributes
- Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
- Run `uv run make format` to auto-format code with black and ruff before completion
- Run `uv run make verify` to run all linters (black, pyl...
Files:
- tests/unit/models/responses/test_rag_chunk.py
- src/models/responses.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
- src/app/endpoints/query_v2.py
- src/utils/endpoints.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:

- Use pytest for all unit and integration tests, not unittest
- Use `pytest-mock` for AsyncMock objects in tests
- Use `MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token")` pattern for authentication mocks in tests
Files:
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
tests/unit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/unit/**/*.py`:

- Unit tests require 60% code coverage
- Write unit tests covering new functionality before completion
Files:
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
src/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/models/**/*.py`:

- Use `@field_validator` and `@model_validator` for custom validation in Pydantic models
- Extend `BaseModel` for data Pydantic models
- Use `@model_validator` and `@field_validator` for Pydantic model validation
Files:
src/models/responses.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling
Files:
src/app/endpoints/query_v2.py
🧠 Learnings (1)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/responses.py
🧬 Code graph analysis (4)
tests/unit/models/responses/test_rag_chunk.py (1)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
tests/unit/utils/test_endpoints.py (2)
- src/utils/endpoints.py (1): `create_referenced_documents` (713-764)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
src/app/endpoints/query_v2.py (1)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
src/utils/endpoints.py (3)
- src/cache/postgres_cache.py (1): `get` (230-317)
- src/cache/sqlite_cache.py (1): `get` (197-284)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E: library mode / ci
🔇 Additional comments (6)
src/app/endpoints/query_v2.py (1)
584-612: Explicit None assignments are acceptable for annotation-based documents. For `url_citation` and `file_citation` annotations, the metadata fields are correctly set to `None` since these sources don't carry RAG-specific metadata like scores or product information. The explicit assignments document this intentional behavior; a sketch of this path follows.
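For context, a hedged sketch of the annotation path with its explicit `None` assignments (the annotation attribute names and helper are assumptions based on typical Responses API shapes, not the PR's code):

```python
from typing import Any


def document_from_url_citation(annotation: Any) -> ReferencedDocument:
    """Illustrative only: annotation-based documents carry no RAG metadata."""
    return ReferencedDocument(
        doc_url=getattr(annotation, "url", None),      # attribute name assumed
        doc_title=getattr(annotation, "title", None),  # attribute name assumed
        # Explicit None values document that citations have no RAG-specific metadata
        document_id=None,
        product_name=None,
        product_version=None,
        source_path=None,
        score=None,
        chunk_metadata=None,
    )
```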
tests/unit/models/responses/test_rag_chunk.py (1)

118-223: Good test coverage for the new ReferencedDocument fields. The tests comprehensively cover:

- Full metadata construction
- Backward compatibility (existing code using only `doc_url`/`doc_title`)
- Partial field population
- Empty `chunk_metadata` dict handling

Consider adding boundary tests for the `score` field (e.g., `0.0`, `1.0`, and potentially invalid values like `-0.1` or `1.5`) if validation is added to the model; a sketch of such tests follows.
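A hedged sketch of those boundary tests, assuming the optional validator from the earlier nitpick is adopted (the import path is assumed, following the repo's absolute-import convention):

```python
import pytest
from pydantic import ValidationError

from models.responses import ReferencedDocument  # import path assumed


@pytest.mark.parametrize("score", [0.0, 0.5, 1.0, None])
def test_score_accepts_valid_values(score):
    """Boundary and absent scores should construct cleanly."""
    doc = ReferencedDocument(doc_title="Doc", score=score)
    assert doc.score == score


@pytest.mark.parametrize("score", [-0.1, 1.5])
def test_score_rejects_out_of_range(score):
    """Out-of-range scores should fail once validation is added."""
    with pytest.raises(ValidationError):
        ReferencedDocument(doc_title="Doc", score=score)
```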
tests/unit/utils/test_endpoints.py (2)

927-984: Thorough test coverage for metadata enrichment. Tests properly validate:

- New fields are present on `ReferencedDocument` objects
- HTTP sources use the URL as `document_id`
- Metadata map enriches `product_name`, `product_version`, `source_path`
- Scores are propagated from chunk objects
1069-1175: Good coverage of full metadata and dict format scenarios. The tests effectively validate:

- `chunk_metadata` captures additional fields (author, creation_date, category) not in the top-level schema
- Backward compatibility when metadata is absent
- Dict format correctly populates all fields including `score` and `chunk_metadata`

src/utils/endpoints.py (2)
656-710: Score tracking and aggregation logic looks correct. The implementation properly:

- Tracks scores by source identifier using a `doc_scores` dict
- Extracts the score from each chunk via `getattr(chunk, "score", None)`
- Preserves the first-seen score per source (avoids overwriting)
- Attaches scores to the final result tuples

One edge case: if the same document appears in both `rag_chunks` and `metadata_map`, the score from the chunk is used (correct behavior since `metadata_map` entries don't have scores). A sketch of the first-seen behavior appears below.
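A minimal sketch of the first-seen score behavior described above (the function name and chunk shape are illustrative assumptions, not the PR's actual implementation):

```python
from typing import Any


def collect_doc_scores(rag_chunks: list[Any]) -> dict[str, float]:
    """Illustrative only: keep the first-seen score per source identifier."""
    doc_scores: dict[str, float] = {}
    for chunk in rag_chunks:
        source = getattr(chunk, "source", None)
        score = getattr(chunk, "score", None)
        # First-seen score wins; duplicates from later chunks don't overwrite it
        if source and score is not None and source not in doc_scores:
            doc_scores[source] = score
    return doc_scores
```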
752-764: Unified ReferencedDocument construction is clean and consistent. Both return formats (dict and object) correctly populate all new fields from the enriched tuple structure. The use of `metadata_dict.get()` safely handles missing keys.
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
cursor
claude
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.