
Conversation


@sriroopar sriroopar commented Jan 13, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
    cursor
  • Generated by: (e.g., tool name and version; N/A if not used)
    claude

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features
    • Referenced documents now capture enhanced metadata including document IDs, product names and versions, source paths, document relevance scores, and chunk-level metadata. These additions provide richer context about referenced sources, improved attribution, and better traceability of information origins while maintaining backward compatibility.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci

openshift-ci bot commented Jan 13, 2026

Hi @sriroopar. Thanks for your PR.

I'm waiting for a lightspeed-core member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

Walkthrough

The changes extend the ReferencedDocument model with six additional optional metadata fields (document_id, product_name, product_version, source_path, score, chunk_metadata) and propagate these fields through the document processing pipeline in utils/endpoints.py and query_v2.py. All new fields default to None when not provided by metadata sources.
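
For orientation, here is a minimal sketch of the extended model as this walkthrough describes it. It is a hypothetical reconstruction; the exact types, descriptions, and defaults in src/models/responses.py may differ.

```python
# Hypothetical sketch based on the walkthrough above, not the exact source.
from typing import Any, Optional

from pydantic import AnyUrl, BaseModel, Field


class ReferencedDocument(BaseModel):
    """Document referenced by a response, with optional RAG metadata."""

    doc_url: Optional[AnyUrl] = Field(None, description="URL of the referenced document")
    doc_title: Optional[str] = Field(None, description="Title of the referenced document")
    # The six new optional fields; all default to None when a metadata
    # source does not provide them.
    document_id: str | None = None
    product_name: str | None = None
    product_version: str | None = None
    source_path: str | None = None
    score: float | None = None
    chunk_metadata: dict[str, Any] | None = None
```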

Changes

  • Model Extension (src/models/responses.py): Added six optional fields to ReferencedDocument: document_id, product_name, product_version, source_path, score, and chunk_metadata. Updated doc_url field description capitalization.
  • Endpoint Integration (src/app/endpoints/query_v2.py): Updated three ReferencedDocument constructor call sites to pass the new optional fields (all set to None) from file_search_call results and URL/file annotations.
  • Metadata Propagation (src/utils/endpoints.py): Refactored four helper functions (_process_http_source, _process_document_id, _add_additional_metadata_docs, _process_rag_chunks_for_documents) to return enriched tuples including metadata_dict and score. Updated create_referenced_documents and wrapper functions to construct ReferencedDocument instances with document_id, product_name, product_version, source_path, chunk_metadata, and score populated from metadata sources; see the sketch after this list.
  • Test Updates (tests/unit/cache/test_postgres_cache.py, tests/unit/models/responses/test_rag_chunk.py, tests/unit/utils/test_endpoints.py): Updated test expectations for ReferencedDocument to include the new fields, and added comprehensive coverage for metadata enrichment scenarios including backward compatibility, full metadata, product-only metadata, scoring, and dict-format output.
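
To make the enriched-tuple flow concrete, here is a hypothetical sketch of how create_referenced_documents might consume one such tuple. The tuple shape, helper name, and metadata keys are assumptions for illustration, not the actual code.

```python
# Hypothetical illustration only; the real helpers in src/utils/endpoints.py
# may use different names and tuple shapes.
from typing import Any

from models.responses import ReferencedDocument

# One enriched tuple per referenced document:
# (doc_url, doc_title, metadata_dict, score)
EnrichedDoc = tuple[str | None, str | None, dict[str, Any], float | None]


def build_referenced_document(entry: EnrichedDoc) -> ReferencedDocument:
    """Construct a ReferencedDocument from one enriched tuple."""
    doc_url, doc_title, metadata_dict, score = entry
    return ReferencedDocument(
        doc_url=doc_url,
        doc_title=doc_title,
        # .get() returns None for missing keys, preserving backward
        # compatibility with metadata sources that lack these fields.
        document_id=metadata_dict.get("document_id"),
        product_name=metadata_dict.get("product_name"),
        product_version=metadata_dict.get("product_version"),
        source_path=metadata_dict.get("source_path"),
        score=score,
        chunk_metadata=metadata_dict.get("chunk_metadata"),
    )
```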

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant QueryEndpoint as Query Endpoint
    participant ProcessUtils as Process Utils
    participant RefDoc as ReferencedDocument

    Client->>QueryEndpoint: /query request with RAG chunks
    QueryEndpoint->>ProcessUtils: create_referenced_documents(rag_chunks, metadata_map)

    ProcessUtils->>ProcessUtils: _process_rag_chunks_for_documents()
    Note over ProcessUtils: Extract doc_url, doc_title, metadata_dict, score<br/>from each RAG chunk

    ProcessUtils->>ProcessUtils: Build metadata_dict with:<br/>document_id, product_name,<br/>product_version, source_path,<br/>chunk_metadata

    ProcessUtils->>RefDoc: ReferencedDocument(doc_url, doc_title,<br/>document_id, product_name,<br/>product_version, source_path,<br/>score, chunk_metadata)

    RefDoc-->>ProcessUtils: ReferencedDocument instance

    ProcessUtils-->>QueryEndpoint: list[ReferencedDocument]
    QueryEndpoint-->>Client: Query response with enriched documents
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • are-ces
  • tisnik
  • asamal4
🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change, propagating RAG chunk metadata in responses, which is directly supported by the expansion of the ReferencedDocument model and the metadata flow across the endpoint utilities.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/models/responses.py (1)

342-365: Consider consistent type annotation style.

The new fields use str | None union syntax while existing fields use Optional[AnyUrl] and Optional[str]. While both are valid and functionally equivalent, consistency within the same model improves readability.

Also, the docstring mentions score should be "0.0 to 1.0" but no validation enforces this. Consider adding a @field_validator if out-of-range scores should be rejected.

🔧 Optional: Add score validation
```diff
+from pydantic import field_validator
+
 class ReferencedDocument(BaseModel):
     # ... existing fields ...

     score: float | None = Field(None, description="Relevance score from RAG retrieval")
+
+    @field_validator("score")
+    @classmethod
+    def validate_score_range(cls, v: float | None) -> float | None:
+        """Validate score is within expected range."""
+        if v is not None and not (0.0 <= v <= 1.0):
+            raise ValueError("score must be between 0.0 and 1.0")
+        return v
```
src/app/endpoints/query_v2.py (1)

541-553: Consider extracting available metadata from file_search_call results.

The result object contains score (as seen in parse_rag_chunks_from_responses_api at line 487) and attributes dict. This data could populate the new ReferencedDocument fields instead of hardcoding None:

  • score is available on result.score
  • attributes may contain product_name, product_version, etc.
♻️ Proposed enhancement to extract metadata
```diff
                 if filename or doc_url:
                     final_url = doc_url if doc_url else None
                     if (final_url, filename) not in seen_docs:
+                        # Extract score if available
+                        result_score = (
+                            result.get("score") if isinstance(result, dict)
+                            else getattr(result, "score", None)
+                        )
                         documents.append(
                             ReferencedDocument(
                                 doc_url=final_url,
                                 doc_title=filename,
-                                document_id=None,
-                                product_name=None,
-                                product_version=None,
-                                source_path=None,
-                                score=None,
-                                chunk_metadata=None,
+                                document_id=attributes.get("document_id"),
+                                product_name=attributes.get("product_name"),
+                                product_version=attributes.get("product_version"),
+                                source_path=attributes.get("source_path"),
+                                score=result_score,
+                                chunk_metadata={k: v for k, v in attributes.items()
+                                    if k not in {"link", "url", "doc_url", "document_id",
+                                                 "product_name", "product_version", "source_path"}}
+                                    or None,
                             )
                         )
```
src/utils/endpoints.py (1)

549-570: Consider extracting duplicate excluded_fields set to a module constant.

The same set of excluded field names is defined in both _process_document_id (lines 551-558) and _add_additional_metadata_docs (lines 607-614). Extract to a module-level constant to reduce duplication and ensure consistency.

♻️ Proposed refactor

Add near the top of the file (after imports):

```python
# Fields excluded when building chunk_metadata dict
_METADATA_EXCLUDED_FIELDS = frozenset({
    "docs_url",
    "title",
    "document_id",
    "product_name",
    "product_version",
    "source_path",
    "source",
})
```

Then replace both inline sets:

```diff
-    excluded_fields = {
-        "docs_url",
-        "title",
-        "document_id",
-        "product_name",
-        "product_version",
-        "source_path",
-        "source",
-    }
     additional_metadata = (
-        {k: v for k, v in meta.items() if k not in excluded_fields} if meta else {}
+        {k: v for k, v in meta.items() if k not in _METADATA_EXCLUDED_FIELDS} if meta else {}
     )
```
tests/unit/cache/test_postgres_cache.py (1)

604-637: Inconsistent mock data between insert assertion and retrieval mock.

The insert assertion (lines 604-614) validates that all new metadata fields (document_id, product_name, product_version, source_path, score, chunk_metadata) are serialized with None values. However, the db_return_value (lines 618-628) only includes doc_url and doc_title, missing these new fields.

While this may still pass due to Pydantic defaults, it creates inconsistent test data and doesn't properly validate the round-trip behavior for the new fields. The retrieval mock should mirror what was asserted during insertion.

♻️ Suggested fix to align mock data
```diff
     # Simulate the database returning that data
     db_return_value = (
         "user message",
         "AI message",
         "foo",
         "bar",
         "start_time",
         "end_time",
-        [{"doc_url": "http://example.com/", "doc_title": "Test Doc"}],
+        [
+            {
+                "doc_url": "http://example.com/",
+                "doc_title": "Test Doc",
+                "document_id": None,
+                "product_name": None,
+                "product_version": None,
+                "source_path": None,
+                "score": None,
+                "chunk_metadata": None,
+            }
+        ],
         None,  # tool_calls
         None,  # tool_results
     )
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9989b7a and 35d8377.

📒 Files selected for processing (6)
  • src/app/endpoints/query_v2.py
  • src/models/responses.py
  • src/utils/endpoints.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/models/responses/test_rag_chunk.py
  • tests/unit/utils/test_endpoints.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...

Files:

  • tests/unit/models/responses/test_rag_chunk.py
  • src/models/responses.py
  • tests/unit/utils/test_endpoints.py
  • tests/unit/cache/test_postgres_cache.py
  • src/app/endpoints/query_v2.py
  • src/utils/endpoints.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests

Files:

  • tests/unit/models/responses/test_rag_chunk.py
  • tests/unit/utils/test_endpoints.py
  • tests/unit/cache/test_postgres_cache.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion

Files:

  • tests/unit/models/responses/test_rag_chunk.py
  • tests/unit/utils/test_endpoints.py
  • tests/unit/cache/test_postgres_cache.py
src/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/models/**/*.py: Use @field_validator and @model_validator for custom validation in Pydantic models
Extend BaseModel for data Pydantic models
Use @model_validator and @field_validator for Pydantic model validation

Files:

  • src/models/responses.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/query_v2.py
🧠 Learnings (1)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/responses.py
🧬 Code graph analysis (4)
tests/unit/models/responses/test_rag_chunk.py (1)
src/models/responses.py (1)
  • ReferencedDocument (328-365)
tests/unit/utils/test_endpoints.py (2)
src/utils/endpoints.py (1)
  • create_referenced_documents (713-764)
src/models/responses.py (1)
  • ReferencedDocument (328-365)
src/app/endpoints/query_v2.py (1)
src/models/responses.py (1)
  • ReferencedDocument (328-365)
src/utils/endpoints.py (3)
src/cache/postgres_cache.py (1)
  • get (230-317)
src/cache/sqlite_cache.py (1)
  • get (197-284)
src/models/responses.py (1)
  • ReferencedDocument (328-365)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: E2E: server mode / ci
  • GitHub Check: E2E: library mode / ci
🔇 Additional comments (6)
src/app/endpoints/query_v2.py (1)

584-612: Explicit None assignments are acceptable for annotation-based documents.

For url_citation and file_citation annotations, the metadata fields are correctly set to None since these sources don't carry RAG-specific metadata like scores or product information. The explicit assignments document this intentional behavior.

tests/unit/models/responses/test_rag_chunk.py (1)

118-223: Good test coverage for the new ReferencedDocument fields.

The tests comprehensively cover:

  • Full metadata construction
  • Backward compatibility (existing code using only doc_url/doc_title)
  • Partial field population
  • Empty chunk_metadata dict handling

Consider adding boundary tests for the score field (e.g., 0.0, 1.0, and potentially invalid values like -0.1 or 1.5) if validation is added to the model.
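
A sketch of what those boundary tests could look like, assuming the optional validator suggested earlier is adopted; the test names and constructor usage are illustrative, not part of the PR.

```python
import pytest
from pydantic import ValidationError

from models.responses import ReferencedDocument


@pytest.mark.parametrize("score", [0.0, 0.5, 1.0])
def test_score_accepts_in_range_values(score: float) -> None:
    """Scores at and inside the 0.0-1.0 boundaries are accepted."""
    doc = ReferencedDocument(doc_url=None, doc_title="Doc", score=score)
    assert doc.score == score


@pytest.mark.parametrize("score", [-0.1, 1.5])
def test_score_rejects_out_of_range_values(score: float) -> None:
    """Out-of-range scores fail only once the optional validator is added."""
    with pytest.raises(ValidationError):
        ReferencedDocument(doc_url=None, doc_title="Doc", score=score)
```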

tests/unit/utils/test_endpoints.py (2)

927-984: Thorough test coverage for metadata enrichment.

Tests properly validate:

  • New fields are present on ReferencedDocument objects
  • HTTP sources use URL as document_id
  • Metadata map enriches product_name, product_version, source_path
  • Scores are propagated from chunk objects

1069-1175: Good coverage of full metadata and dict format scenarios.

The tests effectively validate:

  • chunk_metadata captures additional fields (author, creation_date, category) not in top-level schema
  • Backward compatibility when metadata is absent
  • Dict format correctly populates all fields including score and chunk_metadata
src/utils/endpoints.py (2)

656-710: Score tracking and aggregation logic looks correct.

The implementation properly:

  • Tracks scores by source identifier using doc_scores dict
  • Extracts score from chunk via getattr(chunk, "score", None)
  • Preserves first-seen score per source (avoids overwriting)
  • Attaches scores to final result tuples

One edge case: if the same document appears in both rag_chunks and metadata_map, the score from the chunk is used (correct behavior since metadata_map entries don't have scores).
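
For illustration, the first-seen-score pattern described above might look like the following; the function and attribute names are hypothetical and do not mirror the actual helper.

```python
# Illustrative only; does not reproduce _process_rag_chunks_for_documents exactly.
from typing import Any


def collect_doc_scores(chunks: list[Any]) -> dict[str, float]:
    """Map each source identifier to the first score seen for it."""
    doc_scores: dict[str, float] = {}
    for chunk in chunks:
        source = getattr(chunk, "source", None)
        score = getattr(chunk, "score", None)
        if source is not None and score is not None and source not in doc_scores:
            # First-seen score wins; later chunks never overwrite it.
            doc_scores[source] = score
    return doc_scores
```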


752-764: Unified ReferencedDocument construction is clean and consistent.

Both return formats (dict and object) correctly populate all new fields from the enriched tuple structure. The use of metadata_dict.get() safely handles missing keys.
