LCORE-478: Propagate RAG chunk metadata in the response. #990
base: main
Conversation
Hi @sriroopar. Thanks for your PR. I'm waiting for a lightspeed-core member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Walkthrough

The changes extend the ReferencedDocument model with six additional optional metadata fields (document_id, product_name, product_version, source_path, score, chunk_metadata) and propagate these fields through the document processing pipeline in utils/endpoints.py and query_v2.py. All new fields default to None when not provided by metadata sources.

Changes
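For orientation, a minimal sketch of the extended model described in the walkthrough, assuming Pydantic v2; the existing fields mirror the `Optional[...]` style noted later in the review, and the types and descriptions of the new fields are inferred, not taken from the diff:

```python
from typing import Any, Optional

from pydantic import AnyUrl, BaseModel, Field


class ReferencedDocument(BaseModel):
    """Document referenced in a query response (sketch; new field types are assumed)."""

    # Existing fields
    doc_url: Optional[AnyUrl] = None
    doc_title: Optional[str] = None

    # New optional metadata fields, all defaulting to None
    document_id: str | None = Field(None, description="Identifier of the source document")
    product_name: str | None = Field(None, description="Product the document belongs to")
    product_version: str | None = Field(None, description="Product version the document targets")
    source_path: str | None = Field(None, description="Path of the document at its source")
    score: float | None = Field(None, description="Relevance score from RAG retrieval")
    chunk_metadata: dict[str, Any] | None = Field(None, description="Remaining per-chunk metadata")
```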
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant QueryEndpoint as Query Endpoint
    participant ProcessUtils as Process Utils
    participant RefDoc as ReferencedDocument

    Client->>QueryEndpoint: /query request with RAG chunks
    QueryEndpoint->>ProcessUtils: create_referenced_documents(rag_chunks, metadata_map)
    ProcessUtils->>ProcessUtils: _process_rag_chunks_for_documents()
    Note over ProcessUtils: Extract doc_url, doc_title, metadata_dict, score<br/>from each RAG chunk
    ProcessUtils->>ProcessUtils: Build metadata_dict with:<br/>document_id, product_name,<br/>product_version, source_path,<br/>chunk_metadata
    ProcessUtils->>RefDoc: ReferencedDocument(doc_url, doc_title,<br/>document_id, product_name,<br/>product_version, source_path,<br/>score, chunk_metadata)
    RefDoc-->>ProcessUtils: ReferencedDocument instance
    ProcessUtils-->>QueryEndpoint: list[ReferencedDocument]
    QueryEndpoint-->>Client: Query response with enriched documents
```
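To make the flow above concrete, here is a hedged sketch of how a single RAG chunk plus a metadata-map entry could be folded into a `ReferencedDocument` (reusing the model sketched earlier; the chunk shape, key names, and helper name are illustrative assumptions, not the PR's actual code — the excluded-field set mirrors the one discussed in the review below):

```python
from typing import Any


def build_referenced_document(
    chunk: Any, metadata_map: dict[str, dict[str, Any]]
) -> ReferencedDocument:
    """Illustrative only: merge chunk-level and map-level metadata into one document."""
    meta = dict(getattr(chunk, "metadata", None) or {})
    doc_url = meta.get("docs_url")
    # Enrich with the per-document metadata map when an entry exists
    if doc_url and doc_url in metadata_map:
        meta.update(metadata_map[doc_url])

    # Everything not mapped to a top-level field lands in chunk_metadata
    excluded = {"docs_url", "title", "document_id", "product_name",
                "product_version", "source_path", "source"}
    leftover = {k: v for k, v in meta.items() if k not in excluded}

    return ReferencedDocument(
        doc_url=doc_url,
        doc_title=meta.get("title"),
        document_id=meta.get("document_id"),
        product_name=meta.get("product_name"),
        product_version=meta.get("product_version"),
        source_path=meta.get("source_path"),
        score=getattr(chunk, "score", None),
        chunk_metadata=leftover or None,
    )
```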
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/models/responses.py (1)
342-365: Consider consistent type annotation style. The new fields use `str | None` union syntax while existing fields use `Optional[AnyUrl]` and `Optional[str]`. While both are valid and functionally equivalent, consistency within the same model improves readability.

Also, the docstring mentions `score` should be "0.0 to 1.0" but no validation enforces this. Consider adding a `@field_validator` if out-of-range scores should be rejected.

🔧 Optional: Add score validation
```diff
+from pydantic import field_validator
+
 class ReferencedDocument(BaseModel):
     # ... existing fields ...
     score: float | None = Field(None, description="Relevance score from RAG retrieval")
+
+    @field_validator("score")
+    @classmethod
+    def validate_score_range(cls, v: float | None) -> float | None:
+        """Validate score is within expected range."""
+        if v is not None and not (0.0 <= v <= 1.0):
+            raise ValueError("score must be between 0.0 and 1.0")
+        return v
```
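If the validator were adopted, out-of-range scores would fail at model construction time; a quick hedged check, assuming the model sketched earlier plus the proposed validator:

```python
import pytest
from pydantic import ValidationError

# Boundary values construct cleanly (the range check is inclusive)
ReferencedDocument(doc_title="Doc", score=0.0)
ReferencedDocument(doc_title="Doc", score=1.0)

# Out-of-range values raise once the validator is in place
with pytest.raises(ValidationError):
    ReferencedDocument(doc_title="Doc", score=1.5)
```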
src/app/endpoints/query_v2.py (1)

541-553: Consider extracting available metadata from file_search_call results. The `result` object contains `score` (as seen in `parse_rag_chunks_from_responses_api` at line 487) and an `attributes` dict. This data could populate the new `ReferencedDocument` fields instead of hardcoding `None`:

- `score` is available on `result.score`
- `attributes` may contain `product_name`, `product_version`, etc.

♻️ Proposed enhancement to extract metadata
```diff
 if filename or doc_url:
     final_url = doc_url if doc_url else None
     if (final_url, filename) not in seen_docs:
+        # Extract score if available
+        result_score = (
+            result.get("score") if isinstance(result, dict)
+            else getattr(result, "score", None)
+        )
         documents.append(
             ReferencedDocument(
                 doc_url=final_url,
                 doc_title=filename,
-                document_id=None,
-                product_name=None,
-                product_version=None,
-                source_path=None,
-                score=None,
-                chunk_metadata=None,
+                document_id=attributes.get("document_id"),
+                product_name=attributes.get("product_name"),
+                product_version=attributes.get("product_version"),
+                source_path=attributes.get("source_path"),
+                score=result_score,
+                chunk_metadata={k: v for k, v in attributes.items()
+                                if k not in {"link", "url", "doc_url", "document_id",
+                                             "product_name", "product_version", "source_path"}}
+                or None,
             )
         )
```

src/utils/endpoints.py (1)
549-570: Consider extracting the duplicate `excluded_fields` set to a module constant. The same set of excluded field names is defined in both `_process_document_id` (lines 551-558) and `_add_additional_metadata_docs` (lines 607-614). Extract it to a module-level constant to reduce duplication and ensure consistency.

♻️ Proposed refactor
Add near the top of the file (after imports):
```python
# Fields excluded when building chunk_metadata dict
_METADATA_EXCLUDED_FIELDS = frozenset({
    "docs_url",
    "title",
    "document_id",
    "product_name",
    "product_version",
    "source_path",
    "source",
})
```

Then replace both inline sets:
```diff
-    excluded_fields = {
-        "docs_url",
-        "title",
-        "document_id",
-        "product_name",
-        "product_version",
-        "source_path",
-        "source",
-    }
     additional_metadata = (
-        {k: v for k, v in meta.items() if k not in excluded_fields} if meta else {}
+        {k: v for k, v in meta.items() if k not in _METADATA_EXCLUDED_FIELDS} if meta else {}
     )
```

tests/unit/cache/test_postgres_cache.py (1)
604-637: Inconsistent mock data between insert assertion and retrieval mock. The insert assertion (lines 604-614) validates that all new metadata fields (`document_id`, `product_name`, `product_version`, `source_path`, `score`, `chunk_metadata`) are serialized with `None` values. However, the `db_return_value` (lines 618-628) only includes `doc_url` and `doc_title`, missing these new fields.

While this may still pass due to Pydantic defaults, it creates inconsistent test data and doesn't properly validate the round-trip behavior for the new fields. The retrieval mock should mirror what was asserted during insertion.
♻️ Suggested fix to align mock data
```diff
     # Simulate the database returning that data
     db_return_value = (
         "user message",
         "AI message",
         "foo",
         "bar",
         "start_time",
         "end_time",
-        [{"doc_url": "http://example.com/", "doc_title": "Test Doc"}],
+        [
+            {
+                "doc_url": "http://example.com/",
+                "doc_title": "Test Doc",
+                "document_id": None,
+                "product_name": None,
+                "product_version": None,
+                "source_path": None,
+                "score": None,
+                "chunk_metadata": None,
+            }
+        ],
         None,  # tool_calls
         None,  # tool_results
     )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/app/endpoints/query_v2.py
- src/models/responses.py
- src/utils/endpoints.py
- tests/unit/cache/test_postgres_cache.py
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`:

- Use absolute imports for internal modules: `from authentication import get_auth_dependency`
- Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
- Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`
- Check `constants.py` for shared constants before defining new ones
- All modules start with descriptive docstrings explaining purpose
- Use `logger = logging.getLogger(__name__)` pattern for module logging
- Type aliases defined at module level for clarity
- All functions require docstrings with brief descriptions
- Complete type annotations for function parameters and return types, using `typing_extensions.Self` for model validators
- Use union types with modern syntax: `str | int` instead of `Union[str, int]`
- Use `Optional[Type]` for optional parameters
- Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
- Avoid in-place parameter modification anti-patterns; return new data structures instead
- Use `async def` for I/O operations and external API calls
- Handle `APIConnectionError` from Llama Stack in error handling
- Use `logger.debug()` for detailed diagnostic information
- Use `logger.info()` for general information about program execution
- Use `logger.warning()` for unexpected events or potential problems
- Use `logger.error()` for serious problems that prevented function execution
- All classes require descriptive docstrings explaining purpose
- Use PascalCase for class names with descriptive names and standard suffixes: `Configuration`, `Error`/`Exception`, `Resolver`, `Interface`
- Use ABC for abstract base classes with `@abstractmethod` decorators
- Complete type annotations for all class attributes
- Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
- Run `uv run make format` to auto-format code with black and ruff before completion
- Run `uv run make verify` to run all linters (black, pyl...
Files:
- tests/unit/models/responses/test_rag_chunk.py
- src/models/responses.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
- src/app/endpoints/query_v2.py
- src/utils/endpoints.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:

- Use pytest for all unit and integration tests, not unittest
- Use `pytest-mock` for AsyncMock objects in tests
- Use `MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token")` pattern for authentication mocks in tests
Files:
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
tests/unit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/unit/**/*.py`:

- Unit tests require 60% code coverage
- Write unit tests covering new functionality before completion
Files:
- tests/unit/models/responses/test_rag_chunk.py
- tests/unit/utils/test_endpoints.py
- tests/unit/cache/test_postgres_cache.py
src/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/models/**/*.py`:

- Use `@field_validator` and `@model_validator` for custom validation in Pydantic models
- Extend `BaseModel` for data Pydantic models
- Use `@model_validator` and `@field_validator` for Pydantic model validation
Files:
src/models/responses.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling
Files:
src/app/endpoints/query_v2.py
🧠 Learnings (1)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/responses.py
🧬 Code graph analysis (4)
tests/unit/models/responses/test_rag_chunk.py (1)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
tests/unit/utils/test_endpoints.py (2)
- src/utils/endpoints.py (1): `create_referenced_documents` (713-764)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
src/app/endpoints/query_v2.py (1)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
src/utils/endpoints.py (3)
- src/cache/postgres_cache.py (1): `get` (230-317)
- src/cache/sqlite_cache.py (1): `get` (197-284)
- src/models/responses.py (1): `ReferencedDocument` (328-365)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E: library mode / ci
🔇 Additional comments (6)
src/app/endpoints/query_v2.py (1)
584-612: Explicit None assignments are acceptable for annotation-based documents. For `url_citation` and `file_citation` annotations, the metadata fields are correctly set to `None` since these sources don't carry RAG-specific metadata like scores or product information. The explicit assignments document this intentional behavior; a sketch of this path follows.
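For context, a hedged sketch of the annotation path with its explicit `None` assignments (the annotation attribute names and helper are assumptions based on typical Responses API shapes, not the PR's code):

```python
from typing import Any


def document_from_url_citation(annotation: Any) -> ReferencedDocument:
    """Illustrative only: annotation-based documents carry no RAG metadata."""
    return ReferencedDocument(
        doc_url=getattr(annotation, "url", None),      # attribute name assumed
        doc_title=getattr(annotation, "title", None),  # attribute name assumed
        # Explicit None values document that citations have no RAG-specific metadata
        document_id=None,
        product_name=None,
        product_version=None,
        source_path=None,
        score=None,
        chunk_metadata=None,
    )
```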
tests/unit/models/responses/test_rag_chunk.py (1)

118-223: Good test coverage for the new ReferencedDocument fields. The tests comprehensively cover:

- Full metadata construction
- Backward compatibility (existing code using only `doc_url`/`doc_title`)
- Partial field population
- Empty `chunk_metadata` dict handling

Consider adding boundary tests for the `score` field (e.g., `0.0`, `1.0`, and potentially invalid values like `-0.1` or `1.5`) if validation is added to the model; a sketch of such tests follows.
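A hedged sketch of those boundary tests, assuming the optional validator from the earlier nitpick is adopted (the import path is assumed, following the repo's absolute-import convention):

```python
import pytest
from pydantic import ValidationError

from models.responses import ReferencedDocument  # import path assumed


@pytest.mark.parametrize("score", [0.0, 0.5, 1.0, None])
def test_score_accepts_valid_values(score):
    """Boundary and absent scores should construct cleanly."""
    doc = ReferencedDocument(doc_title="Doc", score=score)
    assert doc.score == score


@pytest.mark.parametrize("score", [-0.1, 1.5])
def test_score_rejects_out_of_range(score):
    """Out-of-range scores should fail once validation is added."""
    with pytest.raises(ValidationError):
        ReferencedDocument(doc_title="Doc", score=score)
```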
tests/unit/utils/test_endpoints.py (2)

927-984: Thorough test coverage for metadata enrichment. Tests properly validate:

- New fields are present on `ReferencedDocument` objects
- HTTP sources use the URL as `document_id`
- Metadata map enriches `product_name`, `product_version`, `source_path`
- Scores are propagated from chunk objects
1069-1175: Good coverage of full metadata and dict format scenarios. The tests effectively validate:

- `chunk_metadata` captures additional fields (author, creation_date, category) not in the top-level schema
- Backward compatibility when metadata is absent
- Dict format correctly populates all fields including `score` and `chunk_metadata`

src/utils/endpoints.py (2)
656-710: Score tracking and aggregation logic looks correct. The implementation properly:

- Tracks scores by source identifier using a `doc_scores` dict
- Extracts the score from each chunk via `getattr(chunk, "score", None)`
- Preserves the first-seen score per source (avoids overwriting)
- Attaches scores to the final result tuples

One edge case: if the same document appears in both `rag_chunks` and `metadata_map`, the score from the chunk is used (correct behavior since `metadata_map` entries don't have scores). A sketch of the first-seen behavior appears below.
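A minimal sketch of the first-seen score behavior described above (the function name and chunk shape are illustrative assumptions, not the PR's actual implementation):

```python
from typing import Any


def collect_doc_scores(rag_chunks: list[Any]) -> dict[str, float]:
    """Illustrative only: keep the first-seen score per source identifier."""
    doc_scores: dict[str, float] = {}
    for chunk in rag_chunks:
        source = getattr(chunk, "source", None)
        score = getattr(chunk, "score", None)
        # First-seen score wins; duplicates from later chunks don't overwrite it
        if source and score is not None and source not in doc_scores:
            doc_scores[source] = score
    return doc_scores
```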
752-764: Unified ReferencedDocument construction is clean and consistent. Both return formats (dict and object) correctly populate all new fields from the enriched tuple structure. The use of `metadata_dict.get()` safely handles missing keys.
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
cursor
claude
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.