-
Notifications
You must be signed in to change notification settings - Fork 71
Feat: add trace info & task storage #168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @ChenZiHong-Gavin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive tracing and task storage system throughout the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces significant refactoring across the graphgen library, primarily focusing on standardizing data handling and persistence within BaseOperator and its subclasses. Key changes include modifying BaseOperator to integrate a KV storage backend (rocksdb by default) for caching and recovery, introducing split and store methods to manage data flow and metadata, and updating the process method signature to return a tuple of results and metadata updates. The generate and parse_response methods across various generator models (AggregatedGenerator, AtomicGenerator, CoTGenerator, FillInBlankGenerator, MultiAnswerGenerator, MultiChoiceGenerator, MultiHopGenerator, TrueFalseGenerator, VQAGenerator) were updated to consistently return lists of dictionaries instead of dictionaries keyed by content hash, and their internal logic was adjusted accordingly. The format_generation_results method in base_generator.py was also refactored to process single results. Additionally, RocksDBCache was removed and its functionality integrated into the general KV storage. File readers (CSVReader, JSONReader, ParquetReader, RDFReader, TXTReader) were enhanced to include file paths in their output and to use the new KV storage for caching read data. The ParallelFileScanner was updated to use the new BaseKVStorage interface. A review comment highlighted an AttributeError in schema_guided_extractor.py where a Chunk object was incorrectly accessed using dictionary-like .get() instead of its id attribute.
|
/gemini summary |
Summary of ChangesThis pull request significantly enhances the Highlights
Changelog
Activity
|
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is a significant and well-executed refactoring that introduces a robust tracing and task storage system, greatly improving the framework's resilience and modularity. The core architectural change to BaseOperator to handle caching and recovery is a major enhancement. The refactoring of the evaluation system into more specific targets (qa, kg, triple) and the simplification of the data flow from generators are also excellent improvements. Overall, these changes provide a solid foundation for building more complex and reliable data processing pipelines. I've found one potential issue that needs addressing.
| meta_update.setdefault(input_trace_id, []).append(result["_trace_id"]) | ||
| self.graph_storage.index_done_callback() | ||
|
|
||
| return results, meta_update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process method returns the results list directly from run_concurrent. However, results can contain None for tasks that failed, as run_concurrent now preserves the list length and inserts None for errors. When this list is passed to the store method in the BaseOperator, it will cause a TypeError when trying to access res["_trace_id"] on a None value. The to_store list is correctly filtered to contain only valid results and should be returned instead.
| return results, meta_update | |
| return to_store, meta_update |
This pull request significantly enhances the
graphgenframework by introducing a robust tracing and task storage system. The coreBaseOperatornow integrates a Key-Value storage backend, allowing for the persistence and recovery of intermediate processing states. This change standardizes data tracking using unique_trace_ids, refines data flow within operators, and improves overall system robustness. Additionally, the evaluation system has been refactored to support more specific evaluation types, generator outputs are streamlined for consistency, and file reading capabilities are enhanced with better path handling and caching.