Feat: add trace info & task storage #168

ChenZiHong-Gavin · 2026-01-29T03:37:47Z

This pull request significantly enhances the graphgen framework by introducing a robust tracing and task storage system. The core BaseOperator now integrates a Key-Value storage backend, allowing for the persistence and recovery of intermediate processing states. This change standardizes data tracking using unique _trace_ids, refines data flow within operators, and improves overall system robustness. Additionally, the evaluation system has been refactored to support more specific evaluation types, generator outputs are streamlined for consistency, and file reading capabilities are enhanced with better path handling and caching.


gemini-code-assist · 2026-01-29T03:38:43Z

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive tracing and task storage system throughout the graphgen framework. By integrating a Key-Value storage backend into the core BaseOperator and its specialized subclasses, the system gains the ability to persist and recover intermediate processing states. This enhancement standardizes how data is tracked using unique _trace_ids, refines the data flow within various operators, and improves the overall robustness and observability of the graph generation pipeline. The changes also streamline generator outputs and enhance file reading capabilities by incorporating path information and caching mechanisms.

Highlights

Centralized Task Storage: Introduced a Key-Value (KV) storage mechanism (kv_storage) within BaseOperator and its subclasses to persist intermediate task results and metadata, enhancing fault tolerance and traceability across the pipeline.
Standardized Data Tracing: Implemented _trace_id generation using content hashing for all processed items, ensuring consistent tracking and recovery of data across different operational stages, from reading to generation.
Refactored Operator Processing: Significantly updated the process method in BaseOperator and its derivatives to include batch splitting into to_process and recovered items, and robust storage of results and associated metadata (meta_update).
Streamlined Generator Outputs: Generators now consistently return lists of QA dictionaries, simplifying downstream processing and removing the need for content-based hashing within the generator output itself, aligning with the new tracing approach.
Enhanced File Scanning and Reading: File readers and the parallel file scanner now include path information and integrate with the new KV storage for caching scan results, improving efficiency and data provenance.
Codebase Simplification: Removed redundant components such as RocksDBCache and community_visualizer.py, and simplified hash functions by deprecating compute_mm_hash.
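
The content-hash idea behind `_trace_id` can be sketched roughly as follows. This is an illustration only, not the PR's implementation: the helper name `make_trace_id`, the use of SHA-256, the prefix argument, and the canonical-JSON serialization are all assumptions.

```python
import hashlib
import json


def make_trace_id(item: dict, prefix: str = "read") -> str:
    """Derive a deterministic ID from an item's content (hypothetical helper)."""
    # Ignore any existing trace ID so re-hashing a tagged item gives the same value.
    payload = {k: v for k, v in item.items() if k != "_trace_id"}
    # sort_keys yields a canonical form, so key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest}"


doc = {"content": "GraphGen builds knowledge graphs.", "path": "demo.txt"}
tagged = {**doc, "_trace_id": make_trace_id(doc)}
```

Because the ID depends only on content, re-reading the same record after a restart produces the same key, which is what makes lookup-based recovery possible.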

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

graphgen/operators/partition/partition_service.py

gemini-code-assist

Code Review

This pull request introduces significant refactoring across the graphgen library, primarily focusing on standardizing data handling and persistence within BaseOperator and its subclasses. Key changes include modifying BaseOperator to integrate a KV storage backend (rocksdb by default) for caching and recovery, introducing split and store methods to manage data flow and metadata, and updating the process method signature to return a tuple of results and metadata updates. The generate and parse_response methods across various generator models (AggregatedGenerator, AtomicGenerator, CoTGenerator, FillInBlankGenerator, MultiAnswerGenerator, MultiChoiceGenerator, MultiHopGenerator, TrueFalseGenerator, VQAGenerator) were updated to consistently return lists of dictionaries instead of dictionaries keyed by content hash, and their internal logic was adjusted accordingly. The format_generation_results method in base_generator.py was also refactored to process single results. Additionally, RocksDBCache was removed and its functionality integrated into the general KV storage. File readers (CSVReader, JSONReader, ParquetReader, RDFReader, TXTReader) were enhanced to include file paths in their output and to use the new KV storage for caching read data. The ParallelFileScanner was updated to use the new BaseKVStorage interface. A review comment highlighted an AttributeError in schema_guided_extractor.py where a Chunk object was incorrectly accessed using dictionary-like .get() instead of its id attribute.
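
The change in generator output shape described above (a dictionary keyed by content hash becoming a plain list of dictionaries) can be pictured with a small sketch. The function name `flatten_keyed_results` and the `question`/`answer` field names are assumptions for illustration; the actual generator fields are not shown in this conversation.

```python
from typing import Any


def flatten_keyed_results(keyed: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert the old hash-keyed shape into a plain list of QA dicts."""
    # Dict insertion order is preserved, so the list keeps the original order.
    return list(keyed.values())


# Old shape: each QA pair sits under a content-hash key (keys here are made up).
old_style = {
    "hash-a": {"question": "What is a node?", "answer": "An entity."},
    "hash-b": {"question": "What is an edge?", "answer": "A relation."},
}

# New shape: a list of dicts; identity is carried separately by the trace ID.
new_style = flatten_keyed_results(old_style)
```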

graphgen/models/extractor/schema_guided_extractor.py

ChenZiHong-Gavin · 2026-01-29T10:14:54Z

/gemini summary

gemini-code-assist · 2026-01-29T10:16:07Z

Summary of Changes

This pull request significantly enhances the graphgen framework by introducing a robust tracing and task storage system. The core BaseOperator now integrates a Key-Value storage backend, allowing for the persistence and recovery of intermediate processing states. This change standardizes data tracking using unique _trace_ids, refines data flow within operators, and improves overall system robustness. Additionally, the evaluation system has been refactored to support more specific evaluation types, generator outputs are streamlined for consistency, and file reading capabilities are enhanced with better path handling and caching.

Highlights

Comprehensive Tracing and Task Storage: Integrated a Key-Value (KV) storage backend into the core BaseOperator and its subclasses, enabling persistence and recovery of intermediate processing states. This includes standardized _trace_id generation using content hashing and new split and store methods for robust data flow.
Refactored Evaluation System: Replaced a generic BaseEvaluator with specialized BaseQAEvaluator, BaseKGEvaluator, and BaseTripleEvaluator classes. This introduces new triple evaluation capabilities and refines existing QA and KG evaluations, with evaluators now returning dictionaries of scores asynchronously.
Simplified Generator Output Handling: Standardized generator outputs to consistently return list[dict] for QA pairs, simplifying downstream processing and removing internal content-based hashing logic within the generators themselves.
Enhanced File Reading and Scanning: Updated various file readers (csv, json, parquet, rdf, txt) to include path information and integrated the parallel file scanner with the new KV storage for efficient caching of scan results, improving data provenance.
Streamlined Operator Processing: Modified process methods across all major operators (BuildKGService, ChunkService, ExtractService, GenerateService, JudgeService, PartitionService, QuizService) to leverage the new tracing and storage mechanisms, enhancing fault tolerance and data flow management.
Codebase Cleanup and Simplification: Removed deprecated components such as RocksDBCache, community_visualizer.py, and related hash functions (compute_mm_hash), leading to a cleaner and more focused codebase. Older KG accuracy and consistency evaluators and their templates were also removed.
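
The split-then-store flow mentioned in the highlights might look roughly like the sketch below. Plain dictionaries stand in for the KV storage backend, and the function signatures are assumptions; only the idea of `to_process` versus recovered items and a `meta_update` mapping from input trace IDs to output trace IDs comes from this conversation.

```python
def split(batch: list[dict], results_kv: dict, meta_kv: dict) -> tuple[list[dict], list[dict]]:
    """Separate inputs that still need work from outputs recoverable from storage."""
    to_process, recovered = [], []
    for item in batch:
        output_ids = meta_kv.get(item["_trace_id"])
        if output_ids is None:
            to_process.append(item)  # never processed before
        else:
            recovered.extend(results_kv[out_id] for out_id in output_ids)
    return to_process, recovered


def store(results: list[dict], meta_update: dict, results_kv: dict, meta_kv: dict) -> None:
    """Persist outputs and the input-to-output trace mapping."""
    for res in results:
        results_kv[res["_trace_id"]] = res
    meta_kv.update(meta_update)


def process(batch: list[dict]) -> tuple[list[dict], dict]:
    """Stand-in operator: upper-cases text and records which input produced what."""
    results, meta_update = [], {}
    for item in batch:
        out = {"_trace_id": f"out-{item['_trace_id']}", "text": item["text"].upper()}
        results.append(out)
        meta_update.setdefault(item["_trace_id"], []).append(out["_trace_id"])
    return results, meta_update


results_kv, meta_kv = {}, {}
batch = [{"_trace_id": "a", "text": "x"}, {"_trace_id": "b", "text": "y"}]

# First run: nothing is stored yet, so every item is processed.
to_process, recovered = split(batch, results_kv, meta_kv)
results, meta_update = process(to_process)
store(results, meta_update, results_kv, meta_kv)

# Second run (for example after a restart): all outputs come back from storage.
to_process_2, recovered_2 = split(batch, results_kv, meta_kv)
```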

Changelog

examples/evaluate/evaluate_kg/kg_evaluation_config.yaml
- Updated input path from extract_demo.txt to jsonl_demo.jsonl.
- Simplified KG evaluation metrics from kg_structure, kg_accuracy, kg_consistency to just structure.
- Added target: kg parameter for evaluation.
examples/evaluate/evaluate_qa/qa_evaluation_config.yaml
- Changed default graph_backend from kuzu to networkx.
- Changed default kv_backend from rocksdb to json_kv.
- Simplified QA evaluation metrics from qa_length, qa_mtld to length, mtld.
- Added target: qa parameter for evaluation.
examples/evaluate/evaluate_triple/evaluate_triple.sh
- Added new shell script to run triple evaluation using triple_evaluation_config.yaml.
examples/evaluate/evaluate_triple/triple_evaluation_config.yaml
- Added new configuration file for triple evaluation.
- Configures networkx as graph_backend and json_kv as kv_backend.
- Defines a pipeline including read, chunk, build_kg, and evaluate steps.
- Specifies target: triple and metrics: accuracy for evaluation.
graphgen/bases/__init__.py
- Imported new base evaluator classes: BaseKGEvaluator, BaseQAEvaluator, BaseTripleEvaluator.
- Removed the generic BaseEvaluator import.
graphgen/bases/base_evaluator.py
- Refactored the single BaseEvaluator into three specific abstract base classes: BaseQAEvaluator, BaseKGEvaluator, and BaseTripleEvaluator.
- Updated evaluate methods to return dict[str, float] or dict[str, Any] instead of a single float.
- Made BaseQAEvaluator.evaluate and BaseTripleEvaluator.evaluate asynchronous.
graphgen/bases/base_generator.py
- Modified parse_response static method to return list[dict] instead of Any.
- Updated generate method to return list[dict] instead of dict[str, Any].
- Refactored format_generation_results to process a single dict result and removed the internal loop for flat results.
graphgen/bases/base_operator.py
- Added convert_to_serializable utility function for NumPy types.
- Modified __init__ to accept kv_backend and initialize self.kv_storage for task storage.
- Significantly refactored the __call__ method to include kv_storage.reload(), split logic for processing and recovering data, and store results.
- Introduced new methods: get_meta_forward, get_meta_inverse, get_trace_id, split, and store for managing task tracing and persistence.
- Updated process abstract method signature to return Tuple[Union[list, Iterable[list]], dict].
graphgen/bases/base_storage.py
- Added abstract methods update and delete to BaseKVStorage.
graphgen/bases/datatypes.py
- Added a static method QAPair.from_dict for convenient object creation from dictionaries.
graphgen/common/init_storage.py
- Implemented update and delete methods for JsonKVStorageWrapper and RocksDBKVStorageWrapper.
graphgen/models/__init__.py
- Removed ConsistencyEvaluator and RocksDBCache imports.
graphgen/models/evaluator/__init__.py
- Removed AccuracyEvaluator and ConsistencyEvaluator from KG evaluators.
- Added AccuracyEvaluator for triple evaluation.
graphgen/models/evaluator/kg/README.md
- Removed the entire README.md file for KG quality evaluation module.
graphgen/models/evaluator/kg/__init__.py
- Removed AccuracyEvaluator and ConsistencyEvaluator imports.
graphgen/models/evaluator/kg/accuracy_evaluator.py
- Removed the entire accuracy_evaluator.py file.
graphgen/models/evaluator/kg/consistency_evaluator.py
- Removed the entire consistency_evaluator.py file.
graphgen/models/evaluator/kg/structure_evaluator.py
- Updated StructureEvaluator to inherit from BaseKGEvaluator.
- Modified evaluate method to accept kg: BaseGraphStorage as an argument and removed self.graph_storage from __init__.
graphgen/models/evaluator/qa/length_evaluator.py
- Updated LengthEvaluator to inherit from BaseQAEvaluator.
- Made evaluate method asynchronous and return dict[str, float].
graphgen/models/evaluator/qa/mtld_evaluator.py
- Updated MTLDEvaluator to inherit from BaseQAEvaluator.
- Made evaluate method asynchronous and return dict[str, float].
graphgen/models/evaluator/qa/reward_evaluator.py
- Updated RewardEvaluator to inherit from BaseQAEvaluator.
- Made evaluate method asynchronous and return dict[str, float].
graphgen/models/evaluator/qa/uni_evaluator.py
- Updated UniEvaluator to inherit from BaseQAEvaluator.
- Made evaluate method asynchronous.
graphgen/models/evaluator/triple/__init__.py
- Added new __init__.py file for triple evaluators, importing AccuracyEvaluator.
graphgen/models/evaluator/triple/accuracy_evaluator.py
- Added new AccuracyEvaluator for triple evaluation, inheriting from BaseTripleEvaluator.
- Implements an asynchronous evaluate method that uses an LLM to assess entity and relation extraction accuracy from a given unit (chunk content, nodes, edges).
graphgen/models/extractor/schema_guided_extractor.py
- Modified extract method to accept a Chunk object directly instead of a dictionary.
- Simplified the return value of extract to a single dictionary of extracted info, removing the hash-based key.
- Removed the merge_extractions static method.
graphgen/models/generator/aggregated_generator.py
- Modified parse_response and generate methods to return list[dict].
graphgen/models/generator/atomic_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/cot_generator.py
- Modified generate method to return list[dict].
graphgen/models/generator/fill_in_blank_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/multi_answer_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/multi_choice_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/multi_hop_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/quiz_generator.py
- Modified build_prompt to be an instance method, not static.
- Modified parse_response to be an instance method, not static.
graphgen/models/generator/true_false_generator.py
- Modified parse_response method to return list[dict].
graphgen/models/generator/vqa_generator.py
- Modified parse_response and generate methods to return list[dict].
graphgen/models/reader/csv_reader.py
- Added include_paths=True to ray.data.read_csv to include file paths in the dataset.
graphgen/models/reader/json_reader.py
- Added path metadata to items when reading JSON files.
- Added include_paths=True to ray.data.read_json.
graphgen/models/reader/parquet_reader.py
- Added include_paths=True to ray.data.read_parquet.
graphgen/models/reader/rdf_reader.py
- Changed source_file key to path in extracted RDF documents.
graphgen/models/reader/txt_reader.py
- Changed include_paths to True in ray.data.read_binary_files.
- Added path metadata to text documents.
graphgen/models/storage/__init__.py
- Removed RocksDBCache import.
graphgen/models/storage/kv/json_storage.py
- Implemented update and delete methods for JsonKVStorage.
graphgen/models/storage/kv/rocksdb_storage.py
- Implemented update and delete methods for RocksDBKVStorage.
graphgen/models/storage/rocksdb_cache.py
- Removed the entire rocksdb_cache.py file, deprecating RocksDBCache.
graphgen/models/vis/community_visualizer.py
- Removed the entire community_visualizer.py file.
graphgen/operators/build_kg/build_kg_service.py
- Modified __init__ to accept kv_backend and set op_name to build_kg.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update.
- Updated build_kg method to return results and meta-updates for tracing.
graphgen/operators/build_kg/build_text_kg.py
- Added a filter results = [res for res in results if res] to remove empty extraction results.
graphgen/operators/chunk/chunk_service.py
- Modified __init__ to accept kv_backend and set op_name to chunk.
- Removed explicit chunk_storage initialization, relying on BaseOperator's kv_storage.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update for chunked documents.
graphgen/operators/evaluate/evaluate_kg.py
- Added new module evaluate_kg.py containing a function evaluate_kg to run multiple KG evaluators.
graphgen/operators/evaluate/evaluate_qa.py
- Added new module evaluate_qa.py containing transform_to_qa_format and evaluate_qa functions for QA evaluation.
graphgen/operators/evaluate/evaluate_service.py
- Modified __init__ to accept target (e.g., 'qa', 'kg', 'triple') and metrics directly.
- Refactored evaluator initialization into target-specific methods (_init_qa_evaluators, _init_kg_evaluators, _init_triple_evaluators).
- Updated process method to dispatch evaluation tasks to evaluate_qa, evaluate_kg, or evaluate_triple based on the target.
graphgen/operators/evaluate/evaluate_triple.py
- Added new module evaluate_triple.py containing a function evaluate_triple to run triple evaluators.
graphgen/operators/extract/extract_service.py
- Modified __init__ to accept kv_backend and set op_name to extract.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update for extracted information.
graphgen/operators/generate/generate_service.py
- Modified __init__ to accept kv_backend and set op_name to generate, initializing generate_storage.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update for generated QA pairs.
graphgen/operators/judge/judge_service.py
- Modified __init__ to accept kv_backend and set op_name to judge.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update for judging results.
graphgen/operators/partition/partition_service.py
- Modified __init__ to accept kv_backend and set op_name to partition.
- Removed chunk_storage initialization.
- Refactored process method to return Tuple[Iterable[list], dict] using a generator for partitioned batches.
- Removed commented-out code related to _attach_additional_data_to_node.
graphgen/operators/quiz/quiz_service.py
- Modified __init__ to accept kv_backend and set op_name to quiz.
- Removed quiz_storage initialization.
- Refactored process method to return Tuple[list, dict] and handle _trace_id and meta_update for quiz results.
graphgen/operators/read/parallel_file_scanner.py
- Modified __init__ to accept input_path_cache: BaseKVStorage instead of a cache directory.
- Updated _cache_result to use upsert on the BaseKVStorage instance.
- Modified close method to call index_done_callback and del self.cache.
graphgen/operators/read/read.py
- Modified read function to accept kv_backend parameter.
- Initialized input_path_cache and read_storage using the provided kv_backend.
- Added a map_batches step to add _trace_id to records and upsert them to read_storage.
graphgen/templates/__init__.py
- Removed CONSISTENCY_EVALUATION_PROMPT import.
graphgen/templates/evaluation/kg/__init__.py
- Removed CONSISTENCY_EVALUATION_PROMPT import.
graphgen/templates/evaluation/kg/consistency_evaluation.py
- Removed the entire consistency_evaluation.py file.
graphgen/utils/__init__.py
- Removed compute_mm_hash import.
graphgen/utils/format.py
- Enhanced clean_str function to also strip single and double quotes from the result.
graphgen/utils/hash.py
- Removed the compute_mm_hash function.
graphgen/utils/run_concurrent.py
- Refactored run_concurrent to use a worker function that returns the task index, result, and any error.
- Improved error handling and result aggregation to preserve the order of results and log exceptions without stopping the entire concurrent process.
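
The order-preserving behavior described for `run_concurrent` can be approximated with plain `asyncio`. This is a sketch under assumptions, not the project's code: the name `run_concurrent_sketch`, its signature, and the logging format are invented here. It only shows the pattern of a worker returning `(index, result, error)` so that failures become `None` at the right position.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_concurrent_sketch(
    coro_fn: Callable[[Any], Awaitable[Any]], items: list
) -> list:
    async def worker(idx: int, item: Any):
        # Catch errors here so one failing task cannot cancel the others.
        try:
            return idx, await coro_fn(item), None
        except Exception as exc:
            return idx, None, exc

    results = [None] * len(items)  # failed slots stay None
    tasks = [asyncio.create_task(worker(i, it)) for i, it in enumerate(items)]
    for finished in asyncio.as_completed(tasks):
        idx, result, error = await finished
        if error is not None:
            logger.error("task %d failed: %r", idx, error)
        else:
            results[idx] = result  # index restores input order
    return results


async def double(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    await asyncio.sleep(0.01 * (3 - x))  # later items finish first
    return x * 2


output = asyncio.run(run_concurrent_sketch(double, [1, -1, 2]))
```

Callers of a function with this contract need to filter out `None` entries before treating each element as a result dictionary.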

Activity

gemini-code-assist[bot] provided an initial summary of the changes in the pull request.
ChenZiHong-Gavin requested a summary of the pull request.
github-code-quality[bot] commented multiple times on graphgen/operators/partition/partition_service.py, recommending the removal of commented-out code, specifically a commented line and an entire commented-out function _attach_additional_data_to_node.
gemini-code-assist[bot] provided a critical suggestion in graphgen/models/extractor/schema_guided_extractor.py to change chunk.get("_chunk_id", "") to chunk.id, as the chunk parameter is now a Chunk object rather than a dictionary.
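
The `AttributeError` noted in the last item arises because attribute-style objects do not offer a dictionary's `.get()`. A minimal reproduction, using a hypothetical `Chunk` dataclass as a stand-in (the real class in graphgen may carry more fields):

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the project's Chunk type."""
    id: str
    content: str


chunk = Chunk(id="chunk-1", content="Some text")

# Dictionary-style access fails because Chunk defines no .get() method.
try:
    chunk.get("_chunk_id", "")
except AttributeError as exc:
    error_message = str(exc)

# Attribute access is the fix suggested in the review.
chunk_id = chunk.id
```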

ChenZiHong-Gavin · 2026-01-29T12:44:27Z

/gemini review

gemini-code-assist

Code Review

This is a significant and well-executed refactoring that introduces a robust tracing and task storage system, greatly improving the framework's resilience and modularity. The core architectural change to BaseOperator to handle caching and recovery is a major enhancement. The refactoring of the evaluation system into more specific targets (qa, kg, triple) and the simplification of the data flow from generators are also excellent improvements. Overall, these changes provide a solid foundation for building more complex and reliable data processing pipelines. I've found one potential issue that needs addressing.
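
As a rough picture of the target-based evaluation split praised above, the sketch below dispatches on a `target` string and awaits evaluators that return score dictionaries. All names ending in `Sketch`, the `HANDLERS` table, and the word-count metric are assumptions for illustration; they are not the classes or functions added in this PR.

```python
import asyncio
from abc import ABC, abstractmethod


class QAEvaluatorSketch(ABC):
    """Stand-in for a QA evaluator base class with an async evaluate()."""

    @abstractmethod
    async def evaluate(self, pair: dict) -> dict[str, float]:
        """Return one or more named scores for a single QA pair."""


class LengthSketch(QAEvaluatorSketch):
    async def evaluate(self, pair: dict) -> dict[str, float]:
        # Whitespace token count of the answer; the real metric may differ.
        return {"length": float(len(pair["answer"].split()))}


async def evaluate_qa_sketch(pairs: list[dict], evaluators: list) -> list[dict]:
    scored = []
    for pair in pairs:
        scores: dict[str, float] = {}
        for evaluator in evaluators:
            scores.update(await evaluator.evaluate(pair))
        scored.append({**pair, **scores})
    return scored


# Only "qa" is wired up here; "kg" and "triple" would register their own handlers.
HANDLERS = {"qa": evaluate_qa_sketch}


async def dispatch(target: str, items: list[dict], evaluators: list) -> list[dict]:
    if target not in HANDLERS:
        raise ValueError(f"unsupported evaluation target: {target!r}")
    return await HANDLERS[target](items, evaluators)


pairs = [{"question": "What is a triple?", "answer": "subject, relation, object"}]
scored = asyncio.run(dispatch("qa", pairs, [LengthSketch()]))
```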

gemini-code-assist · 2026-01-29T12:59:20Z

graphgen/operators/judge/judge_service.py

+            meta_update.setdefault(input_trace_id, []).append(result["_trace_id"])
        self.graph_storage.index_done_callback()
+
+        return results, meta_update


The process method returns the results list directly from run_concurrent. However, results can contain None for tasks that failed, as run_concurrent now preserves the list length and inserts None for errors. When this list is passed to the store method in the BaseOperator, it will cause a TypeError when trying to access res["_trace_id"] on a None value. The to_store list is correctly filtered to contain only valid results and should be returned instead.

Suggested change

-        return results, meta_update
+        return to_store, meta_update
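
The failure mode behind this suggestion can be reproduced in isolation. The `store` function below is a simplified stand-in (a plain dictionary replaces the KV storage, and the `score` field is made up); it only demonstrates why a list containing `None` must be filtered before each element is indexed by `"_trace_id"`.

```python
def store(results: list, storage: dict) -> None:
    """Simplified stand-in for a store step keyed by trace ID."""
    for res in results:
        storage[res["_trace_id"]] = res  # raises TypeError when res is None


# A concurrent run that preserves list length leaves None where a task failed.
results = [{"_trace_id": "j1", "score": 0.9}, None, {"_trace_id": "j3", "score": 0.4}]

# Keep only the successful results, as the review suggests.
to_store = [res for res in results if res is not None]

storage: dict = {}
store(to_store, storage)

# Passing the unfiltered list reproduces the reported error.
try:
    store(results, {})
except TypeError as exc:
    failure = str(exc)
```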

ChenZiHong-Gavin and others added 9 commits January 27, 2026 18:58

refactor: use kv_storage for cache of ParallelFileScanner

9da4e9b

fix: limit before adding new data

72099f5

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…

03366bb

…to feat/trace

wip: trace generated data

2bb8403

Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…

2166fc2

…to feat/trace

feat: add read_storage

492c89f

fix: delete &quot in kg

857a16c

wip: add checkpoint

b735443

wip: move storage logic to baseOperator

af530de

github-actions bot added the core label Jan 29, 2026

github-code-quality bot found potential problems Jan 29, 2026

graphgen/operators/partition/partition_service.py (fixed)

graphgen/operators/partition/partition_service.py (fixed)

graphgen/operators/partition/partition_service.py (fixed)

gemini-code-assist bot reviewed Jan 29, 2026

graphgen/models/extractor/schema_guided_extractor.py (outdated)

refactor: refactor evaluators

7c9240a

github-actions bot added the documentation and examples labels Jan 29, 2026

test: add e2e_test for triple_evaluation

e5d404e

github-actions bot added the tests label Jan 29, 2026

ChenZiHong-Gavin added 4 commits January 29, 2026 18:39

fix: fix lint problem

cfea2b3

fix: fix lint problem

871bb08

fix: fix lint errors

547ba1f

fix: fix lint problems

90c0a59

github-actions bot added the baselines label Jan 29, 2026

fix: fix partition service

10ebc37

ChenZiHong-Gavin mentioned this pull request Jan 29, 2026

Add checkpoint support to resume generation #151

Closed

ChenZiHong-Gavin merged commit 9bbe7ee into main Jan 29, 2026
6 checks passed

ChenZiHong-Gavin deleted the feat/trace branch January 29, 2026 12:51

gemini-code-assist bot reviewed Jan 29, 2026


ChenZiHong-Gavin mentioned this pull request Jan 29, 2026

[Summary] GraphGen Roadmap #49

Open

30 tasks
