3 changes: 2 additions & 1 deletion README.md
@@ -61,13 +61,14 @@ Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture co
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to finetune your LLMs.

## 📌 Latest Updates
- **2026.02.04**: We now support HuggingFace Datasets as an input data source for data generation.
- **2026.01.15**: **LLM benchmark synthesis** now supports single-choice, multiple-choice, fill-in-the-blank, and true-or-false questions, ideal for education 🌟🌟
- **2025.12.26**: Added knowledge graph evaluation metrics covering accuracy (entity/relation extraction), consistency (conflict detection), and structural robustness (noise, connectivity, degree distribution)
- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend.

<details>
<summary>History</summary>

- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend.
- **2025.12.16**: Added [vllm](https://github.com/vllm-project/vllm) as a local inference backend.
- **2025.12.16**: Refactored the data generation pipeline using [ray](https://github.com/ray-project/ray) to improve the efficiency of distributed execution and resource management.
- **2025.12.1**: Added search support for [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
3 changes: 2 additions & 1 deletion README_zh.md
@@ -62,14 +62,15 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to fine-tune your LLMs.

## 📌 Latest Updates
- **2026.02.04**: Support reading HuggingFace datasets directly as the input for data generation
- **2026.01.15**: Synthesize domain-specific benchmark data (single-choice, multiple-choice, fill-in-the-blank, and true-or-false questions) 🌟🌟
- **2025.12.26**: Introduced knowledge graph evaluation metrics, covering accuracy (entity/relation extraction quality), consistency (conflict detection), and structural robustness (noise ratio, connectivity, degree distribution)
- **2025.12.16**: Support [rocksdb](https://github.com/facebook/rocksdb) as the key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as the graph database backend


<details>
<summary>History</summary>

- **2025.12.16**: Support [rocksdb](https://github.com/facebook/rocksdb) as the key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as the graph database backend.
- **2025.12.16**: Support [vllm](https://github.com/vllm-project/vllm) as the local inference backend.
- **2025.12.16**: Refactored the data generation pipeline with [ray](https://github.com/ray-project/ray), improving the efficiency of distributed execution and resource management.
- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
@@ -0,0 +1,2 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_aggregated_qa/huggingface_config.yaml
83 changes: 83 additions & 0 deletions examples/generate/generate_aggregated_qa/huggingface_config.yaml
@@ -0,0 +1,83 @@
global_params:
  working_dir: cache
  graph_backend: networkx # graph database backend, support: kuzu, networkx
  kv_backend: json_kv # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_hf_dataset # Read from Hugging Face Hub
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - huggingface://wikitext:wikitext-103-v1:train # Format: huggingface://dataset_name:subset:split
      # Optional parameters for HuggingFaceReader:
      text_column: text # Column name containing text content (default: content)
      # cache_dir: /path/to/cache # Optional: directory to cache downloaded datasets
      # trust_remote_code: false # Optional: whether to trust remote code in datasets

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - read_hf_dataset
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: quiz
    op_name: quiz
    type: map_batch
    dependencies:
      - build_kg
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      quiz_samples: 2

  - id: judge
    op_name: judge
    type: map_batch
    dependencies:
      - quiz
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - judge
    params:
      method: ece
      method_params:
        max_units_per_community: 20
        min_units_per_community: 5
        max_tokens_per_community: 10240
        unit_sampling: max_loss

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
      save_output: true
    params:
      method: aggregated
      data_format: ChatML
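
For orientation, the `huggingface://wikitext:wikitext-103-v1:train` entry above corresponds roughly to the following call into the `datasets` library; this is a minimal sketch of what the URI resolves to, with the column name taken from the `text_column: text` setting above.

    from datasets import load_dataset

    # Roughly what huggingface://wikitext:wikitext-103-v1:train resolves to:
    # dataset name, config/subset, and split are parsed from the URI.
    ds = load_dataset("wikitext", "wikitext-103-v1", split="train")
    print(ds[0]["text"])  # the "text" column referenced by text_column above
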
2 changes: 2 additions & 0 deletions graphgen/models/__init__.py
@@ -33,6 +33,7 @@
)
from .reader import (
    CSVReader,
    HuggingFaceReader,
    JSONReader,
    ParquetReader,
    PDFReader,
@@ -92,6 +93,7 @@
"PickleReader": ".reader",
"RDFReader": ".reader",
"TXTReader": ".reader",
"HuggingFaceReader": ".reader",
# Searcher
"NCBISearch": ".searcher.db.ncbi_searcher",
"RNACentralSearch": ".searcher.db.rnacentral_searcher",
1 change: 1 addition & 0 deletions graphgen/models/reader/__init__.py
@@ -1,4 +1,5 @@
from .csv_reader import CSVReader
from .huggingface_reader import HuggingFaceReader
from .json_reader import JSONReader
from .parquet_reader import ParquetReader
from .pdf_reader import PDFReader
201 changes: 201 additions & 0 deletions graphgen/models/reader/huggingface_reader.py
@@ -0,0 +1,201 @@
"""
Hugging Face Datasets Reader
This module provides a reader for accessing datasets from Hugging Face Hub.
"""

from typing import TYPE_CHECKING, List, Optional, Union

from graphgen.bases.base_reader import BaseReader

if TYPE_CHECKING:
import numpy as np
import ray
from ray.data import Dataset


class HuggingFaceReader(BaseReader):
"""
Reader for Hugging Face Datasets.

Supports loading datasets from the Hugging Face Hub.
Can specify a dataset by name and optional subset/split.

Columns:
- type: The type of the document (e.g., "text", "image", etc.)
- if type is "text", "content" column must be present (or specify via text_column).

Example:
reader = HuggingFaceReader(text_column="text")
ds = reader.read("wikitext")
# or with split and subset
ds = reader.read("wikitext:wikitext-103-v1:train")
"""

def __init__(
self,
text_column: str = "content",
modalities: Optional[list] = None,
cache_dir: Optional[str] = None,
trust_remote_code: bool = False,
Reviewer comment (security, high severity):
The HuggingFaceReader class introduces the trust_remote_code parameter which is passed directly to the Hugging Face datasets.load_dataset function. When set to True, this allows the execution of arbitrary Python code contained within the dataset repository (e.g., in the loading script). Since this parameter is exposed to the end-user via the configuration file (through reader_kwargs), it creates a significant risk of Remote Code Execution (RCE) if an attacker can provide or influence the configuration. While the default is False, exposing this dangerous functionality to the configuration without adequate warnings or restrictions is a security concern. Consider removing this parameter from the configuration or implementing a strict allow-list for trusted datasets.
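
One way to read the reviewer's allow-list suggestion, as a hypothetical sketch (the TRUSTED_DATASETS set and helper name are illustrative, not part of this PR):

    # Hypothetical allow-list guard; the set below would be maintainer-curated.
    TRUSTED_DATASETS = {"wikitext", "squad"}

    def resolve_trust_remote_code(dataset_name: str, requested: bool) -> bool:
        # Only permit remote-code execution for datasets a maintainer has vetted.
        if requested and dataset_name not in TRUSTED_DATASETS:
            raise ValueError(
                f"trust_remote_code=True is not allowed for '{dataset_name}'"
            )
        return requested
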

    ):
"""
Initialize HuggingFaceReader.

:param text_column: Column name containing text content
:param modalities: List of supported modalities
:param cache_dir: Directory to cache downloaded datasets
:param trust_remote_code: Whether to trust remote code in datasets
"""
super().__init__(text_column=text_column, modalities=modalities)
self.cache_dir = cache_dir
self.trust_remote_code = trust_remote_code

def read(
self,
input_path: Union[str, List[str]],
split: Optional[str] = None,
subset: Optional[str] = None,
streaming: bool = False,
limit: Optional[int] = None,
) -> "Dataset":
"""
Read dataset from Hugging Face Hub.

:param input_path: Dataset identifier(s) from Hugging Face Hub
Format: "dataset_name" or "dataset_name:subset:split"
Example: "wikitext" or "wikitext:wikitext-103-v1:train"
:param split: Specific split to load (overrides split in path)
:param subset: Specific subset/configuration to load (overrides subset in path)
:param streaming: Whether to stream the dataset instead of downloading
:param limit: Maximum number of samples to load
:return: Ray Dataset containing the data
"""
try:
import datasets as hf_datasets
except ImportError as exc:
raise ImportError(
"The 'datasets' package is required to use HuggingFaceReader. "
"Please install it with: pip install datasets"
) from exc

if isinstance(input_path, list):
# Handle multiple datasets
all_dss = []
for path in input_path:
ds = self._load_single_dataset(
path,
split=split,
subset=subset,
streaming=streaming,
limit=limit,
hf_datasets=hf_datasets,
)
all_dss.append(ds)

if len(all_dss) == 1:
combined_ds = all_dss[0]
else:
combined_ds = all_dss[0].union(*all_dss[1:])
Comment on lines +95 to +98
Reviewer comment (high severity):

If input_path is an empty list, all_dss will also be empty. This will cause an IndexError on line 97 when trying to access all_dss[0]. You should handle the case of an empty list of datasets to avoid this crash.

            if not all_dss:
                import ray

                return ray.data.from_items([])

            if len(all_dss) == 1:
                combined_ds = all_dss[0]
            else:
                combined_ds = all_dss[0].union(*all_dss[1:])

Comment on lines +81 to +98
Reviewer comment (high severity):

If input_path is an empty list, all_dss will also be empty. This will cause an IndexError on line 98 (or 96) when trying to access all_dss[0]. You should handle the case of an empty input_path list to prevent a crash.

        if isinstance(input_path, list):
            if not input_path:
                import ray

                return ray.data.from_items([])

            # Handle multiple datasets
            all_dss = []
            for path in input_path:
                ds = self._load_single_dataset(
                    path,
                    split=split,
                    subset=subset,
                    streaming=streaming,
                    limit=limit,
                    hf_datasets=hf_datasets,
                )
                all_dss.append(ds)

            if len(all_dss) == 1:
                combined_ds = all_dss[0]
            else:
                combined_ds = all_dss[0].union(*all_dss[1:])

        else:
            combined_ds = self._load_single_dataset(
                input_path,
                split=split,
                subset=subset,
                streaming=streaming,
                limit=limit,
                hf_datasets=hf_datasets,
            )

        # Validate and filter
        combined_ds = combined_ds.map_batches(
            self._validate_batch, batch_format="pandas"
        )
        combined_ds = combined_ds.filter(self._should_keep_item)

        return combined_ds

    def _load_single_dataset(
        self,
        dataset_path: str,
        split: Optional[str] = None,
        subset: Optional[str] = None,
        streaming: bool = False,
        limit: Optional[int] = None,
        hf_datasets=None,
    ) -> "Dataset":
        """
        Load a single dataset from Hugging Face Hub.

        :param dataset_path: Dataset path, can include subset and split
        :param split: Override split
        :param subset: Override subset
        :param streaming: Whether to stream
        :param limit: Max samples
        :param hf_datasets: Imported datasets module
        :return: Ray Dataset
        """
        import numpy as np
        import ray

        # Parse dataset path format: "dataset_name:subset:split"
        parts = dataset_path.split(":")
        dataset_name = parts[0]
        parsed_subset = parts[1] if len(parts) > 1 else None
        parsed_split = parts[2] if len(parts) > 2 else None

        # Override with explicit parameters
        final_subset = subset or parsed_subset
        final_split = split or parsed_split or "train"

        # Load dataset from Hugging Face
        load_kwargs = {
            "cache_dir": self.cache_dir,
            "trust_remote_code": self.trust_remote_code,
            "streaming": streaming,
        }

        if final_subset:
            load_kwargs["name"] = final_subset

        hf_dataset = hf_datasets.load_dataset(
            dataset_name, split=final_split, **load_kwargs
        )

        # Apply limit before converting to Ray dataset for memory efficiency
        if limit:
            if streaming:
                hf_dataset = hf_dataset.take(limit)
            else:
                hf_dataset = hf_dataset.select(range(limit))

        # Convert to Ray dataset using lazy evaluation
        ray_ds = ray.data.from_huggingface(hf_dataset)

        # Define batch processing function for lazy evaluation
        def _process_batch(batch: dict[str, "np.ndarray"]) -> dict[str, "np.ndarray"]:
            """
            Process a batch of data to add type field and rename text column.

            :param batch: A dictionary with column names as keys and numpy arrays
            :return: Processed batch dictionary with numpy arrays
            """
            if not batch:
                return {}

            # Get the number of rows in the batch
            num_rows = len(next(iter(batch.values())))

            # Add type field if not present
            if "type" not in batch:
                batch["type"] = np.array(["text"] * num_rows)

            # Rename text_column to 'content' if different
            if self.text_column != "content" and self.text_column in batch:
                batch["content"] = batch.pop(self.text_column)

            return batch

        # Apply post-processing using map_batches for distributed lazy evaluation
        ray_ds = ray_ds.map_batches(_process_batch)

        return ray_ds
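
To round out the example config above, a minimal usage sketch of the new reader (not part of the diff; it assumes a running Ray context and the wikitext dataset used in the example config, and follows the class's documented API):

    from graphgen.models import HuggingFaceReader

    # Read a capped sample of wikitext; the reader adds a "type" column and renames
    # the "text" column to "content" in its batch post-processing.
    reader = HuggingFaceReader(text_column="text")
    ds = reader.read("wikitext:wikitext-103-v1:train", limit=100)
    print(ds.take(2))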