3 changes: 2 additions & 1 deletion README.md
@@ -61,13 +61,14 @@ Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture co
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to finetune your LLMs.

## 📌 Latest Updates
- **2026.02.04**: We now support HuggingFace Datasets as an input data source for data generation.
- **2026.01.15**: **LLM benchmark synthesis** now supports single-choice, multiple-choice, fill-in-the-blank, and true-or-false questions, ideal for education 🌟🌟
- **2025.12.26**: Added knowledge graph evaluation metrics covering accuracy (entity/relation extraction), consistency (conflict detection), and structural robustness (noise, connectivity, degree distribution)
- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend.

<details>
<summary>History</summary>

- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend.
- **2025.12.16**: Added [vllm](https://github.com/vllm-project/vllm) as a local inference backend.
- **2025.12.16**: Refactored the data generation pipeline using [ray](https://github.com/ray-project/ray) to improve the efficiency of distributed execution and resource management.
- **2025.12.1**: Added search support for [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
3 changes: 2 additions & 1 deletion README_zh.md
@@ -62,14 +62,15 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to fine-tune your LLMs.

## 📌 Latest Updates
- **2026.02.04**: Support reading HuggingFace datasets directly as the input for data generation
- **2026.01.15**: Synthesize domain-specific benchmark data (single-choice, multiple-choice, fill-in-the-blank, and true-or-false questions) 🌟🌟
- **2025.12.26**: Introduced knowledge graph evaluation metrics, covering accuracy (entity/relation extraction quality), consistency (conflict detection), and structural robustness (noise ratio, connectivity, degree distribution)
- **2025.12.16**: Support [rocksdb](https://github.com/facebook/rocksdb) as the key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as the graph database backend


<details>
<summary>History</summary>

- **2025.12.16**: Support [rocksdb](https://github.com/facebook/rocksdb) as the key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as the graph database backend.
- **2025.12.16**: Support [vllm](https://github.com/vllm-project/vllm) as the local inference backend.
- **2025.12.16**: Refactored the data generation pipeline with [ray](https://github.com/ray-project/ray), improving the efficiency of distributed execution and resource management.
- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
@@ -0,0 +1,2 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_aggregated_qa/huggingface_config.yaml
83 changes: 83 additions & 0 deletions examples/generate/generate_aggregated_qa/huggingface_config.yaml
@@ -0,0 +1,83 @@
global_params:
  working_dir: cache
  graph_backend: networkx # graph database backend, support: kuzu, networkx
  kv_backend: json_kv # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_hf_dataset # Read from Hugging Face Hub
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - huggingface://wikitext:wikitext-103-v1:train # Format: huggingface://dataset_name:subset:split
      # Optional parameters for HuggingFaceReader:
      text_column: text # Column name containing text content (default: content)
      # cache_dir: /path/to/cache # Optional: directory to cache downloaded datasets
      # trust_remote_code: false # Optional: whether to trust remote code in datasets

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - read_hf_dataset
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: quiz
    op_name: quiz
    type: map_batch
    dependencies:
      - build_kg
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      quiz_samples: 2

  - id: judge
    op_name: judge
    type: map_batch
    dependencies:
      - quiz
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - judge
    params:
      method: ece
      method_params:
        max_units_per_community: 20
        min_units_per_community: 5
        max_tokens_per_community: 10240
        unit_sampling: max_loss

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
      save_output: true
    params:
      method: aggregated
      data_format: ChatML
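
For orientation, the `huggingface://wikitext:wikitext-103-v1:train` entry above corresponds roughly to the following call into the `datasets` library; this is a minimal sketch of what the URI resolves to, with the column name taken from the `text_column: text` setting above.

    from datasets import load_dataset

    # Roughly what huggingface://wikitext:wikitext-103-v1:train resolves to:
    # dataset name, config/subset, and split are parsed from the URI.
    ds = load_dataset("wikitext", "wikitext-103-v1", split="train")
    print(ds[0]["text"])  # the "text" column referenced by text_column above
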
2 changes: 2 additions & 0 deletions graphgen/models/__init__.py
@@ -33,6 +33,7 @@
)
from .reader import (
    CSVReader,
    HuggingFaceReader,
    JSONReader,
    ParquetReader,
    PDFReader,
@@ -92,6 +93,7 @@
"PickleReader": ".reader",
"RDFReader": ".reader",
"TXTReader": ".reader",
"HuggingFaceReader": ".reader",
# Searcher
"NCBISearch": ".searcher.db.ncbi_searcher",
"RNACentralSearch": ".searcher.db.rnacentral_searcher",
1 change: 1 addition & 0 deletions graphgen/models/reader/__init__.py
@@ -1,4 +1,5 @@
from .csv_reader import CSVReader
from .huggingface_reader import HuggingFaceReader
from .json_reader import JSONReader
from .parquet_reader import ParquetReader
from .pdf_reader import PDFReader
201 changes: 201 additions & 0 deletions graphgen/models/reader/huggingface_reader.py
@@ -0,0 +1,201 @@
"""
Hugging Face Datasets Reader
This module provides a reader for accessing datasets from Hugging Face Hub.
"""

from typing import TYPE_CHECKING, List, Optional, Union

from graphgen.bases.base_reader import BaseReader

if TYPE_CHECKING:
import numpy as np
import ray
from ray.data import Dataset


class HuggingFaceReader(BaseReader):
"""
Reader for Hugging Face Datasets.

Supports loading datasets from the Hugging Face Hub.
Can specify a dataset by name and optional subset/split.

Columns:
- type: The type of the document (e.g., "text", "image", etc.)
- if type is "text", "content" column must be present (or specify via text_column).

Example:
reader = HuggingFaceReader(text_column="text")
ds = reader.read("wikitext")
# or with split and subset
ds = reader.read("wikitext:wikitext-103-v1:train")
"""

def __init__(
self,
text_column: str = "content",
modalities: Optional[list] = None,
cache_dir: Optional[str] = None,
trust_remote_code: bool = False,
Reviewer comment (security, high severity):
The HuggingFaceReader class introduces the trust_remote_code parameter which is passed directly to the Hugging Face datasets.load_dataset function. When set to True, this allows the execution of arbitrary Python code contained within the dataset repository (e.g., in the loading script). Since this parameter is exposed to the end-user via the configuration file (through reader_kwargs), it creates a significant risk of Remote Code Execution (RCE) if an attacker can provide or influence the configuration. While the default is False, exposing this dangerous functionality to the configuration without adequate warnings or restrictions is a security concern. Consider removing this parameter from the configuration or implementing a strict allow-list for trusted datasets.
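
One way to read the reviewer's allow-list suggestion, as a hypothetical sketch (the TRUSTED_DATASETS set and helper name are illustrative, not part of this PR):

    # Hypothetical allow-list guard; the set below would be maintainer-curated.
    TRUSTED_DATASETS = {"wikitext", "squad"}

    def resolve_trust_remote_code(dataset_name: str, requested: bool) -> bool:
        # Only permit remote-code execution for datasets a maintainer has vetted.
        if requested and dataset_name not in TRUSTED_DATASETS:
            raise ValueError(
                f"trust_remote_code=True is not allowed for '{dataset_name}'"
            )
        return requested
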

    ):
"""
Initialize HuggingFaceReader.

:param text_column: Column name containing text content
:param modalities: List of supported modalities
:param cache_dir: Directory to cache downloaded datasets
:param trust_remote_code: Whether to trust remote code in datasets
"""
super().__init__(text_column=text_column, modalities=modalities)
self.cache_dir = cache_dir
self.trust_remote_code = trust_remote_code

def read(
self,
input_path: Union[str, List[str]],
split: Optional[str] = None,
subset: Optional[str] = None,
streaming: bool = False,
limit: Optional[int] = None,
) -> "Dataset":
"""
Read dataset from Hugging Face Hub.

:param input_path: Dataset identifier(s) from Hugging Face Hub
Format: "dataset_name" or "dataset_name:subset:split"
Example: "wikitext" or "wikitext:wikitext-103-v1:train"
:param split: Specific split to load (overrides split in path)
:param subset: Specific subset/configuration to load (overrides subset in path)
:param streaming: Whether to stream the dataset instead of downloading
:param limit: Maximum number of samples to load
:return: Ray Dataset containing the data
"""
try:
import datasets as hf_datasets
except ImportError as exc:
raise ImportError(
"The 'datasets' package is required to use HuggingFaceReader. "
"Please install it with: pip install datasets"
) from exc

if isinstance(input_path, list):
# Handle multiple datasets
all_dss = []
for path in input_path:
ds = self._load_single_dataset(
path,
split=split,
subset=subset,
streaming=streaming,
limit=limit,
hf_datasets=hf_datasets,
)
all_dss.append(ds)

if len(all_dss) == 1:
combined_ds = all_dss[0]
else:
combined_ds = all_dss[0].union(*all_dss[1:])
Comment on lines +95 to +98
Reviewer comment (high severity):

If input_path is an empty list, all_dss will also be empty. This will cause an IndexError on line 97 when trying to access all_dss[0]. You should handle the case of an empty list of datasets to avoid this crash.

            if not all_dss:
                import ray

                return ray.data.from_items([])

            if len(all_dss) == 1:
                combined_ds = all_dss[0]
            else:
                combined_ds = all_dss[0].union(*all_dss[1:])

Comment on lines +81 to +98
Reviewer comment (high severity):

If input_path is an empty list, all_dss will also be empty. This will cause an IndexError on line 98 (or 96) when trying to access all_dss[0]. You should handle the case of an empty input_path list to prevent a crash.

        if isinstance(input_path, list):
            if not input_path:
                import ray

                return ray.data.from_items([])

            # Handle multiple datasets
            all_dss = []
            for path in input_path:
                ds = self._load_single_dataset(
                    path,
                    split=split,
                    subset=subset,
                    streaming=streaming,
                    limit=limit,
                    hf_datasets=hf_datasets,
                )
                all_dss.append(ds)

            if len(all_dss) == 1:
                combined_ds = all_dss[0]
            else:
                combined_ds = all_dss[0].union(*all_dss[1:])

        else:
            combined_ds = self._load_single_dataset(
                input_path,
                split=split,
                subset=subset,
                streaming=streaming,
                limit=limit,
                hf_datasets=hf_datasets,
            )

        # Validate and filter
        combined_ds = combined_ds.map_batches(
            self._validate_batch, batch_format="pandas"
        )
        combined_ds = combined_ds.filter(self._should_keep_item)

        return combined_ds

    def _load_single_dataset(
        self,
        dataset_path: str,
        split: Optional[str] = None,
        subset: Optional[str] = None,
        streaming: bool = False,
        limit: Optional[int] = None,
        hf_datasets=None,
    ) -> "Dataset":
        """
        Load a single dataset from Hugging Face Hub.

        :param dataset_path: Dataset path, can include subset and split
        :param split: Override split
        :param subset: Override subset
        :param streaming: Whether to stream
        :param limit: Max samples
        :param hf_datasets: Imported datasets module
        :return: Ray Dataset
        """
        import numpy as np
        import ray

        # Parse dataset path format: "dataset_name:subset:split"
        parts = dataset_path.split(":")
        dataset_name = parts[0]
        parsed_subset = parts[1] if len(parts) > 1 else None
        parsed_split = parts[2] if len(parts) > 2 else None

        # Override with explicit parameters
        final_subset = subset or parsed_subset
        final_split = split or parsed_split or "train"

        # Load dataset from Hugging Face
        load_kwargs = {
            "cache_dir": self.cache_dir,
            "trust_remote_code": self.trust_remote_code,
            "streaming": streaming,
        }

        if final_subset:
            load_kwargs["name"] = final_subset

        hf_dataset = hf_datasets.load_dataset(
            dataset_name, split=final_split, **load_kwargs
        )

        # Apply limit before converting to Ray dataset for memory efficiency
        if limit:
            if streaming:
                hf_dataset = hf_dataset.take(limit)
            else:
                hf_dataset = hf_dataset.select(range(limit))

        # Convert to Ray dataset using lazy evaluation
        ray_ds = ray.data.from_huggingface(hf_dataset)

        # Define batch processing function for lazy evaluation
        def _process_batch(batch: dict[str, "np.ndarray"]) -> dict[str, "np.ndarray"]:
            """
            Process a batch of data to add type field and rename text column.

            :param batch: A dictionary with column names as keys and numpy arrays
            :return: Processed batch dictionary with numpy arrays
            """
            if not batch:
                return {}

            # Get the number of rows in the batch
            num_rows = len(next(iter(batch.values())))

            # Add type field if not present
            if "type" not in batch:
                batch["type"] = np.array(["text"] * num_rows)

            # Rename text_column to 'content' if different
            if self.text_column != "content" and self.text_column in batch:
                batch["content"] = batch.pop(self.text_column)

            return batch

        # Apply post-processing using map_batches for distributed lazy evaluation
        ray_ds = ray_ds.map_batches(_process_batch)

        return ray_ds
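
To round out the example config above, a minimal usage sketch of the new reader (not part of the diff; it assumes a running Ray context and the wikitext dataset used in the example config, and follows the class's documented API):

    from graphgen.models import HuggingFaceReader

    # Read a capped sample of wikitext; the reader adds a "type" column and renames
    # the "text" column to "content" in its batch post-processing.
    reader = HuggingFaceReader(text_column="text")
    ds = reader.read("wikitext:wikitext-103-v1:train", limit=100)
    print(ds.take(2))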