feat: support synthesizing masked fill_in_blank QA pairs by superfarther · Pull Request #173 · InternScience/GraphGen

superfarther · 2026-02-04T10:45:33Z

This PR adds support for synthesizing masked fill_in_blank QA pairs.

gemini-code-assist · 2026-02-04T10:45:53Z

Summary of Changes

Hello @superfarther, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's question generation capabilities by introducing a novel method for creating masked fill-in-blank QA pairs. It integrates a new generator that intelligently rephrases graph data and masks key entities, alongside a dedicated partitioner for extracting graph triples. The inclusion of a comprehensive example pipeline ensures immediate usability and demonstration of this new feature.

Highlights

New QA Generation Method: Introduced support for synthesizing masked fill-in-blank Question-Answering (QA) pairs, enhancing the variety of generative tasks.
MaskedFillInBlankGenerator: Added a new MaskedFillInBlankGenerator class that rephrases graph nodes and edges into coherent text, then randomly masks a node's name to create fill-in-blank questions.
TriplePartitioner: Implemented a TriplePartitioner to extract distinct (node, edge, node) triples from a graph, which serves as input for the new generator.
Example Pipeline: Provided a complete example, including a README, a shell script, and a YAML configuration, to demonstrate how to generate masked fill-in-blank QAs.
Output Format Support: Extended the base generator to support a new QA_pairs output data format for generated questions and answers.

Changelog

examples/generate/generate_masked_fill_in_blank_qa/README.md
- Added a new README file for the masked fill-in-blank QA generation example.
examples/generate/generate_masked_fill_in_blank_qa/generate_masked_fill_in_blank.sh
- Added a new shell script to execute the masked fill-in-blank QA generation pipeline.
examples/generate/generate_masked_fill_in_blank_qa/masked_fill_in_blank_config.yaml
- Added a new YAML configuration file to define the pipeline for generating masked fill-in-blank QA pairs.
graphgen/bases/base_generator.py
- Modified format_generation_results to support the new QA_pairs output format, returning question and answer as a dictionary.
graphgen/models/__init__.py
- Imported and registered MaskedFillInBlankGenerator and TriplePartitioner for dynamic loading within the graphgen.models module.
graphgen/models/generator/__init__.py
- Imported MaskedFillInBlankGenerator to make it available within the generator module.
graphgen/models/generator/masked_fill_in_blank_generator.py
- Added a new MaskedFillInBlankGenerator class that rephrases input graph nodes and edges into coherent text and then masks a randomly selected node to form a fill-in-blank question.
graphgen/models/partitioner/__init__.py
- Imported TriplePartitioner to make it available within the partitioner module.
graphgen/models/partitioner/triple_partitioner.py
- Added a new TriplePartitioner class that partitions a graph into distinct (node, edge, node) triples, ignoring isolated points and yielding triples in BFS order.
graphgen/operators/generate/generate_service.py
- Integrated the MaskedFillInBlankGenerator into the GenerateService to handle the masked_fill_in_blank generation method.
graphgen/operators/partition/partition_service.py
- Integrated the TriplePartitioner into the PartitionService to handle the triple partitioning method.
- Made method_params optional in the PartitionService constructor to accommodate partitioners that may not require them.

Activity

The pull request introduces a new feature to generate masked fill-in-blank QA pairs.
New example files, including a README, a shell script, and a configuration file, have been added to demonstrate the new functionality.
Core logic for the MaskedFillInBlankGenerator and TriplePartitioner has been implemented.
Existing service classes (GenerateService and PartitionService) have been updated to integrate the new generator and partitioner.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces support for synthesizing masked fill-in-the-blank QA pairs. It adds a new TriplePartitioner to break down the graph into triples and a MaskedFillInBlankGenerator for generating the QA pairs. The overall approach is sound and the changes are well-structured. I've identified one critical issue that could lead to a runtime error, along with several high-severity issues concerning violations of base class contracts and the use of global state, which should be addressed. I've also included some medium-severity suggestions to enhance code quality and maintainability.

gemini-code-assist · 2026-02-04T10:48:12Z

graphgen/models/generator/masked_fill_in_blank_generator.py

+        mask_pattern = re.compile(re.escape(mask_node_name), re.IGNORECASE)
+        masked_context = mask_pattern.sub("___", context)
+        # For accuracy, extract the actual replaced text from the context as the ground truth (keeping the original case)
+        gth = re.search(mask_pattern, context).group(0)


The call to re.search(mask_pattern, context) can return None if the mask_node_name is not found in the context. This would cause a crash with an AttributeError when .group(0) is called. You should add a check to handle this case gracefully, for example by logging a warning and returning an empty list.

Suggested change

-        gth = re.search(mask_pattern, context).group(0)
+        match = re.search(mask_pattern, context)
+        if not match:
+            logger.warning(
+                "Could not find mask_node_name '%s' in the rephrased context. Context: %s",
+                mask_node_name,
+                context,
+            )
+            return []
+        gth = match.group(0)
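
The guard suggested above can be sketched as a standalone helper. This is a minimal illustration rather than the PR's code: `mask_entity`, its `blank` parameter, and its return shape are hypothetical, and only the regex logic mirrors the diff.

```python
import logging
import re
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def mask_entity(
    context: str, entity: str, blank: str = "___"
) -> Optional[Tuple[str, str]]:
    """Return (masked_context, ground_truth), or None if the entity is absent.

    Hypothetical helper for illustration; not part of GraphGen.
    """
    pattern = re.compile(re.escape(entity), re.IGNORECASE)
    match = pattern.search(context)
    if match is None:
        # A failed search yields None, so calling .group(0) on it
        # would raise AttributeError. Bail out early instead.
        logger.warning("Could not find %r in context: %s", entity, context)
        return None
    # Take the ground truth from the context to keep its original casing.
    return pattern.sub(blank, context), match.group(0)
```

For example, `mask_entity("Marie Curie discovered polonium.", "marie curie")` returns `("___ discovered polonium.", "Marie Curie")`, while a context that never mentions the entity returns `None` instead of raising.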

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/generator/masked_fill_in_blank_generator.py

+from graphgen.templates import AGGREGATED_GENERATION_PROMPT
+from graphgen.utils import detect_main_language, logger
+
+random.seed(42)


Setting a global random seed with random.seed(42) is generally discouraged as it affects the entire application's random number generation, which can lead to unexpected behavior in other parts of the code. For reproducibility, it's better to create a local random.Random instance within your class, for example in the __init__ method, and use that for random operations like random.choice on line 103.
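
A minimal sketch of the local-generator approach, assuming the class only needs reproducible choices. `NodePicker` and its `seed` parameter are hypothetical names used for illustration; the PR's class has a different interface.

```python
import random
from typing import Optional, Sequence


class NodePicker:
    """Hypothetical stand-in for a generator that needs reproducible choices."""

    def __init__(self, seed: Optional[int] = 42) -> None:
        # A private generator: seeding it leaves the module-level
        # `random` state untouched for the rest of the application.
        self._rng = random.Random(seed)

    def pick(self, names: Sequence[str]) -> str:
        # Same call as random.choice, but drawn from the private generator.
        return self._rng.choice(names)
```

Two instances created with the same seed produce the same sequence of picks, and callers elsewhere that rely on the global `random` functions are unaffected. The same pattern applies to `shuffle`.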

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/generator/masked_fill_in_blank_generator.py

+    @staticmethod
+    def parse_response(response: str) -> dict:
+        pass


The parse_response method is defined as an abstract method in the BaseGenerator class but is implemented with pass here. Additionally, the return type hint dict is incompatible with the base class's list[dict]. Since this method is not used in the overridden generate method, it should either be implemented correctly or raise NotImplementedError to adhere to the abstract base class contract.

Suggested change

-    @staticmethod
-    def parse_response(response: str) -> dict:
-        pass
+    @staticmethod
+    def parse_response(response: str) -> list[dict]:
+        raise NotImplementedError("This method is not used in MaskedFillInBlankGenerator as it overrides the `generate` method.")
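
To illustrate why raising is preferable to a bare `pass`, here is a small sketch of the contract. `GeneratorBase` and `MaskedGenerator` are simplified, hypothetical stand-ins; the real `BaseGenerator` has more members, and it is an assumption here that its `parse_response` is a static abstract method.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class GeneratorBase(ABC):
    """Simplified stand-in for BaseGenerator (assumed shape)."""

    @staticmethod
    @abstractmethod
    def parse_response(response: str) -> list[dict]:
        ...


class MaskedGenerator(GeneratorBase):
    @staticmethod
    def parse_response(response: str) -> list[dict]:
        # Fail loudly: a bare `pass` would silently return None,
        # which contradicts the declared list[dict] return type.
        raise NotImplementedError(
            "parse_response is unused because generate() is overridden."
        )
```

The subclass can still be instantiated because the abstract method is overridden, but any accidental call surfaces immediately as a `NotImplementedError` rather than as a `None` propagating downstream.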

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/partitioner/triple_partitioner.py

+from graphgen.bases import BaseGraphStorage, BasePartitioner
+from graphgen.bases.datatypes import Community
+
+random.seed(42)


Setting a global random seed with random.seed(42) is generally discouraged as it affects the entire application's random number generation. This can lead to unexpected behavior in other parts of the code. For reproducibility, it's better to create a local random.Random instance within your class, for example in the __init__ method, and use that for random operations like random.shuffle.

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/partitioner/triple_partitioner.py

+    def partition(
+        self,
+        g: BaseGraphStorage,
+        **kwargs: Any,
+    ) -> Iterable[Community]:
+        nodes = [n[0] for n in g.get_all_nodes()]
+        random.shuffle(nodes)
+
+        visited_nodes: Set[str] = set()
+        used_edges: Set[frozenset[str]] = set()
+
+        for seed in nodes:
+            if seed in visited_nodes:
+                continue
+
+            # start BFS in a connected component
+            queue = deque([seed])
+            visited_nodes.add(seed)
+
+            while queue:
+                u = queue.popleft()
+
+                for v in g.get_neighbors(u):
+                    edge_key = frozenset((u, v))
+
+                    # if this edge has not been used, a new triple has been found
+                    if edge_key not in used_edges:
+                        used_edges.add(edge_key)
+
+                        # use the edge name to ensure the uniqueness of the ID
+                        u_sorted, v_sorted = sorted((u, v))
+                        yield Community(
+                            id=f"{u_sorted}-{v_sorted}",
+                            nodes=[u_sorted, v_sorted],
+                            edges=[(u_sorted, v_sorted)]
+                        )
+
+                    # continue to BFS
+                    if v not in visited_nodes:
+                        visited_nodes.add(v)
+                        queue.append(v)


The partition method in the base class BasePartitioner is type-hinted to return a List[Community]. This implementation returns an Iterable[Community] because it uses yield, which violates the Liskov Substitution Principle and the base class contract. To fix this, you should change this function to build and return a list of communities instead of using yield.
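
A list-returning variant might look like the sketch below. It assumes a plain adjacency dictionary in place of `BaseGraphStorage` and a hypothetical `Triple` dataclass in place of `Community`, and it omits the node shuffle from the original so the result is deterministic.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Triple:
    """Hypothetical stand-in for graphgen's Community."""

    id: str
    nodes: List[str]
    edges: List[Tuple[str, str]]


def partition_triples(adjacency: Dict[str, List[str]]) -> List[Triple]:
    """Collect one Triple per undirected edge, visiting nodes in BFS order."""
    triples: List[Triple] = []
    visited: Set[str] = set()
    used_edges: Set[frozenset] = set()

    for seed in adjacency:
        if seed in visited:
            continue
        # Start a BFS in a connected component not yet explored.
        queue = deque([seed])
        visited.add(seed)
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, []):
                edge_key = frozenset((u, v))
                if edge_key not in used_edges:
                    used_edges.add(edge_key)
                    a, b = sorted((u, v))
                    # Append to the list instead of yielding.
                    triples.append(
                        Triple(id=f"{a}-{b}", nodes=[a, b], edges=[(a, b)])
                    )
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return triples
```

With `{"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}` this returns two triples with ids `a-b` and `a-c`; the isolated node `d` contributes nothing. An alternative that keeps the streaming logic is to move the generator into a private helper and have `partition` return `list(helper(...))`.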

gemini-code-assist · 2026-02-04T10:48:13Z

examples/generate/generate_masked_fill_in_blank_qa/masked_fill_in_blank_config.yaml

+    dependencies:
+      - build_kg
+    params:
+      method: triple 


There is a trailing whitespace character on this line. While YAML is often tolerant of this, it's best practice to remove it to avoid potential parsing issues and maintain a clean configuration file.

method: triple

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/bases/base_generator.py

+
+        if output_data_format == "QA_pairs":
+            return {
+                "question": question, 


There is a trailing whitespace after question,. Please remove it to maintain code style consistency.

Suggested change

"question": question,

"question": question,

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/generator/masked_fill_in_blank_generator.py

+        # TODO: configure add_context
+        #     if add_context:
+        #         original_ids = [
+        #             node["source_id"].split("<SEP>")[0] for node in _process_nodes
+        #         ] + [edge[2]["source_id"].split("<SEP>")[0] for edge in _process_edges]
+        #         original_ids = list(set(original_ids))
+        #         original_text = await text_chunks_storage.get_by_ids(original_ids)
+        #         original_text = "\n".join(
+        #             [
+        #                 f"{index + 1}. {text['content']}"
+        #                 for index, text in enumerate(original_text)
+        #             ]
+        #         )


This large block of commented-out code seems to be related to a future feature (add_context). It's better to remove commented-out code from the codebase to improve readability. If this logic is needed for future reference, it should be tracked in an issue or a separate branch.

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/generator/masked_fill_in_blank_generator.py

+        if not context:
+            return []
+
+        nodes, edge = batch


The variable edge is used to unpack the second item from batch, but based on the assertion on line 100 (len(edge) == 1), it appears to be a list of edges. For clarity and to avoid confusion, consider renaming it to edges here and on line 100.

Suggested change

-        nodes, edge = batch
+        nodes, edges = batch

gemini-code-assist · 2026-02-04T10:48:13Z

graphgen/models/partitioner/triple_partitioner.py

+                    # continue to BFS
+                    if v not in visited_nodes:
+                        visited_nodes.add(v)
+                        queue.append(v)


The file is missing a newline character at the end. It's a common convention to end files with a newline to prevent issues with file concatenation and some version control tools.

Suggested change

queue.append(v)

queue.append(v)

feat: support synthesizing masked fill_in_blank QA pairs

64547af

github-actions bot added the documentation, core, and examples labels · Feb 4, 2026

gemini-code-assist bot reviewed Feb 4, 2026

style: fix formatting issues

4c1c32a

superfarther wants to merge 2 commits into InternScience:main from superfarther:yzh/masked_fill_in_blank
