
feat: add interpro searcher#175

Open
ChenZiHong-Gavin wants to merge 1 commit into main from feat/interpro_searcher

Conversation

@ChenZiHong-Gavin
Collaborator

No description provided.

@github-actions github-actions bot added the core label Feb 5, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request adds a dedicated InterPro searcher to the 'graphgen' project. The new component queries protein domain architectures, Gene Ontology (GO) terms, and pathway annotations. It accepts either a protein sequence, which is submitted for InterProScan analysis, or a UniProt ID, which is looked up directly, so that this biological context can be used in the system's graph generation processes.

Highlights

  • InterPro Searcher Implementation: A new 'InterProSearch' class has been added, enabling the system to query protein domain and functional annotation data from the InterPro database.
  • Dual Search Capability: The 'InterProSearch' supports two primary search modes: submitting protein sequences for InterProScan analysis via the EBI API, and directly querying pre-computed domain information using UniProt accession numbers.
  • Automatic Query Type Detection: The main 'search' method determines whether the input query is a UniProt ID or a protein sequence and routes it to the matching search mechanism.
  • API Integration and Robustness: EBI API interactions use retry mechanisms ('tenacity') for network requests and polling logic to manage asynchronous job submissions.
  • Module Integration: The new searcher is integrated into the 'graphgen' model system and the 'SearchService' operator, making it accessible for use within the application by specifying 'interpro' as the data source.
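
The query type detection described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: `detect_query_type`, the return labels, and the accepted residue alphabet are assumptions made for illustration. The accession pattern is the one UniProt documents for its 6- and 10-character accessions.

```python
import re

# UniProt's documented accession pattern (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed alphabet: the 20 standard residues plus B, X, Z, U and O.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def detect_query_type(query: str) -> str:
    """Return "uniprot_id", "sequence", or "unknown" (hypothetical helper)."""
    text = query.strip().upper()
    if UNIPROT_ACCESSION.fullmatch(text):
        return "uniprot_id"
    # Drop FASTA header lines, then all whitespace, before checking residues.
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    residues = "".join("".join(lines).split())
    if residues and set(residues) <= PROTEIN_LETTERS:
        return "sequence"
    return "unknown"
```

Because the second character of every accession is a digit and protein sequences contain no digits, the two branches cannot both match the same input.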
Changelog
  • graphgen/models/__init__.py
    • Imported 'InterProSearch' class.
    • Added 'InterProSearch' to the '_module_lookup' dictionary for dynamic loading.
  • graphgen/models/searcher/db/interpro_searcher.py
    • New file implementing the 'InterProSearch' class.
    • Includes methods for submitting protein sequences to EBI InterProScan, polling job status, and retrieving/parsing results.
    • Provides functionality to search InterPro by UniProt accession numbers and extract domain, GO term, and pathway information.
    • Features utility methods for input validation and robust API communication with retry logic.
  • graphgen/operators/search/search_service.py
    • Modified '_init_searcher' method to dynamically load and initialize 'InterProSearch' when 'data_source' is 'interpro'.
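
The submit-then-poll flow mentioned for 'interpro_searcher.py' can be sketched as a bounded polling loop. This is an illustration under stated assumptions, not the file's actual contents: `poll_until_finished`, the injected `get_status` and `sleep` callables, and the status strings are all hypothetical stand-ins for the real EBI job status call.

```python
import time
from typing import Callable


def poll_until_finished(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until the job completes or the budget runs out.

    Returns True on "FINISHED" and False on timeout; raises RuntimeError on
    a terminal failure. The status strings here are assumed, not confirmed.
    """
    for attempt in range(max_polls):
        status = get_status()
        if status == "FINISHED":
            return True
        if status in {"ERROR", "FAILURE", "NOT_FOUND"}:
            raise RuntimeError(f"job ended with status {status}")
        # Still queued or running: wait before asking again,
        # but skip the pointless sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep(poll_interval)
    return False
```

Passing the status check and the sleep function as arguments keeps the loop testable without network access or real delays.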
Activity
  • No human activity (comments, reviews, etc.) has been recorded on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new searcher for the InterPro database. A critical security vulnerability, Server-Side Request Forgery (SSRF) / Path Traversal, was identified in the search_by_uniprot_id method due to unvalidated input in URL construction. Additionally, the InterProSearch class could benefit from improved flexibility and robustness, specifically by making polling parameters configurable and enhancing the resilience of search methods to network issues.

Comment on lines +365 to +367
url = f"https://www.ebi.ac.uk/interpro/api/entry/protein/uniprot/{accession}/"

response = requests.get(url, timeout=self.api_timeout)

security-medium medium

The search_by_uniprot_id method is vulnerable to Server-Side Request Forgery (SSRF) / Path Traversal because it constructs a URL using the accession parameter without proper sanitization or validation. An attacker could manipulate the request sent to the EBI InterPro API. Additionally, this method lacks proper error handling and retry mechanisms, making it less resilient to transient network failures.

        # Ensure accession is safe for URL construction
        import urllib.parse
        safe_accession = urllib.parse.quote(accession.strip().upper(), safe='')

        # Query InterPro REST API for UniProt entry
        url = f"https://www.ebi.ac.uk/interpro/api/entry/protein/uniprot/{safe_accession}/"
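
Percent-encoding neutralizes path separators, but a stricter option is to reject anything that does not look like an accession before building the URL. The sketch below combines both; `build_uniprot_url` and the `ACCESSION_CHARS` allow-list are hypothetical names chosen for this example and are not part of the PR.

```python
import re
import urllib.parse

INTERPRO_URL = "https://www.ebi.ac.uk/interpro/api/entry/protein/uniprot/{}/"
# Assumed allow-list: accessions are 6 to 10 uppercase alphanumerics.
ACCESSION_CHARS = re.compile(r"[A-Z0-9]{6,10}")


def build_uniprot_url(accession: str) -> str:
    """Validate the accession, then build the InterPro API URL."""
    cleaned = accession.strip().upper()
    if not ACCESSION_CHARS.fullmatch(cleaned):
        raise ValueError(f"not a plausible UniProt accession: {accession!r}")
    # Quoting is redundant after validation but kept as defense in depth.
    return INTERPRO_URL.format(urllib.parse.quote(cleaned, safe=""))
```

With quoting alone, an input such as `../P12345` becomes `..%2FP12345`: the slash is encoded, but the request is still sent with a meaningless identifier. Validating first fails fast and avoids the wasted round trip.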

Comment on lines +29 to +45
    def __init__(
        self,
        email: str = "graphgen@example.com",
        api_timeout: int = 30,
    ):
        """
        Initialize the InterPro Search client.

        Args:
            email (str): Email address for EBI API requests.
            api_timeout (int): Request timeout in seconds.
        """
        self.base_url = "https://www.ebi.ac.uk/Tools/services/rest/iprscan5"
        self.email = email
        self.api_timeout = api_timeout
        self.poll_interval = 5  # Fixed interval between status checks
        self.max_polls = 120  # Maximum polling attempts (10 minutes with 5s interval)

medium

The __init__ method can be improved in a couple of ways for better flexibility and adherence to external API policies:

  1. Configurable Polling: The polling interval and maximum number of polls are currently hardcoded. Making these configurable would allow users to adjust them for sequences that might require longer analysis times.
  2. Default Email: EBI services recommend providing a valid email address. Using a default placeholder is not ideal. It's good practice to warn the user if the default email is being used to encourage compliance with the service's usage policy.

I've suggested an updated __init__ method that addresses both points.

    def __init__(
        self,
        email: str = "graphgen@example.com",
        api_timeout: int = 30,
        poll_interval: int = 5,
        max_polls: int = 120,
    ):
        """
        Initialize the InterPro Search client.

        Args:
            email (str): Email address for EBI API requests.
            api_timeout (int): Request timeout in seconds.
            poll_interval (int): Interval in seconds between status checks.
            max_polls (int): Maximum number of polling attempts.
        """
        self.base_url = "https://www.ebi.ac.uk/Tools/services/rest/iprscan5"
        self.email = email
        if self.email == "graphgen@example.com":
            logger.warning(
                "Using default email for InterProSearch. It is recommended to provide a valid email address for EBI services."
            )
        self.api_timeout = api_timeout
        self.poll_interval = poll_interval
        self.max_polls = max_polls
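
One possible way to make a real address easy to supply is to fall back to an environment variable before the placeholder. This is only a sketch: `resolve_email` and the `EBI_EMAIL` variable name are assumptions introduced here, and neither appears in the PR or in the suggestion above.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)
PLACEHOLDER_EMAIL = "graphgen@example.com"


def resolve_email(email: Optional[str] = None, env_var: str = "EBI_EMAIL") -> str:
    """Prefer an explicit email, then the environment, then the placeholder."""
    chosen = email or os.environ.get(env_var) or PLACEHOLDER_EMAIL
    if chosen == PLACEHOLDER_EMAIL:
        # Same intent as the suggested warning: nudge users toward a real address.
        logger.warning("Using placeholder email %s for EBI requests.", chosen)
    return chosen
```

The constructor could then call `resolve_email(email)` instead of comparing against the default string inline.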
