Skip to content

Conversation

@kanghelen
Copy link
Collaborator

Added scripts to adapt to additional file types (e.g. rE2G data).

@kanghelen kanghelen requested a review from Copilot October 28, 2025 17:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the V2G pipeline to support different ABC prediction formats by adding dynamic column name detection and improving robustness of variant filtering logic. The key changes include handling ABC predictions with either "ABC.Score" or "Score" column names, adding a hasPIP parameter to conditionally filter variants based on PIP availability, and expanding hepatocyte-related cell type definitions.

Key Changes

  • Dynamic detection of ABC score column names ("ABC.Score" vs "Score") across multiple R scripts to support different ABC prediction formats
  • Addition of hasPIP parameter to conditionally apply PIP filtering only when PIP information is available
  • Enhanced hepatocyte cell type definitions in the grouped cell type table

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
snakemake/workflow/scripts/helper_functions.R Fixed strsplit usage by adding unlist() to prevent issues with list concatenation
snakemake/workflow/scripts/group.peakoverlap.result.R Added hasPIP parameter, debug print statements, and standardized write.table calls
snakemake/workflow/scripts/group.ABC.result.R Added hasPIP parameter, dynamic ABC.Score column detection, changed 0 values to NA for better data handling
snakemake/workflow/scripts/get_credible_set_variants_info.R Fixed parameter name to variantList, added dynamic ABC score column detection
snakemake/workflow/scripts/create_locus_window.R Formatting improvements (added blank lines)
snakemake/workflow/scripts/combine_tables.R Code formatting improvements and removed redundant condition check
snakemake/workflow/scripts/celltype.overlap.R Added dynamic ABC score column detection
snakemake/workflow/scripts/CreateCS.R Added fill=T parameter to handle irregular table structures
snakemake/workflow/Snakefile Added hasPIP function and passed it to relevant rules; updated ABC overlap header generation
resources/grouped_celltype_table.txt Expanded hepatocyte-related cell type names
resources/ABC_overlap.header Updated header to match new ABC prediction format
README.md Enhanced documentation with clarifications for ABCPRED and CelltypeTable configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"),
make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"),
make_option("--hasPIP", type="logical"),
make_option("--helperFunctions", type="numeric")
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type for helperFunctions parameter should be "character" not "numeric", as it represents a file path string.

Suggested change
make_option("--helperFunctions", type="numeric")
make_option("--helperFunctions", type="character")

Copilot uses AI. Check for mistakes.

# group by credible set and target gene,calculate the max cell type-specific ABC.
ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, 0, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()
ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent handling of max() with NA values. Line 43 uses max(!!sym(cell_group_max_ABC)) without na.rm=TRUE, which will return NA if any values are NA. Line 49 uses max(!!sym(celltypes_in_cell_group), na.rm = TRUE). For consistency and correctness, line 43 should also include na.rm=TRUE.

Suggested change
ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()
ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC), na.rm=TRUE)) %>% ungroup()

Copilot uses AI. Check for mistakes.
@@ -1 +1 @@
chr start end variant PosteriorProb Disease CredibleSet chr start end name class activity_base TargetGene TargetGeneTSS TargetGeneExpression TargetGenePromoterActivityQuantile TargetGeneIsExpressed distance isSelfPromoter powerlaw_contact powerlaw_contact_reference hic_contact hic_contact_pl_scaled hic_pseudocount hic_contact_pl_scaled_adj ABC.Score.Numerator ABC.Score powerlaw.Score.Numerator powerlaw.Score CellType
chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tab character in the header. The header has "distanceToTSS.Featuren ormalizedDNase_prom.Feature" which should be "distanceToTSS.Feature NormalizedDNase_prom.Feature" (tab should separate "Feature" and "NormalizedDNase_prom.Feature").

Suggested change
chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score
chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Feature NormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score

Copilot uses AI. Check for mistakes.
Comment on lines +60 to +61
print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:")
print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table)
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded column name "OnlyIntersectsWithHepatocytesPeak" in debug print statement. This should use the dynamic column name from the variable celltype_specific_colnames to work correctly for all cell groups. Replace with !!sym(celltype_specific_colnames) or similar dynamic reference.

Suggested change
print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:")
print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table)
print(paste0("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe ", celltype_specific_colnames, " column table summary:"))
print(combined.celltype.credibleset.peak %>% pull(!!sym(celltype_specific_colnames)) %>% table)

Copilot uses AI. Check for mistakes.
celltype_col=paste0(cellgroup, "CellTypesWithEPInteractions")
celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE)
celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE)
celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE)
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in variable name: "MacABC" should be "MaxABC" for consistency with the naming pattern used elsewhere in the code (e.g., "MaxABC.Score", "MaxABC.Rank").

Suggested change
celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE)
celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MaxABC",cellgroup,"Variant")):=TRUE)

Copilot uses AI. Check for mistakes.
ALLPEAKS: the absolute path to the directory containing [cell type-specific peaks](https://mitra.stanford.edu/engreitz/public/SchnitzlerKang2023/EnhancerList.minus150).
CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"
CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "catagorizing" should be "categorizing".

Suggested change
CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"
CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for categorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants