Generalize to running rE2G data #1

kanghelen · 2025-10-28T17:55:00Z

Added scripts to adapt to additional file types (e.g. rE2G data).

Copilot

Pull Request Overview

This PR enhances the V2G pipeline to support different ABC prediction formats by adding dynamic column name detection and improving robustness of variant filtering logic. The key changes include handling ABC predictions with either "ABC.Score" or "Score" column names, adding a hasPIP parameter to conditionally filter variants based on PIP availability, and expanding hepatocyte-related cell type definitions.

Key Changes

Dynamic detection of ABC score column names ("ABC.Score" vs "Score") across multiple R scripts to support different ABC prediction formats
Addition of hasPIP parameter to conditionally apply PIP filtering only when PIP information is available
Enhanced hepatocyte cell type definitions in the grouped cell type table

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
snakemake/workflow/scripts/helper_functions.R	Fixed `strsplit` usage by adding `unlist()` to prevent issues with list concatenation
snakemake/workflow/scripts/group.peakoverlap.result.R	Added `hasPIP` parameter, debug print statements, and standardized `write.table` calls
snakemake/workflow/scripts/group.ABC.result.R	Added `hasPIP` parameter, dynamic ABC.Score column detection, changed 0 values to NA for better data handling
snakemake/workflow/scripts/get_credible_set_variants_info.R	Fixed parameter name to `variantList`, added dynamic ABC score column detection
snakemake/workflow/scripts/create_locus_window.R	Formatting improvements (added blank lines)
snakemake/workflow/scripts/combine_tables.R	Code formatting improvements and removed redundant condition check
snakemake/workflow/scripts/celltype.overlap.R	Added dynamic ABC score column detection
snakemake/workflow/scripts/CreateCS.R	Added `fill=T` parameter to handle irregular table structures
snakemake/workflow/Snakefile	Added `hasPIP` function and passed it to relevant rules; updated ABC overlap header generation
resources/grouped_celltype_table.txt	Expanded hepatocyte-related cell type names
resources/ABC_overlap.header	Updated header to match new ABC prediction format
README.md	Enhanced documentation with clarifications for ABCPRED and CelltypeTable configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-28T17:59:20Z

snakemake/workflow/scripts/group.ABC.result.R

-	make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"), 
+	make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"),
+	make_option("--hasPIP", type="logical"),
 	make_option("--helperFunctions", type="numeric")


The type for helperFunctions parameter should be "character" not "numeric", as it represents a file path string.

Suggested change

make_option("--helperFunctions", type="numeric")

make_option("--helperFunctions", type="character")

Copilot · 2025-10-28T17:59:20Z

snakemake/workflow/scripts/group.ABC.result.R


 # group by credible set and target gene,calculate the max cell type-specific ABC.
-ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, 0, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()
+ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()


Inconsistent handling of max() with NA values. Line 43 uses max(!!sym(cell_group_max_ABC)) without na.rm=TRUE, which will return NA if any values are NA. Line 49 uses max(!!sym(celltypes_in_cell_group), na.rm = TRUE). For consistency and correctness, line 43 should also include na.rm=TRUE.

Suggested change

ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()

ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC), na.rm=TRUE)) %>% ungroup()

Copilot · 2025-10-28T17:59:20Z

resources/ABC_overlap.header

@@ -1 +1 @@
-chr	start	end	variant	PosteriorProb	Disease	CredibleSet	chr	start	end	name	class	activity_base	TargetGene	TargetGeneTSS	TargetGeneExpression	TargetGenePromoterActivityQuantile	TargetGeneIsExpressed	distance	isSelfPromoter	powerlaw_contact	powerlaw_contact_reference	hic_contact	hic_contact_pl_scaled	hic_pseudocount	hic_contact_pl_scaled_adj	ABC.Score.Numerator	ABC.Score	powerlaw.Score.Numerator	powerlaw.Score	CellType
+chr	start	end	name	class	TargetGene	TargetGeneEnsemblID	TargetGeneTSS	isSelfPromoter	CellType	distanceToTSS.Featuren	ormalizedDNase_prom.Feature	3DContact.Feature	ABC.Score.Feature	numCandidateEnhGene.Feature	numTSSEnhGene.Feature	sumNearbyEnhancers.Feature	ubiquitousExpressedGene.Feature	Score


Missing tab character in the header. The header has "distanceToTSS.Featuren ormalizedDNase_prom.Feature" which should be "distanceToTSS.Feature NormalizedDNase_prom.Feature" (tab should separate "Feature" and "NormalizedDNase_prom.Feature").

Suggested change

chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score

chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Feature NormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score

Copilot · 2025-10-28T17:59:20Z

snakemake/workflow/scripts/group.peakoverlap.result.R

+        print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:")
+        print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table)


Hardcoded column name "OnlyIntersectsWithHepatocytesPeak" in debug print statement. This should use the dynamic column name from the variable celltype_specific_colnames to work correctly for all cell groups. Replace with !!sym(celltype_specific_colnames) or similar dynamic reference.

Suggested change

print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:")

print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table)

print(paste0("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe ", celltype_specific_colnames, " column table summary:"))

print(combined.celltype.credibleset.peak %>% pull(!!sym(celltype_specific_colnames)) %>% table)

Copilot · 2025-10-28T17:59:21Z

snakemake/workflow/scripts/get_credible_set_variants_info.R

    celltype_col=paste0(cellgroup, "CellTypesWithEPInteractions")
-    celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE) 
+    celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE)
    celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>%  mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE)


Typo in variable name: "MacABC" should be "MaxABC" for consistency with the naming pattern used elsewhere in the code (e.g., "MaxABC.Score", "MaxABC.Rank").

Suggested change

celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE)

celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MaxABC",cellgroup,"Variant")):=TRUE)

Copilot · 2025-10-28T17:59:21Z

README.md

 ALLPEAKS: the absolute path to the directory containing [cell type-specific peaks](https://mitra.stanford.edu/engreitz/public/SchnitzlerKang2023/EnhancerList.minus150). 

-CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"
+CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"


Typo: "catagorizing" should be "categorizing".

Suggested change

CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"

CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for categorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"

Helen Yihua Kang added 2 commits October 28, 2025 10:47

adding scripts to run on rE2G files

88365ae

clean up interactive session file paths

a8c7107

kanghelen requested a review from Copilot October 28, 2025 17:55

Copilot AI reviewed Oct 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Generalize to running rE2G data #1

Generalize to running rE2G data #1

Uh oh!

kanghelen commented Oct 28, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Copilot AI Oct 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	make_option("--helperFunctions", type="numeric")
	make_option("--helperFunctions", type="character")

	ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup()
	ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC), na.rm=TRUE)) %>% ungroup()

		@@ -1 +1 @@
		chr start end variant PosteriorProb Disease CredibleSet chr start end name class activity_base TargetGene TargetGeneTSS TargetGeneExpression TargetGenePromoterActivityQuantile TargetGeneIsExpressed distance isSelfPromoter powerlaw_contact powerlaw_contact_reference hic_contact hic_contact_pl_scaled hic_pseudocount hic_contact_pl_scaled_adj ABC.Score.Numerator ABC.Score powerlaw.Score.Numerator powerlaw.Score CellType
		chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score

		print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:")
		print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table)

	CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"
	CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for categorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt"

Generalize to running rE2G data #1

Are you sure you want to change the base?

Generalize to running rE2G data #1

Uh oh!

Conversation

kanghelen commented Oct 28, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Key Changes

Reviewed Changes

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants