-
Notifications
You must be signed in to change notification settings - Fork 1
Generalize to running rE2G data #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the V2G pipeline to support different ABC prediction formats by adding dynamic column name detection and improving robustness of variant filtering logic. The key changes include handling ABC predictions with either "ABC.Score" or "Score" column names, adding a hasPIP parameter to conditionally filter variants based on PIP availability, and expanding hepatocyte-related cell type definitions.
Key Changes
- Dynamic detection of ABC score column names ("ABC.Score" vs "Score") across multiple R scripts to support different ABC prediction formats
- Addition of
hasPIPparameter to conditionally apply PIP filtering only when PIP information is available - Enhanced hepatocyte cell type definitions in the grouped cell type table
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| snakemake/workflow/scripts/helper_functions.R | Fixed strsplit usage by adding unlist() to prevent issues with list concatenation |
| snakemake/workflow/scripts/group.peakoverlap.result.R | Added hasPIP parameter, debug print statements, and standardized write.table calls |
| snakemake/workflow/scripts/group.ABC.result.R | Added hasPIP parameter, dynamic ABC.Score column detection, changed 0 values to NA for better data handling |
| snakemake/workflow/scripts/get_credible_set_variants_info.R | Fixed parameter name to variantList, added dynamic ABC score column detection |
| snakemake/workflow/scripts/create_locus_window.R | Formatting improvements (added blank lines) |
| snakemake/workflow/scripts/combine_tables.R | Code formatting improvements and removed redundant condition check |
| snakemake/workflow/scripts/celltype.overlap.R | Added dynamic ABC score column detection |
| snakemake/workflow/scripts/CreateCS.R | Added fill=T parameter to handle irregular table structures |
| snakemake/workflow/Snakefile | Added hasPIP function and passed it to relevant rules; updated ABC overlap header generation |
| resources/grouped_celltype_table.txt | Expanded hepatocyte-related cell type names |
| resources/ABC_overlap.header | Updated header to match new ABC prediction format |
| README.md | Enhanced documentation with clarifications for ABCPRED and CelltypeTable configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"), | ||
| make_option("--PIP", type="numeric", help="The PIP threshold to filter the variants"), | ||
| make_option("--hasPIP", type="logical"), | ||
| make_option("--helperFunctions", type="numeric") |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type for helperFunctions parameter should be "character" not "numeric", as it represents a file path string.
| make_option("--helperFunctions", type="numeric") | |
| make_option("--helperFunctions", type="character") |
|
|
||
| # group by credible set and target gene,calculate the max cell type-specific ABC. | ||
| ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, 0, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup() | ||
| ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup() |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent handling of max() with NA values. Line 43 uses max(!!sym(cell_group_max_ABC)) without na.rm=TRUE, which will return NA if any values are NA. Line 49 uses max(!!sym(celltypes_in_cell_group), na.rm = TRUE). For consistency and correctness, line 43 should also include na.rm=TRUE.
| ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC))) %>% ungroup() | |
| ABC.grouped.cellgroup <- ABC.grouped.cellgroup %>% group_by(CredibleSet, TargetGene) %>% mutate(!!sym(cell_group_max_ABC):=ifelse(!!sym(opt$cellGroup)==0, NA, !!sym(cell_group_max_ABC))) %>% mutate(!!sym(cell_group_max_ABC):=max(!!sym(cell_group_max_ABC), na.rm=TRUE)) %>% ungroup() |
| @@ -1 +1 @@ | |||
| chr start end variant PosteriorProb Disease CredibleSet chr start end name class activity_base TargetGene TargetGeneTSS TargetGeneExpression TargetGenePromoterActivityQuantile TargetGeneIsExpressed distance isSelfPromoter powerlaw_contact powerlaw_contact_reference hic_contact hic_contact_pl_scaled hic_pseudocount hic_contact_pl_scaled_adj ABC.Score.Numerator ABC.Score powerlaw.Score.Numerator powerlaw.Score CellType | |||
| chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score | |||
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing tab character in the header. The header has "distanceToTSS.Featuren ormalizedDNase_prom.Feature" which should be "distanceToTSS.Feature NormalizedDNase_prom.Feature" (tab should separate "Feature" and "NormalizedDNase_prom.Feature").
| chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Featuren ormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score | |
| chr start end name class TargetGene TargetGeneEnsemblID TargetGeneTSS isSelfPromoter CellType distanceToTSS.Feature NormalizedDNase_prom.Feature 3DContact.Feature ABC.Score.Feature numCandidateEnhGene.Feature numTSSEnhGene.Feature sumNearbyEnhancers.Feature ubiquitousExpressedGene.Feature Score |
| print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:") | ||
| print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table) |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded column name "OnlyIntersectsWithHepatocytesPeak" in debug print statement. This should use the dynamic column name from the variable celltype_specific_colnames to work correctly for all cell groups. Replace with !!sym(celltype_specific_colnames) or similar dynamic reference.
| print("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe OnlyIntersectsWithHepatocytesPeak column table summary:") | |
| print(combined.celltype.credibleset.peak %>% pull(OnlyIntersectsWithHepatocytesPeak) %>% table) | |
| print(paste0("group.peakoverlap.result.R, combined.celltype.credibleset.peak dataframe ", celltype_specific_colnames, " column table summary:")) | |
| print(combined.celltype.credibleset.peak %>% pull(!!sym(celltype_specific_colnames)) %>% table) |
| celltype_col=paste0(cellgroup, "CellTypesWithEPInteractions") | ||
| celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE) | ||
| celltype_specific_df_ABC_interactions <- cs2gene_df %>% select(CredibleSet, gene, !!sym(celltype_col)) %>% separate_rows(., !!sym(celltype_col), sep="\\|", convert = TRUE) | ||
| celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE) |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in variable name: "MacABC" should be "MaxABC" for consistency with the naming pattern used elsewhere in the code (e.g., "MaxABC.Score", "MaxABC.Rank").
| celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MacABC",cellgroup,"Variant")):=TRUE) | |
| celltype_specific_df_ABC_interactions <- merge(celltype_specific_df_ABC_interactions,ABC_overlap_df,by.x=c("CredibleSet",celltype_col, "gene"), by.y=c("CredibleSet", "CellType", "TargetGene")) %>% group_by(CredibleSet, gene) %>% mutate(ABC.rank = dense_rank(-ABC.Score)) %>% ungroup() %>% filter(ABC.rank==1) %>% rename(!!paste0(cellgroup, ".ABC.Score"):=ABC.Score) %>% select(-ABC.rank) %>% as.data.frame()%>% mutate(!!sym(paste0("MaxABC",cellgroup,"Variant")):=TRUE) |
| ALLPEAKS: the absolute path to the directory containing [cell type-specific peaks](https://mitra.stanford.edu/engreitz/public/SchnitzlerKang2023/EnhancerList.minus150). | ||
| CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt" | ||
| CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt" |
Copilot
AI
Oct 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "catagorizing" should be "categorizing".
| CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for catagorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt" | |
| CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. Please include names for all target cell types in this table. The table is for categorizing the cell types in ABCPRD and ALLPEAKS. E.g. {path}/V2G/resources/grouped_celltype_table.txt" |
Added scripts to adapt to additional file types (e.g. rE2G data).