Feat/methylation filtering #283
base: main
Pull request overview
This PR adds filtering of sex chromosome probes to the UMAP generation pipeline and generates lists of probes that are affected by SNPs or do not map to the genome. The changes enhance the methylation workflow by providing more granular control over probe filtering and making filtered probe lists available as outputs.
Key changes:
- Added sex chromosome probe filtering capability to the UMAP generation
- Generated and output lists of SNP-affected probes and non-genomic probes
- Implemented a batched concatenation mechanism for large probe lists
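The batched concatenation idea can be sketched outside WDL as follows. This is only an illustration under stated assumptions: `merge_unique` and `batched_merge` are hypothetical names, the default `max_length` is arbitrary, and the PR does not state why batching is needed or how the workflow sizes its batches. The sketch merges the inputs directly when there are few enough of them, and otherwise merges fixed-size batches into intermediate files and repeats on those.

```python
import tempfile
from pathlib import Path


def merge_unique(paths, out_path):
    """Concatenate line-oriented files into one sorted, de-duplicated file."""
    lines = set()
    for path in paths:
        lines.update(Path(path).read_bytes().splitlines())
    # Sorting bytes gives a locale-independent (byte-order) result.
    Path(out_path).write_bytes(b"".join(line + b"\n" for line in sorted(lines)))
    return out_path


def batched_merge(paths, out_path, max_length=100):
    """Merge `paths`, never combining more than `max_length` files at once.

    `max_length` is a hypothetical parameter mirroring the threshold
    used in the workflow's conditional; the value 100 is arbitrary.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    paths = [str(p) for p in paths]
    if len(paths) <= max_length:
        # Small enough: a single merge suffices.
        return merge_unique(paths, out_path)
    # Otherwise merge each batch into an intermediate file in a fresh directory.
    workdir = Path(tempfile.mkdtemp(prefix="probe_batches_"))
    intermediates = []
    for start in range(0, len(paths), max_length):
        batch_out = workdir / f"batch_{start // max_length}.txt"
        intermediates.append(merge_unique(paths[start:start + max_length], batch_out))
    # Repeat on the intermediates, which may still exceed max_length.
    return batched_merge(intermediates, out_path, max_length)
```

Because each intermediate file is already de-duplicated, the final merge yields the same set of probes as a single merge over all inputs would.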
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| workflows/methylation/methylation-standard.wdl | Added new outputs and batched probe list concatenation logic |
| workflows/methylation/methylation-preprocess.wdl | Added task to list sex chromosome probes and updated outputs |
| workflows/methylation/methylation-cohort.wdl | Integrated sex probe filtering into the cohort workflow |
| workflows/methylation/CHANGELOG.md | Documented new probe list outputs |
| scripts/methylation/methylation-preprocess.R | Added logic to identify and output SNP-affected and non-genomic probes |
| scripts/methylation/list-sex-probes.R | New script to generate sex chromosome probe list |
| scripts/methylation/filter.py | Added support for excluding probes from additional file sources |
| scripts/CHANGELOG.md | Documented script changes |
| docker/pandas/package.json | Incremented revision for pandas container |
| docker/minfi/package.json | Incremented revision for minfi container |
| docker/minfi/Dockerfile | Added new list-sex-probes.R script to container |
    good_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]

    write.table(
      good_probes,
Copilot AI, Dec 19, 2025
The variable name good_probes is misleading. These are sex chromosome probes, not necessarily 'good' probes. Consider renaming to sex_probes or sex_chromosome_probes for clarity.
Suggested change:

    - good_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]
    - write.table(
    -   good_probes,
    + sex_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]
    + write.table(
    +   sex_probes,
    if (probelist_length <= max_length){
        call concat_and_uniq as simple_merge { input:
            files_to_combine = probe_files,
            output_file_name = "probes_with_snps.csv",
Copilot AI, Dec 19, 2025
Inconsistent file extension: this output file name uses .csv extension while the corresponding task in the probelist_length > max_length branch (line 73) uses .txt. The extension should be consistent across both branches, likely .txt to match the other branch.
Suggested change:

    - output_file_name = "probes_with_snps.csv",
    + output_file_name = "probes_with_snps.txt",
    if (non_genomic_probelist_length <= max_length){
        call concat_and_uniq as simple_merge_non_genomic { input:
            files_to_combine = non_genomic_probe_list,
            output_file_name = "non_genomic_probes.csv",
Copilot AI, Dec 19, 2025
Inconsistent file extension: this output file name uses .csv extension while the corresponding task in the non_genomic_probelist_length > max_length branch (line 112) uses .txt. The extension should be consistent across both branches, likely .txt to match the other branch.
Suggested change:

    - output_file_name = "non_genomic_probes.csv",
    + output_file_name = "non_genomic_probes.txt",
    command <<<
        set -euo pipefail
        sort ~{sep(" ", quote(files_to_combine))} | uniq > "~{output_file_name}"
Copilot AI, Dec 19, 2025
Without a fixed locale, the sort command may order lines differently across environments. Consider setting LC_ALL=C to ensure consistent sorting behavior, and using sort -u instead of sort | uniq to avoid the extra process.
Suggested change:

    - sort ~{sep(" ", quote(files_to_combine))} | uniq > "~{output_file_name}"
    + LC_ALL=C sort -u ~{sep(" ", quote(files_to_combine))} > "~{output_file_name}"
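To illustrate what the C locale changes, here is a small Python sketch of byte-order sorting with de-duplication, which is the ordering `LC_ALL=C sort -u` uses when comparing whole lines. The function name `c_locale_sort_unique` and the probe-like IDs are hypothetical and chosen only for the example.

```python
def c_locale_sort_unique(lines):
    """Return the unique lines ordered by raw byte value.

    Comparing encoded bytes ignores locale collation rules entirely,
    so the result is identical on every machine.
    """
    unique_bytes = {line.encode("utf-8") for line in lines}
    return [item.decode("utf-8") for item in sorted(unique_bytes)]


if __name__ == "__main__":
    # Uppercase letters sort before lowercase ones in byte order,
    # and "cg10" precedes "cg2" because "1" < "2" byte-wise.
    print(c_locale_sort_unique(["cg10", "Cg2", "cg2", "cg10"]))
```

Other locales may interleave upper and lower case or apply different tie-breaking rules, which is why pinning the locale makes the merged probe lists reproducible across environments.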
Add filtering of sex chromosomes to the UMAP generation. Also generate a list of probes that have SNPs.
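As a rough sketch of how excluding probes listed in additional files might work, consider the following. It is not the code in `scripts/methylation/filter.py`; the function names `load_probe_ids` and `exclude_probes` and the one-ID-per-line file layout are assumptions made for illustration.

```python
from pathlib import Path


def load_probe_ids(path):
    """Read one probe ID per line, skipping blank lines (assumed layout)."""
    text = Path(path).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def exclude_probes(probe_ids, exclusion_files):
    """Drop every probe that appears in any of the exclusion files.

    The input order of `probe_ids` is preserved for the probes that remain.
    """
    excluded = set()
    for path in exclusion_files:
        excluded |= load_probe_ids(path)
    return [probe for probe in probe_ids if probe not in excluded]
```

For example, passing a sex chromosome probe list and a SNP-affected probe list as `exclusion_files` would remove both groups before the remaining probes are used downstream.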
Before submitting this PR, please make sure:
- For changes to the scripts/ or docker/ directories, please ensure any image versions have been incremented accordingly!