Skip to content

Conversation

@adthrasher
Copy link
Member

@adthrasher adthrasher commented Dec 9, 2025

Add filtering of sex chromosomes to the UMAP generation. Also generate a list of probes that have SNPs.

Before submitting this PR, please make sure:

  • You have added a few sentences describing the PR here.
  • The code passes all CI tests without any errors or warnings.
  • You have added tests (when appropriate).
  • You have added an entry in any relevant CHANGELOGs (when appropriate).
  • If you have made any changes to the scripts/ or docker/ directories, please ensure any image versions have been incremented accordingly!
  • You have updated the README or other documentation to account for these changes (when appropriate).

@adthrasher adthrasher marked this pull request as ready for review December 16, 2025 21:15
@adthrasher adthrasher self-assigned this Dec 16, 2025
@adthrasher adthrasher added the enhancement New feature or request label Dec 16, 2025
@adthrasher adthrasher requested a review from Copilot December 19, 2025 18:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds filtering of sex chromosome probes to the UMAP generation pipeline and generates lists of probes that are affected by SNPs or do not map to the genome. The changes enhance the methylation workflow by providing more granular control over probe filtering and making filtered probe lists available as outputs.

Key changes:

  • Added sex chromosome probe filtering capability to the UMAP generation
  • Generated and output lists of SNP-affected probes and non-genomic probes
  • Implemented a batched concatenation mechanism for large probe lists

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
workflows/methylation/methylation-standard.wdl Added new outputs and batched probe list concatenation logic
workflows/methylation/methylation-preprocess.wdl Added task to list sex chromosome probes and updated outputs
workflows/methylation/methylation-cohort.wdl Integrated sex probe filtering into the cohort workflow
workflows/methylation/CHANGELOG.md Documented new probe list outputs
scripts/methylation/methylation-preprocess.R Added logic to identify and output SNP-affected and non-genomic probes
scripts/methylation/list-sex-probes.R New script to generate sex chromosome probe list
scripts/methylation/filter.py Added support for excluding probes from additional file sources
scripts/CHANGELOG.md Documented script changes
docker/pandas/package.json Incremented revision for pandas container
docker/minfi/package.json Incremented revision for minfi container
docker/minfi/Dockerfile Added new list-sex-probes.R script to container

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +7 to +10
good_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]

write.table(
good_probes,
Copy link

Copilot AI Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name good_probes is misleading. These are sex chromosome probes, not necessarily 'good' probes. Consider renaming to sex_probes or sex_chromosome_probes for clarity.

Suggested change
good_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]
write.table(
good_probes,
sex_probes <- rownames(ann)[ann$chr == "chrX" | ann$chr == "chrY"]
write.table(
sex_probes,

Copilot uses AI. Check for mistakes.
if (probelist_length <= max_length){
call concat_and_uniq as simple_merge { input:
files_to_combine = probe_files,
output_file_name = "probes_with_snps.csv",
Copy link

Copilot AI Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent file extension: this output file name uses .csv extension while the corresponding task in the probelist_length > max_length branch (line 73) uses .txt. The extension should be consistent across both branches, likely .txt to match the other branch.

Suggested change
output_file_name = "probes_with_snps.csv",
output_file_name = "probes_with_snps.txt",

Copilot uses AI. Check for mistakes.
if (non_genomic_probelist_length <= max_length){
call concat_and_uniq as simple_merge_non_genomic { input:
files_to_combine = non_genomic_probe_list,
output_file_name = "non_genomic_probes.csv",
Copy link

Copilot AI Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent file extension: this output file name uses .csv extension while the corresponding task in the non_genomic_probelist_length > max_length branch (line 112) uses .txt. The extension should be consistent across both branches, likely .txt to match the other branch.

Suggested change
output_file_name = "non_genomic_probes.csv",
output_file_name = "non_genomic_probes.txt",

Copilot uses AI. Check for mistakes.
command <<<
set -euo pipefail
sort ~{sep(" ", quote(files_to_combine))} | uniq > "~{output_file_name}"
Copy link

Copilot AI Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort command without arguments may produce inconsistent results across different locales. Consider using sort -u instead of sort | uniq for better performance, or add LC_ALL=C to ensure consistent sorting behavior.

Suggested change
sort ~{sep(" ", quote(files_to_combine))} | uniq > "~{output_file_name}"
LC_ALL=C sort -u ~{sep(" ", quote(files_to_combine))} > "~{output_file_name}"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants