Remove hardcoded metadata, other improvements #67
Conversation
- Use awk instead of Python to avoid reading files into memory
- `sample` is the unique identifier within samplesheet, no need to include other metadata columns for the purposes of clustering.
- reformatted sample_calc.py to only add sample as a metadata column; the rest of the metadata columns are now merged in the quarto notebook based on sample (a sketch of this merge follows this list)
- added adjustable parameters allowing the user to specify which metadata columns to use for notebook charts
- cleaned up a lot of code within the notebook for aesthetics, reducing code redundancy
Merge from main
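A minimal sketch of the merge-on-`sample` approach described above, assuming both the per-sample stats and the samplesheet carry a `sample` column (the column and file names here are illustrative, not taken from the pipeline):

```python
import pandas as pd

# Hypothetical per-sample statistics keyed only by `sample`
stats = pd.DataFrame({"sample": ["s1", "s2"], "clonality": [0.42, 0.37]})

# Hypothetical samplesheet carrying the remaining metadata columns
meta = pd.DataFrame({
    "sample": ["s1", "s2"],
    "timepoint": ["baseline", "week4"],
    "file": ["s1.tsv", "s2.tsv"],
})

# Merge the metadata back in on the unique `sample` identifier,
# rather than threading every metadata column through the calc step.
merged = stats.merge(meta.drop(columns=["file"]), on="sample", how="left")
print(merged)
```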
Unit Test Results: 10 tests, 10 ✅, 3m 6s ⏱️. Results for commit 8760a31. ♻️ This comment has been updated with latest results.
Pull request overview
This PR removes hardcoded metadata assumptions from the TCRtoolkit pipeline, making it more flexible and configurable. The changes transition from a rigid metadata structure (subject_id, timepoint, origin) to a user-configurable approach where metadata columns are read from the sample table and plotting parameters can be specified via configuration.
Key Changes:
- Replaced hardcoded metadata column names with configurable parameters throughout the pipeline
- Refactored the sample statistics notebook to use dynamic metadata and plotting functions (a rough plotting sketch follows this list)
- Simplified data processing scripts to remove metadata handling and use only sample names
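As a rough illustration of the configurable-plotting idea: the notebook's actual plotting library and parameter names are not shown in this PR view, so the function, library choice (seaborn), and column names below are assumptions only.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_metric(df: pd.DataFrame, metric: str, x_col: str, color_col: str):
    """Reusable plot driven by user-configurable column names."""
    ax = sns.boxplot(data=df, x=x_col, y=metric, hue=color_col)
    ax.set_title(f"{metric} by {x_col}")
    return ax

# Hypothetical usage with configurable parameters (e.g. supplied via nextflow.config)
df = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "timepoint": ["pre", "post", "pre", "post"],
    "origin": ["PBMC", "PBMC", "tumor", "tumor"],
    "shannon": [3.1, 2.8, 3.4, 2.9],
})
plot_metric(df, metric="shannon", x_col="timepoint", color_col="origin")
plt.show()
```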
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| notebooks/sample_stats_template.qmd | Converted from hardcoded metadata fields to configurable parameters; refactored plotting to use a reusable function; updated V gene family plotting to support dynamic grouping columns |
| nextflow.config | Added new configuration parameters for sample chart x/color columns and V gene grouping |
| modules/local/sample/sample_plot.nf | Updated to pass new configuration parameters to the Quarto rendering process |
| modules/local/sample/sample_calc.nf | Simplified to pass only sample name instead of full metadata JSON |
| bin/sample_calc.py | Removed metadata handling logic; now only uses sample name for output files |
| bin/compare_concatenate.py | Removed hardcoded metadata column handling and patient ID construction |
| modules/local/compare/tcrsharing.nf | Improved performance using vectorized pandas operations for sample mapping (see the pandas sketch after this table) |
| modules/local/compare/gliph2.nf | Updated to use 'sample' column instead of constructed 'patient' identifier |
| modules/local/compare/giana.nf | Replaced Python script with more efficient awk-based header insertion |
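The tcrsharing.nf row above mentions vectorized pandas operations for sample mapping. A minimal sketch of that general pattern follows; the real module's column names and lookup logic are not shown here, so these are placeholders.

```python
import pandas as pd

# Hypothetical clone table and a file-to-sample lookup
clones = pd.DataFrame({
    "source_file": ["a.tsv", "b.tsv", "a.tsv"],
    "cdr3": ["CASSLGQGYEQYF", "CASSPDRGNTEAFF", "CASSQETQYF"],
})
file_to_sample = {"a.tsv": "s1", "b.tsv": "s2"}

# Row-wise apply (slow on large tables):
# clones["sample"] = clones["source_file"].apply(lambda f: file_to_sample[f])

# Vectorized mapping of the same lookup:
clones["sample"] = clones["source_file"].map(file_to_sample)
print(clones)
```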
Comments suppressed due to low confidence (1)
modules/local/compare/gliph2.nf:36
- The gliph2 R script is constructed with raw Nextflow parameters `params.local_min_pvalue`, `params.simulation_depth`, `params.kmer_min_depth`, and `params.local_min_OVE` inserted directly into the R code. If a caller can override these params, they can supply values like `0.001); system("malicious_command"); #` so that the generated R script executes arbitrary OS commands via `system()` or otherwise alters control flow. These parameters should be strictly validated as numeric/structured values or safely quoted and parsed within R so that user-controlled input cannot change the R code structure. (A validation sketch follows the code excerpt below.)
```r
result <- turboGliph::gliph2(
  cdr3_sequences = df,
  result_folder = "./",
  lcminp = ${params.local_min_pvalue},
  sim_depth = ${params.simulation_depth},
  kmer_mindepth = ${params.kmer_min_depth},
  lcminove = ${params.local_min_OVE},
```
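A minimal sketch of the kind of validation the comment above suggests, written in Python purely for illustration; the actual guard would live in the Nextflow process or the generated R code, and the function name here is hypothetical.

```python
# Hypothetical guard: reject parameter values that are not plain numbers
# before interpolating them into generated code.
def require_numeric(name: str, value: str) -> float:
    try:
        return float(value)
    except (TypeError, ValueError):
        raise ValueError(f"Parameter {name} must be numeric, got: {value!r}")

# Example: a value such as '0.001); system("rm -rf /")' would be rejected here.
lcminp = require_numeric("local_min_pvalue", "0.001")
```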
notebooks/sample_stats_template.qmd (Outdated)
```python
# 4. Loading data
## reading sample metadata
meta = pd.read_csv(sample_table, sep=',')
meta_cols = meta.columns.tolist()[0:-1]
```
Copilot AI · Dec 17, 2025
The `meta_cols` calculation uses `meta.columns.tolist()[0:-1]`, which excludes the last column. This assumes the last column is always the 'file' column that should be excluded. However, this assumption is fragile: if the column order in the sample_table CSV changes, this will break. Consider explicitly selecting metadata columns or excluding the 'file' column by name instead.
```diff
- meta_cols = meta.columns.tolist()[0:-1]
+ meta_cols = [c for c in meta.columns if c != 'file']
```
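To make the comment's point concrete, here is a tiny example using a hypothetical column order where 'file' is not last:

```python
import pandas as pd

# If 'file' happens not to be the last column, positional slicing drops the wrong column.
meta = pd.DataFrame(columns=["sample", "file", "timepoint"])

print(meta.columns.tolist()[0:-1])               # ['sample', 'file'] -> still contains 'file'
print([c for c in meta.columns if c != "file"])  # ['sample', 'timepoint'] -> robust to ordering
```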
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
- add logic to handle submitting one grouping metadata column versus multiple for the V gene chart (a sketch follows below)
No description provided.
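One way the single-versus-multiple grouping logic mentioned above might look; this is a sketch only, and the helper name, comma-separated input handling, and column names are assumptions rather than the notebook's actual implementation.

```python
import pandas as pd

def normalize_grouping(group_cols):
    """Accept either a single column name (optionally comma-separated) or a list of names."""
    if isinstance(group_cols, str):
        return [c.strip() for c in group_cols.split(",") if c.strip()]
    return list(group_cols)

# Hypothetical V gene counts grouped by the configured column(s)
df = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "v_family": ["TRBV5", "TRBV6", "TRBV5", "TRBV7"],
    "count": [10, 4, 7, 2],
})
for cols in ("sample", ["sample", "v_family"]):
    grouped = df.groupby(normalize_grouping(cols))["count"].sum()
    print(grouped)
```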