Skip to content

Conversation

@dltamayo
Copy link
Collaborator

No description provided.

dltamayo and others added 5 commits December 15, 2025 14:15
Use awk instead of python to avoid reading files into memory
- `sample` is the unique identifier within samplesheet, no need to include other metadata columns for the purposes of clustering.
- reformatted sample_calc.py to only add sample as a metadata column; the rest of the metadata columns are now merged in the quarto notebook based on sample
- added adjustable parameters allowing user to specify which metadata columns to use for notebook charts
- cleaned up a lot of code within the notebook for aesthetics, reducing code redundancy
@github-actions
Copy link

github-actions bot commented Dec 17, 2025

Unit Test Results

10 tests   10 ✅  3m 6s ⏱️
 2 suites   0 💤
 1 files     0 ❌

Results for commit 8760a31.

♻️ This comment has been updated with latest results.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes hardcoded metadata assumptions from the TCRtoolkit pipeline, making it more flexible and configurable. The changes transition from a rigid metadata structure (subject_id, timepoint, origin) to a user-configurable approach where metadata columns are read from the sample table and plotting parameters can be specified via configuration.

Key Changes:

  • Replaced hardcoded metadata column names with configurable parameters throughout the pipeline
  • Refactored the sample statistics notebook to use dynamic metadata and plotting functions
  • Simplified data processing scripts to remove metadata handling and use only sample names

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
notebooks/sample_stats_template.qmd Converted from hardcoded metadata fields to configurable parameters; refactored plotting to use a reusable function; updated V gene family plotting to support dynamic grouping columns
nextflow.config Added new configuration parameters for sample chart x/color columns and V gene grouping
modules/local/sample/sample_plot.nf Updated to pass new configuration parameters to the Quarto rendering process
modules/local/sample/sample_calc.nf Simplified to pass only sample name instead of full metadata JSON
bin/sample_calc.py Removed metadata handling logic; now only uses sample name for output files
bin/compare_concatenate.py Removed hardcoded metadata column handling and patient ID construction
modules/local/compare/tcrsharing.nf Improved performance using vectorized pandas operations for sample mapping
modules/local/compare/gliph2.nf Updated to use 'sample' column instead of constructed 'patient' identifier
modules/local/compare/giana.nf Replaced Python script with more efficient awk-based header insertion
Comments suppressed due to low confidence (1)

modules/local/compare/gliph2.nf:36

  • The gliph2 R script is constructed with raw Nextflow parameters params.local_min_pvalue, params.simulation_depth, params.kmer_min_depth, and params.local_min_OVE inserted directly into the R code. If a caller can override these params, they can supply values like 0.001); system("malicious_command"); # so that the generated R script executes arbitrary OS commands via system() or otherwise alters control flow. These parameters should be strictly validated as numeric/structured values or safely quoted and parsed within R so that user-controlled input cannot change the R code structure.
    result <- turboGliph::gliph2(
        cdr3_sequences = df,
        result_folder = "./",
        lcminp = ${params.local_min_pvalue},
        sim_depth = ${params.simulation_depth},
        kmer_mindepth = ${params.kmer_min_depth},
        lcminove = ${params.local_min_OVE},

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# 4. Loading data
## reading sample metadata
meta = pd.read_csv(sample_table, sep=',')
meta_cols = meta.columns.tolist()[0:-1]
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meta_cols calculation uses meta.columns.tolist()[0:-1] which excludes the last column. This assumes the last column is always the 'file' column that should be excluded. However, this assumption is fragile - if the column order in the sample_table CSV changes, this will break. Consider explicitly selecting metadata columns or excluding the 'file' column by name instead.

Suggested change
meta_cols = meta.columns.tolist()[0:-1]
meta_cols = [c for c in meta.columns if c != 'file']

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- add logic to parse when submitting one grouping metadata column versus multiple for vgene chart
@dltamayo dltamayo merged commit ce634b8 into main Dec 23, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants