This workflow processes raw Nanopore sequencing data (pod5) with Dorado,
calling both canonical and modified bases, and produces aligned BAM files
per sample.
It is split into two pipelines because each needs a different type of node.
Basecalling runs on a GPU node; all steps downstream of basecalling run on
a CPU node as a single pipeline, pre-processing.
The steps are:
- Basecalling: Given an input directory with the raw pod5 files, runs basecalling for canonical bases and for 5mC and 6mA modifications.
- Demultiplexing: Each basecalled bam is demultiplexed according to the kit used.
- Alignment: The basecalled bam for each barcode is aligned, sorted and indexed, then renamed to the sample name associated with the barcode.
- Merge and sort: All the aligned bams for a given sample are merged and sorted.
- 6mA filtering and nucleosome tag addition: The merged bam is filtered for 6mA, and nucleosome tags are added using ft add-nucleosomes. This last step is required for FIRE.
The pipeline is located in
/project/spott/dveracruz/Dorado_nanopore/workflow.
For a given run, the following parameters are needed:
- Output directory: Full path of the folder where the results are stored (it is created if it does not exist).
- Input directory: Full path of the folder that contains the subfolders with the raw pod5 files.
- Kit name: Name of the kit used, necessary for correct demultiplexing.
- barcodes: Barcodes used. Currently each barcode maps to exactly one sample (1 barcode -> 1 sample).
- sample_names: Sample names to use, one per barcode, so the list has the same length as barcodes.
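The pairing rule for barcodes and sample_names can be checked before launching a run. The following is a minimal sketch, not part of the workflow: the function name barcode_to_sample is hypothetical, and it assumes the config has already been loaded into a Python dict with the keys barcodes and sample_names. Rejecting duplicates is this sketch's reading of "1 barcode -> 1 sample".

```python
def barcode_to_sample(config):
    """Return a {barcode: sample_name} dict, enforcing 1 barcode -> 1 sample.

    Hypothetical helper for illustration; `config` is assumed to be the
    run configuration already parsed into a dict.
    """
    barcodes = config["barcodes"]
    sample_names = config["sample_names"]

    # The two lists are paired by position, so they must have equal length.
    if len(barcodes) != len(sample_names):
        raise ValueError(
            f"{len(barcodes)} barcodes but {len(sample_names)} sample names"
        )

    # Assumption of this sketch: a repeated barcode or sample name would
    # break the one-to-one pairing, so both lists must be free of duplicates.
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("duplicate entries in barcodes")
    if len(set(sample_names)) != len(sample_names):
        raise ValueError("duplicate entries in sample_names")

    return dict(zip(barcodes, sample_names))


if __name__ == "__main__":
    example = {"barcodes": ["09", "10"], "sample_names": ["EcoGII_GpC", "EcoGII"]}
    print(barcode_to_sample(example))
```

Running the check up front avoids spending GPU time on a run whose demultiplexed files cannot be renamed afterwards.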
This pipeline is based on snakemake.
It calls the conda environment fiber_sq, mainly for two key programs:
- samtools=1.19.2
- snakemake=7.32.4
The environment pacbio is used for the modkit and ft tools.
A copy of the package list for each of these environments is in workflow/conda_env.
The Dorado version is dorado=0.4.3.
The following models are used for basecalling and modification detection:
- model_base: dna_r10.4.1_e8.2_400bps_sup@v4.2.0
- model_6mA: dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2
- model_5mC: dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2
A config file like the following is needed for each run. It includes the input and output directories, the kit name, the barcodes and the sample names.
The copy used for the test example is at workflow/run_samples.yaml.
## Config file for Dorado analysis
in_dir: '/project/spott/dveracruz/Dorado_nanopore/test/raw'
out_dir: '/project/spott/dveracruz/Dorado_nanopore/test'
kit_name: "SQK-NBD114-24"
## Barcodes to demultiplex: 01 to 24.
## Barcodes used per sample, and sample names. as python lists, ['01','02','03'...,'24']
barcodes: ['09','10']
sample_names: ['EcoGII_GpC','EcoGII']
#barcodes: ['09']
#sample_names: ['EcoGII_GpC']
## FIXED PARAMETERS - DO NOT CHANGE UNLESS NEW DORADO VERSION OR MODELS
dorado_exec: '/project/spott/lizarraga/nanopore/dorado/dorado-0.4.3-linux-x64/bin/dorado'
## Dorado models.
dorado_models_dir: '/project/spott/lizarraga/nanopore/dorado/dorado_models'
model_base: 'dna_r10.4.1_e8.2_400bps_sup@v4.2.0'
model_6mA: 'dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2'
model_5mC: 'dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2'
The basecalling step performs basecalling from the raw pod5 files, including modified bases given the models for 5mC and 6mA. This step runs on a GPU node.
To run the basecalling step, use run_np_basecalling.sh,
which runs snakemake and keeps the logs together.
If for some reason this does not work, the script
scripts/dorado_basecall.sh should create the same output structure as
the snakemake file.
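For orientation, a basecalling call could be assembled from the config values roughly as follows. This is a sketch under assumptions, not the command used by run_np_basecalling.sh or scripts/dorado_basecall.sh, whose contents are not shown here. The function name build_basecall_command is hypothetical. The dorado options used (the basecaller subcommand, --modified-bases-models with a comma-separated list of model paths, and --recursive) should be checked against the installed dorado version. The command is only built, not executed.

```python
import os


def build_basecall_command(config, recursive=True):
    """Return the argument list for a `dorado basecaller` call.

    Hypothetical helper: it only assembles the arguments from the config
    keys shown in the example config file and does not run anything.
    """
    models_dir = config["dorado_models_dir"]

    # Assumption: the modified-base models are passed together as one
    # comma-separated list of model paths.
    mod_models = ",".join(
        os.path.join(models_dir, config[key]) for key in ("model_5mC", "model_6mA")
    )

    command = [
        config["dorado_exec"],
        "basecaller",
        os.path.join(models_dir, config["model_base"]),  # canonical-base model
        config["in_dir"],                                # folder with pod5 files
        "--modified-bases-models",
        mod_models,
    ]
    if recursive:
        # Assumption: pod5 files sit in subfolders of in_dir.
        command.append("--recursive")
    return command


if __name__ == "__main__":
    example = {
        "dorado_exec": "dorado",
        "dorado_models_dir": "dorado_models",
        "model_base": "dna_r10.4.1_e8.2_400bps_sup@v4.2.0",
        "model_6mA": "dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2",
        "model_5mC": "dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2",
        "in_dir": "raw",
    }
    print(" ".join(build_basecall_command(example)))
```

Dorado writes the basecalled BAM to standard output, so a caller would redirect the output of this command to a file on the GPU node.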
The pre-processing step includes demultiplexing, alignment and sorting, followed by
modkit filtering to omit the lowest 10% quantile of the 6mA calls, and
addition of nucleosome tags using ft add-nucleosomes.
To run the pre-processing step, use run_np_preprocessing.sh,
which runs snakemake and keeps the logs together.
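The idea behind omitting the lowest 10% quantile can be illustrated with a few lines of Python. This is only a conceptual sketch, not how modkit implements its filter. It assumes each 6mA call carries a confidence value between 0 and 1, and the function name filter_low_confidence is hypothetical.

```python
from statistics import quantiles


def filter_low_confidence(confidences, percentile=10):
    """Drop the calls whose confidence lies below the given percentile.

    Conceptual illustration only: the threshold is the `percentile`-th
    percentile of the observed confidences, and calls below it are removed.
    """
    # quantiles(n=100) returns the 99 percentile cut points, so the
    # 10th percentile sits at index 9.
    cut_points = quantiles(confidences, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    kept = [c for c in confidences if c >= threshold]
    return threshold, kept


if __name__ == "__main__":
    # 100 evenly spaced toy confidences: 0.01, 0.02, ..., 1.00
    toy = [i / 100 for i in range(1, 101)]
    threshold, kept = filter_low_confidence(toy)
    print(f"threshold={threshold:.3f}, kept {len(kept)} of {len(toy)} calls")
```

Because the threshold is derived from the data itself, its value changes from run to run, while the fraction of discarded calls stays close to 10%.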

