Merged
6 changes: 3 additions & 3 deletions .github/workflows/main.yml
@@ -2,9 +2,9 @@ name: CI

on:
push:
branches: [main, dev]
branches: [main]
pull_request:
branches: [main, dev]
branches: [main]

jobs:
Formatting:
@@ -47,7 +47,7 @@ jobs:
with:
directory: .test
snakefile: workflow/Snakefile
args: "--sdm conda --show-failed-logs --cores 1 --conda-cleanup-pkgs cache -n"
args: "--sdm conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"

- name: Test report
uses: snakemake/snakemake-github-action@v2.0.0
22 changes: 21 additions & 1 deletion .test/config/config.yml
@@ -1,9 +1,29 @@
samplesheet: "config/samples.csv"
outdir: "results"
tool: ["prokka"]

pgap:
  bin: "path/to/pgap.py"
  use_yaml_config: True
  prepare_yaml_files:
    generic: "config/generic.yaml"
    submol: "config/submol.yaml"

prokka:
  center: ""
  extra: "--addgenes"

bakta:
  download_db: "light"
  existing_db: ""
  extra: "--keep-contig-headers --compliant"

quast:
  reference_fasta: ""
  reference_gff: ""
  extra: ""

panaroo:
  skip: False
  remove_source: "cmsearch"
  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
  extra: "--clean-mode strict --remove-invalid-genes"
2 changes: 1 addition & 1 deletion .test/config/samples.csv
@@ -1,2 +1,2 @@
sample,species,strain,id_prefix,file
EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
62 changes: 57 additions & 5 deletions README.md
@@ -4,23 +4,39 @@
[![GitHub actions status](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with apptainer](https://img.shields.io/badge/run%20with-apptainer-1D355C.svg?labelColor=000000)](https://apptainer.org/)
[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing)

A Snakemake workflow for the post-processing of microbial genome assemblies.

- [snakemake-assembly-postprocessing](#snakemake-assembly-postprocessing)
- [Usage](#usage)
- [Workflow overview](#workflow-overview)
- [Installation](#installation)
- [Deployment options](#deployment-options)
- [Authors](#authors)
- [References](#references)

## Usage

The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing).

Detailed information about input data and workflow configuration can also be found in the [`config/README.md`](config/README.md).

If you use this workflow in a paper, please credit the authors by citing the URL of this repository.

## Workflow overview
_Workflow overview:_

1. Parse `samples.csv` table containing the samples' metadata (`python`)
2. Annotate assemblies using NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap))
<img src="resources/images/dag.svg" align="center" />

## Requirements
## Workflow overview

- [PGAP](https://github.com/ncbi/pgap)
1. Parse `samples.csv` table containing the samples' metadata (`python`)
2. Annotate assemblies using one of the following tools:
   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: PGAP needs to be installed manually
   2. [prokka](https://github.com/tseemann/prokka), a fast and lightweight prokaryotic annotation tool
   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
3. Create a QC report for the assemblies using [Quast](https://github.com/ablab/quast)
4. Create a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)

## Installation

@@ -46,9 +62,37 @@ conda activate snakemake-assembly-postprocessing

**Step 4: Install PGAP**

- If you want to use [PGAP](https://github.com/ncbi/pgap) for annotation, it needs to be installed separately
- PGAP can be downloaded from https://github.com/ncbi/pgap. Please follow the installation instructions there.
- Define the path to the `pgap.py` script (located in the `scripts` folder) in the `config` file (recommended: `./resources`)

## Deployment options

To run the workflow from the command line, change to the working directory.

```bash
cd snakemake-assembly-postprocessing
```

Adjust options in the default config file `config/config.yml`.
Before running the complete workflow, you can perform a dry run using:

```bash
snakemake --cores 1 --dry-run
```

To run the workflow with test files using **conda**:

```bash
snakemake --cores 2 --sdm conda --directory .test
```

To run the workflow with test files using **apptainer**:

```bash
snakemake --cores 2 --sdm conda apptainer --directory .test
```

## Authors

- Dr. Rina Ahmed-Begrich
@@ -61,6 +105,14 @@ conda activate snakemake-assembly-postprocessing

## References

> Seemann T. _Prokka: rapid prokaryotic genome annotation_. Bioinformatics. **2014** Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.

> Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. _Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification_. Microb Genom, 7(11):000685 **2021**. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.

> Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. _RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation._ Nucleic Acids Res, **2021** Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. _QUAST: quality assessment tool for genome assemblies_. Bioinformatics. 29(8):1072-5, **2013**. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.

> Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. _Producing polished prokaryotic pangenomes with the Panaroo pipeline_. Genome Biol. 21(1):180, **2020**. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.

> Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. _Sustainable data analysis with Snakemake_. F1000Research, 10:33, **2021**. https://doi.org/10.12688/f1000research.29032.2.
73 changes: 48 additions & 25 deletions config/README.md
@@ -1,32 +1,55 @@
## Workflow overview

A Snakemake workflow for the post-processing of microbial genome assemblies.

1. Parse `samples.csv` table containing the samples' metadata (`python`)
2. Annotate assemblies using one of the following tools:
   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: PGAP needs to be installed manually
   2. [prokka](https://github.com/tseemann/prokka), a fast and lightweight prokaryotic annotation tool
   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
3. Create a QC report for the assemblies using [Quast](https://github.com/ablab/quast)
4. Create a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)

## Running the workflow

### Input data

This workflow requires `fasta` input data.
The samplesheet table has the following layout:

| sample | species | strain | id_prefix | file |
| ----------- | ------------ | ------------- | ------------- | ------------- |
| EC2224 | "Streptococcus pyogenes" | SF370 | Spy | assembly.fasta |

### Execution

To run the workflow from command line, change to the working directory and activate the conda environment.

```bash
cd snakemake-assembly-postprocessing
conda activate snakemake-assembly-postprocessing
```

Adjust options in the default config file `config/config.yml`.
Before running the entire workflow, perform a dry run using:

```bash
snakemake --cores 1 --sdm conda --directory .test --dry-run
```

To run the workflow with test files using **conda**:

```bash
snakemake --cores 1 --sdm conda --directory .test
```
| sample | species | strain | id_prefix | file |
| ------ | ------------------------ | ------ | --------- | -------------- |
| EC2224 | "Streptococcus pyogenes" | SF370 | SPY | assembly.fasta |
| ... | ... | ... | ... | ... |

**Note:** Pangenome analysis with `Panaroo` requires at least two samples.
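
The two-sample requirement in the note above can be checked before launching the workflow. The following is a minimal sketch and not part of the workflow itself: the helper names `read_samplesheet` and `panaroo_possible` are hypothetical, and the inline CSV stands in for a real samplesheet file. Only the column layout is taken from the table above.

```python
import csv
import io

# Column names as shown in the samplesheet table.
REQUIRED_COLUMNS = ["sample", "species", "strain", "id_prefix", "file"]


def read_samplesheet(handle):
    """Parse a samplesheet and return its rows as dictionaries."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    return list(reader)


def panaroo_possible(rows):
    """Return True if there are at least two distinct samples."""
    return len({row["sample"] for row in rows}) >= 2


# Inline stand-in for a samplesheet with a single sample.
example = io.StringIO(
    "sample,species,strain,id_prefix,file\n"
    'EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"\n'
)
rows = read_samplesheet(example)
print(panaroo_possible(rows))
```

With a single-sample sheet like this one, the check reports that a pangenome analysis is not possible.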

### Parameters

This table lists all parameters that can be used to run the workflow.

| Parameter | Type | Details | Default |
|:---|:---|:---|:---|
| **samplesheet** | string | Path to the sample sheet file in csv format | |
| **tool** | array[string] | Annotation tool to use (one of `prokka`, `pgap`, `bakta`) | |
| **pgap** | | PGAP configuration object | |
| bin | string | Path to the PGAP script | |
| use_yaml_config | boolean | Whether to use YAML configuration for PGAP | `False` |
| _prepare_yaml_files_ | | Paths to YAML templates for PGAP | |
| generic | string | Path to the generic YAML configuration file | |
| submol | string | Path to the submol YAML configuration file | |
| **prokka** | | Prokka configuration object | |
| center | string | Center name for Prokka annotation (used in sequence IDs) | |
| extra | string | Extra command-line arguments for Prokka | `--addgenes` |
| **bakta** | | Bakta configuration object | |
| download_db | string | Bakta database type (`full`, `light`, or `none`) | `light` |
| existing_db | string | Path to an existing Bakta database (optional). Needs to be combined with `download_db='none'` | |
| extra | string | Extra command-line arguments for Bakta | `--keep-contig-headers --compliant` |
| **quast** | | QUAST configuration object | |
| reference_fasta | string | Path to the reference genome for QUAST | |
| reference_gff | string | Path to the reference annotation for QUAST | |
| extra | string | Extra command-line arguments for QUAST | |
| **panaroo** | | Panaroo configuration object | |
| remove_source | string | Source types to remove in Panaroo (regex supported) | `cmsearch` |
| remove_feature | string | Feature types to remove in Panaroo (regex supported) | `tRNA\|rRNA\|ncRNA\|exon\|sequence_feature` |
| extra | string | Extra command-line arguments for Panaroo | `--clean-mode strict --remove-invalid-genes` |
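
Some of the constraints in the table can be illustrated with a small pre-flight check. This is a hedged sketch only: the workflow validates its configuration through its own schema file, and the function `check_config` below is a hypothetical helper operating on an already-loaded config dictionary. The allowed values are copied from the table above.

```python
# Allowed values as listed in the parameter table.
ALLOWED_TOOLS = {"prokka", "pgap", "bakta"}
ALLOWED_BAKTA_DB = {"full", "light", "none"}


def check_config(config):
    """Return a list of problems found in an already-loaded config dict."""
    problems = []

    # `tool` is an array of strings, each naming a supported annotation tool.
    tools = config.get("tool")
    if not isinstance(tools, list):
        problems.append("tool must be a list")
    else:
        unknown = [t for t in tools if t not in ALLOWED_TOOLS]
        if unknown:
            problems.append(f"unknown tool(s): {unknown}")

    bakta = config.get("bakta", {})
    db_type = bakta.get("download_db")
    if db_type not in ALLOWED_BAKTA_DB:
        problems.append(
            f"bakta.download_db must be one of {sorted(ALLOWED_BAKTA_DB)}"
        )
    # An existing database only makes sense when nothing is downloaded.
    if bakta.get("existing_db") and db_type != "none":
        problems.append("bakta.existing_db requires bakta.download_db to be 'none'")
    return problems


# Values mirroring the defaults in the table.
defaults = {
    "tool": ["prokka"],
    "bakta": {
        "download_db": "light",
        "existing_db": "",
        "extra": "--keep-contig-headers --compliant",
    },
}
print(check_config(defaults))
```

An empty list means no problems were found. Setting `existing_db` to a path while leaving `download_db` at `light` would yield one message.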
22 changes: 21 additions & 1 deletion config/config.yml
@@ -1,9 +1,29 @@
samplesheet: "config/samples.csv"
outdir: "results"
tool: ["prokka"]

pgap:
  bin: "path/to/pgap.py"
  use_yaml_config: True
  prepare_yaml_files:
    generic: "config/generic.yaml"
    submol: "config/submol.yaml"

prokka:
  center: ""
  extra: "--addgenes"

bakta:
  download_db: "light"
  existing_db: ""
  extra: "--keep-contig-headers --compliant"

quast:
  reference_fasta: ""
  reference_gff: ""
  extra: ""

panaroo:
  skip: False
  remove_source: "cmsearch"
  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
  extra: "--clean-mode strict --remove-invalid-genes"
71 changes: 68 additions & 3 deletions config/schemas/config.schema.yml
@@ -6,9 +6,15 @@ properties:
samplesheet:
type: string
description: Path to the sample sheet file
outdir:
type: string
description: Output directory for results
tool:
type: array
description: Annotation tool to use
items:
type: string
enum:
- prokka
- pgap
- bakta
pgap:
type: object
properties:
@@ -34,7 +40,66 @@ properties:
- bin
- use_yaml_config
- prepare_yaml_files
prokka:
type: object
properties:
center:
type: string
description: Center name for Prokka annotation (used in sequence IDs)
extra:
type: string
description: Extra command-line arguments for Prokka
required:
- center
- extra
bakta:
type: object
properties:
download_db:
type: string
description: Bakta database type, one of 'full', 'light', or 'none' if existing is used
existing_db:
type: string
description: Path to an existing Bakta database (optional)
extra:
type: string
description: Extra command-line arguments for Bakta
required:
- download_db
- existing_db
- extra
quast:
type: object
properties:
reference_fasta:
type: string
description: Path to the reference genome for QUAST
reference_gff:
type: string
description: Path to the reference annotation for QUAST
extra:
type: string
description: Extra command-line arguments for QUAST
panaroo:
type: object
properties:
skip:
type: boolean
description: Whether to skip Panaroo analysis
remove_source:
type: string
description: Source types to remove in Panaroo (regex supported)
remove_feature:
type: string
description: Feature types to remove in Panaroo (regex supported)
extra:
type: string
description: Extra command-line arguments for Panaroo

required:
- samplesheet
- tool
- pgap
- prokka
- bakta
- quast
1 change: 1 addition & 0 deletions resources/.gitignore
@@ -0,0 +1 @@
.*