Skip to content

6. Align extracted contig sequences

Tobias Hofmann edited this page Oct 26, 2015 · 11 revisions

align_contigs.py

After having identified and extracted all relevant contig sequences into one single fasta collection (output from the previous step), it is now time to build alignments. All you need for this step, is that fasta file from the previous contig-extraction (matches.fasta in the example command). However, there are a lot of available options for this script that can be fine-tuned in order to build good alignments. You will probably have to run the script multiple times with different settings in order to receive good alignment files (see section "Quality control" below for more details). For now we will just try to get the script running and introduce you to the available options.

Run the script

These are the available options that you are presented when running the script with the help command (--h flag).
usage: align_contigs.py [-h] --sequences SEQUENCES --output OUTPUT [--aligner {dialign,muscle,mafft}] [--output-format {fasta,nexus,phylip,clustal,emboss,stockholm}] [--verbosity {INFO,WARN,CRITICAL}] [--log-path LOG_PATH] [--no-trim] [--window WINDOW] [--proportion PROPORTION] [--threshold THRESHOLD] [--max-divergence MAX_DIVERGENCE] [--min-length MIN_LENGTH] [--ambiguous] [--cores CORES]

Here a brief explanation of the available flags:

--sequences: The fasta file containing the extracted contigs that match the target loci. (Required)

--output: The directory in which to store the resulting alignments. (Required)

--aligner: The alignment engine to use. You can choose between dialign, muscle, and mafft. (default: mafft)

--output-format: Here you can choose the format for your alignments. The choices include: fasta, nexus, phylip, clustal, emboss, stockholm. (default: fasta)

--verbosity: The logging level to use. Choose between INFO, WARN, CRITICAL. (default: INFO)

--log-path: The path to a directory where to store the log file. (default: None)

--no-trim: Align, but DO NOT trim alignments. Usually alignments are trimmed in order to remove stretches of single sequences that are only present for one or very few taxa (see further settings and thresholds below). If you don't want the script to trim your alignments type TRUE after this flag. (default: False)

--window: Sliding window size for trimming in bp. Changing the size of the window influences the yield of alignments fulfilling the below requirements and thresholds. A smaller window size results in more alignments being recovered. Window sizes <10 make the trimming of alignments redundant in most cases. (default: 10)

--proportion: The proportion of taxa required to have sequence at alignment ends. Everything below this threshold is trimmed for optimal alignment length. (default: 0.5)

--threshold: The proportion of sequences required within the window. Everything below this threshold is trimmed for optimal alignment length. (default: 0.5)

--max-divergence: The max proportion of sequence divergence allowed between any row of the alignment and the alignment consensus. Lowering this threshold can help to discard alignments containing paralogous sequences for some taxa. (default: 0.4)

--min-length: The minimum length of alignments to keep. (default: 80)

--ambiguous: Allows reads in alignments containing N-bases. Type TRUE after this flag in order to activate this option. (default: False)

--cores: Process alignments in parallel on multiple cores for faster processing. (default: 1)

#####Example:
python2.7 align_contigs.py --sequences path/to/matches-folder/matches.fasta --output path/to/alignment-folder/fasta --max-divergence 0.2

Note: Check the screen-output and/or the log-file in order to see how many of the possible alignments were discarded. You can increase the yield of alignments by experimenting the parameters/thresholds above. The default settings are chosen rather lenient in order to provide a sufficient yield of alignments.

Quality control

It is strongly recommendable and common practice that you check your alignments by hand (in an alignment viewer such as Bioedit, Geneious, etc.), in order to decide which of the alignment settings of this script to use for your data. Thorough inspection of your alignments might even take you back to step 4 of this workflow (Finding target contigs), if some of the contigs that you now see in the alignments, appear to be inaccurate matches (such as e.g. paralogous, see 3rd sequence from the bottom in alignment below); this can maybe be prevented by choosing more conservative settings for the matching process in step 4. Or maybe you find that you have too many missing taxa in most of your alignments and you want to change the match-settings to be a bit more lenient and in return catch more contigs.
Advice: You may save yourself some work of skipping through endless alignment files if you already move to the next step of joining the separate exons for each gene. In that step (which is very quick) you will drastically reduce the number of alignments which makes it easier to get a visual overview over your data.

image
Example of inaccurate contig-match for one sample (3rd sequence from the bottom). This will only become visible after visually inspecting your alignments and requires more conservative thresholds in the identification of target loci in step 4 of this workflow ("Finding target contigs").

Clone this wiki locally