Skip to content

8. Map and phase reads

Tobias Hofmann edited this page Jun 21, 2016 · 14 revisions

phase_alleles.py

In the previous step we created alignments for each locus. We could proceed with these alignments for phylogenetic inference and stop at this point with the data processing. But there is more we can get from the sequence capture data. In this step we phase the read data in order to look for different alleles at each locus.
At first the script creates a consensus sequence for each contig-alignment, resulting in a new reference sequence for each locus. Then the reads are mapped against this new reference and a bam-assembly file is created. Then this assembly file is phased in an attempt to split the reads belonging to different alleles into separate files. At the current stage this produces two separate assembly files, each representing a different allele. (Note: If you are working with multiploid organisms, you may have to phase the resulting assemblies again manually in order to sort out all different alleles).
After that, a consensus sequence is created from each phased assembly file, representing the individual allele sequences. All sequences are stored in one joint fasta file (joined_allele_sequences_all_samples.fasta) in the output folder.
In the output folder of this script you will find multiple files for each sample. In each sample-folder (ID_remapped) you find the main bam-files containing all reads mapped to the reference, as well as the separate phased allele bam-files (subfolder "phased") and the consensus sequences of all these bam-files (subfolder "final_fasta_files").
If you activate the --no_duplicates flag, the script will first remove all duplicates (subfolder "picard") and will continue with phasing and building the consensus sequences from these duplicate-free bam files.

####Config file
Before you run the script you need to create a config file, containing the paths to several programs that this script is using. The programs you need to have installed for this script are:

  • Samtools
  • CLC-AssemblyCell
  • Bcftools (usually part of Samtools distribution)
  • Vcfutils (usually part of Samtools distribution)
  • EMBOSS
  • Picard (Optional)

The control file should contain the header "[paths]", followed by a separate line for each program. Each line should consist of the program name, followed by a colon [:], followed by the full path to the program executable. The config file should therefore look as follows:

[paths]
samtools:/path/to/samtools
clc:/path/to/clc-assembly-cell
bcftools:/path/to/bcftools
vcfutils:/path/to/vcfutils.pl
emboss:/path/to/EMBOSS
picard:/path/to/picard-tools-1.97

An example control file can be found in the folder "example_files" in this repository. This file can be used without alterations by those working on the Gothenburg University cluster "Albiorix".

####Run the script

The basic syntax of the script is as follows:
usage: phase_alleles.py [-h] --reads READS --alignments ALIGNMENTS --config CONFIG --output OUTPUT [--no_duplicates] [--conservative] [--cores CORES]

Here an explanation of the different flags:

--reads: After this flag you give the name/path of the folder containing the trimmed reads. This folder should still have the exact substructure that it had, when it was created by the trimming script in this pipeline (step 2). If you changed something or if you are using reads that were cleaned/trimmed by different software, make sure that the reads are organized in a separate subfolder for each sample. The name of the subfolder has to start with the sample name, delimited with an underscore [_]. (Required)

--alignments: This flag requires the name/path of the folder containing the contig-alingments from the previous step (step 6 and/or 7 in this pipeline). You can either use the separate exon sequences (step 6) as input or the joined exon sequences (step 7). (Required)

--config: Give the path/name of the configuration file, containing the full paths to the following programs: samtools, clc-assembly-cell, bcftools, vcfutils, emboss, picard (Required)

--output: Where you want the results to be stored. (Required)

--no_duplicates: Use this flag if you want to clean the mapped reads from all duplicates with Picard. This can result in some cases in a considerable decrease in read depth. (default: False)

--conservative: Use this flag if you want to discard all base calls with limited certainty (covered by <3 reads). These uncertain base calls will then be replaced with the ambiguity character "N". (default: False)

--l: Define the fraction of the read that has to fulfil the similarity-threshold. E.g. 0.7 means that a 70% length-fraction of the read has to match the similarity threshold which is set by the --s function. (default: 0.7)

--s: Set a similarity threshold, defining how similar the read has to be to the reference in order to be a match. (default: 0.9)

--cores: For parallel processing you can choose the number of cores you want CLC to run on. (default: 1)

#####Example python2.7 phase_alleles.py --reads path/to/cleaned-reads-folder --alignments path/to/alignment-folder/joined/fasta --config config.txt --output path/to/phasing-results --cores 12

Clone this wiki locally