Aligning the reads
Storing alignment of vast number of reads in fasta format (as the program
s2f.py in ShoRAH does) is largely inefficient, so that a new format has been proposed:
to align next-generation sequencing reads against even complete genomes are available. In the following, we will how to map your reads to the reference and produce an alignment in SAM format. Finally, we will extract a single window from the SAM alignment and run
is a collection of tools to manipulate SAM files. We assume that you have this installed and that the executable
is in your PATH.
is a python module that works as wrapper for SAMtools.
Creating a SAM alignment with Novoalign
This section reports how to create a SAM alignment with
. We have a (large) fastq file from an Illumina run,
, coming from sequencing a set of amplicons of human genome. The list of amplicons (reference genomes) is in the file
Novoaling needs to work with an index for the references that can be created with Novoindex (shipped with Novoalign). Working in the directory
sam_test, the command to create index file
all_amplicons_bi.idx is listed below.
[user@host sam_test]$ novoindex all_amplicons_bi.idx all_amplicons_bi.fasta
# novoindex (2.7) - Universal k-mer index constructor.
# (C) 2008 NovoCraft
# novoindex all_amplicons_bi.idx all_amplicons_bi.fasta
# Creating 9 indexing threads.
# Building with 6-mer and step of 1 bp.
# novoindex construction dT = 0.0s
# Index memory size 0.000Gbyte.
Novoalign can now be run to map reads against the set of amplicons. The output is in SAM format. This step can take long, depending on the size of the reference and, above all, the number of reads.
[user@host sam_test]$ novoalign -d all_amplicons_bi.idx -f s_1.fastq -o SAM > s_1.sam
# novoalign (2.07.00MT - Aug 5 2010 @ 18:45:42) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -d all_amplicons_bi.idx -f s_1.fastq -o SAM
# Interpreting input files as Sanger FASTQ.
# Index Build Version: 2.7
# Hash length: 6
# Step size: 1
# Read Sequences: 18155712
# Aligned: 15226108
# Unique Alignment: 15225411
# Gapped Alignment: 240857
# Quality Filter: 1697233
# Homopolymer Filter: 10161
# Elapsed Time: 511.765 (sec.)
# CPU Time: 30.3 (min.)
Creating a sorted BAM alignment from a SAM alignment
As a preliminary step, the reference genome must be indexed with samtools.
[user@host sam_test]$ samtools faidx all_amplicons_bi.fasta
[user@host sam_test]$ ls all_amplicons_bi.fasta*
The following extract the sub alignments, sort them according to leftmost coordinate, write the output in BAM format and index the BAM file.
[user@host sam_test]$ samtools view -b -t all_amplicons_bi.fasta s_1.sam | samtools sort - s_1_all_amplicons
[samopen] SAM header is present: 11 sequences.
[bam_sort_core] merging from 8 files...
[user@host sam_test]$ samtools index s_1_all_amplicons.bam
[user@host sam_test]$ ls s_1_all_amplicons.*
Now the program included in ShoRAH
bam2msa.py extracts a single window from the alignment of all reads. The following command extracts reads that map to chromosome 'VHL_Ex.3-3' (sequence name in the fasta file of the amplicons) from position 1541 to 1610 with a minimum overlap of 65 bases.
[user@host sam_test]$ path_to_sho_bin/bam2msa.py -c 'VHL_Ex.3-3' -l 1541 -r 1610 -m 65 -o illu_ex_3.far s_1_all_amplicons.bam
[user@host sam_test]$ grep '>' illu_ex_3.far | wc -l
[user@host sam_test]$ head illu_ex_3_.far
>HWUSI-EAS000_1:1:1:228:489 | NoInsertions
>HWUSI-EAS000_1:1:1:259:211 | NoInsertions
>HWUSI-EAS000_1:1:1:304:711 | NoInsertions
>HWUSI-EAS000_1:1:1:764:609 | NoInsertions
The output option (
-o) had to be given immediately after the executable in previous release of the software. This constraint has now been removed. The output
illu_win.far is a file of aligned reads in fasta format, which we can use for local analysis.
The resulting MSA file can be used as input for the program
diri_sampler. Running as usual with the option
-h gives a list of options. The examples runs the program on the file
illu_win.far for 5000 iterations, 4000 of which are considered as burn-in, and 10 initial clusters. Each run of
diri_sampler creates 5 files. The run can be followed by observing the
[user@host sam_test]$ path_to_sho_bin/diri_sampler -i illu_ex_3_.far -j 5000 -t 1000 -K 10 &
[user@host sam_test]$ tail -f illu_ex_3_.smp
8 44 6763 0.953735 0.558661
9 44 6917 0.954089 0.55774
10 44 6842 0.954702 0.55774
11 44 6987 0.95484 0.557432
12 44 6892 0.955058 0.557125
[user@host sam_test]$ ls illu*
illu_ex_3-cor.fas illu_ex_3.dbg illu_ex_3.far illu_ex_3-freq.csv illu_ex_3.smp illu_ex_3-support.fas
For an explanation of what is included in all these files, see Intermediate files. The most important for our current analysis is
illu_ex_3-support.fas. Inspecting the head of the file we see the haplotypes with the highest support.
[user@host sam_test]$ head illu_ex_3-support.fas
We can discard haplotype number 4 and focus on the others. Two haplotypes have very low number of reads, while haplotype 0 and haplotype 2 account for most of the reads. By inspecting them we observe two variants: at position 12 haplotype 0 has a
T, 2 has a
C, and at position 65 0 has a
C and 2 has a