Linking here:

Error formatting macro: incoming-links: java.lang.NullPointerException

no links exist

Recently updated pages

Navigate space

Child pages
  • Short-read data

[ShoRAH 0.5 docs]">[ShoRAH 0.5 docs]

Skip to end of metadata
Go to start of metadata

Aligning the reads

Storing alignment of vast number of reads in fasta format (as the program s2f.py in ShoRAH does) is largely inefficient, so that a new format has been proposed:

Many

to align next-generation sequencing reads against even complete genomes are available. In the following, we will how to map your reads to the reference and produce an alignment in SAM format. Finally, we will extract a single window from the SAM alignment and run diri_sampler on it.

Requirements

is a collection of tools to manipulate SAM files. We assume that you have this installed and that the executable samtools is in your PATH.

is a python module that works as wrapper for SAMtools.

Creating a SAM alignment with Novoalign

This section reports how to create a SAM alignment with

. We have a (large) fastq file from an Illumina run, s_1.fastq, coming from sequencing a set of amplicons of human genome. The list of amplicons (reference genomes) is in the file all_amplicons_bi.fasta.

Novoaling needs to work with an index for the references that can be created with Novoindex (shipped with Novoalign). Working in the directory sam_test, the command to create index file all_amplicons_bi.idx is listed below.

Create index from reference genome

[user@host sam_test]$ novoindex all_amplicons_bi.idx all_amplicons_bi.fasta
# novoindex (2.7) - Universal k-mer index constructor.
# (C) 2008 NovoCraft
# novoindex all_amplicons_bi.idx all_amplicons_bi.fasta
# Creating 9 indexing threads.
# Building with 6-mer and step of 1 bp.
# novoindex construction dT = 0.0s
# Index memory size 0.000Gbyte.
# Done.

Novoalign can now be run to map reads against the set of amplicons. The output is in SAM format. This step can take long, depending on the size of the reference and, above all, the number of reads.

Create SAM alignment

[user@host sam_test]$ novoalign -d all_amplicons_bi.idx -f s_1.fastq -o SAM > s_1.sam
# novoalign (2.07.00MT - Aug 5 2010 @ 18:45:42) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -d all_amplicons_bi.idx -f s_1.fastq -o SAM
# Interpreting input files as Sanger FASTQ.
# Index Build Version: 2.7
# Hash length: 6
# Step size: 1
# Read Sequences: 18155712
# Aligned: 15226108
# Unique Alignment: 15225411
# Gapped Alignment: 240857
# Quality Filter: 1697233
# Homopolymer Filter: 10161
# Elapsed Time: 511.765 (sec.)
# CPU Time: 30.3 (min.)
# Done.

Creating a sorted BAM alignment from a SAM alignment

As a preliminary step, the reference genome must be indexed with samtools.

Index the reference genome

[user@host sam_test]$ samtools faidx all_amplicons_bi.fasta
[user@host sam_test]$ ls all_amplicons_bi.fasta*
all_amplicons_bi.fasta all_amplicons_bi.fasta.fai

The following extract the sub alignments, sort them according to leftmost coordinate, write the output in BAM format and index the BAM file.

Sort SAM file

[user@host sam_test]$ samtools view -b -t all_amplicons_bi.fasta s_1.sam | samtools sort - s_1_all_amplicons
[samopen] SAM header is present: 11 sequences.
[bam_sort_core] merging from 8 files...
[user@host sam_test]$ samtools index s_1_all_amplicons.bam
[user@host sam_test]$ ls s_1_all_amplicons.*
s_1_all_amplicons.bam s_1_all_amplicons.bam.bai

Now the program included in ShoRAH bam2msa.py extracts a single window from the alignment of all reads. The following command extracts reads that map to chromosome 'VHL_Ex.3-3' (sequence name in the fasta file of the amplicons) from position 1541 to 1610 with a minimum overlap of 65 bases.

From BAM to MSA

[user@host sam_test]$ path_to_sho_bin/bam2msa.py -c 'VHL_Ex.3-3' -l 1541 -r 1610 -m 65 -o illu_ex_3.far s_1_all_amplicons.bam
[user@host sam_test]$ grep '>' illu_ex_3.far | wc -l
7326
[user@host sam_test]$ head illu_ex_3_.far
>HWUSI-EAS000_1:1:1:228:489 | NoInsertions
GGGATTACAGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTnnnnn
>HWUSI-EAS000_1:1:1:259:211 | NoInsertions
GGGATTACAGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTnnnnn
>HWUSI-EAS000_1:1:1:304:711 | NoInsertions
GGGATTACAGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTnnnnn
>HWUSI-EAS000_1:1:1:764:609 | NoInsertions
GGGATTACAGGCGCCTGCCACCNCGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTnnnnn

The output option (-o) had to be given immediately after the executable in previous release of the software. This constraint has now been removed. The output illu_win.far is a file of aligned reads in fasta format, which we can use for local analysis.

Local analysis

The resulting MSA file can be used as input for the program diri_sampler. Running as usual with the option -h gives a list of options. The examples runs the program on the file illu_win.far for 5000 iterations, 4000 of which are considered as burn-in, and 10 initial clusters. Each run of diri_sampler creates 5 files. The run can be followed by observing the smp file.

Run single window analysis

[user@host sam_test]$ path_to_sho_bin/diri_sampler -i illu_ex_3_.far -j 5000 -t 1000 -K 10 &
[user@host sam_test]$ tail -f illu_ex_3_.smp
8 44 6763 0.953735 0.558661
9 44 6917 0.954089 0.55774
10 44 6842 0.954702 0.55774
11 44 6987 0.95484 0.557432
12 44 6892 0.955058 0.557125
...
...
[user@host sam_test]$ ls illu*
illu_ex_3-cor.fas illu_ex_3.dbg illu_ex_3.far illu_ex_3-freq.csv illu_ex_3.smp illu_ex_3-support.fas

For an explanation of what is included in all these files, see Intermediate files. The most important for our current analysis is illu_ex_3-support.fas. Inspecting the head of the file we see the haplotypes with the highest support.

Results of local reconstruction

[user@host sam_test]$ head illu_ex_3-support.fas
>hap_0|posterior=1.002 ave_reads=3433.35
GGGATTACAGGTGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGCCATGT
>hap_1|posterior=1 ave_reads=21.432
GGGATTACAGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGGTTCGTCATGT
>hap_2|posterior=1 ave_reads=3805.49
GGGATTACAGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTCATGT
>hap_3|posterior=1 ave_reads=25.645
GGGATTACGGGCGCCTGCCACCACGCTGGCCAATTTTTGTACTTTTAGTAGAGACAGTGTTTCGTCATGT
>hap_4|posterior=0.862 ave_reads=0.862
GGGATTACACGCGCCTGCCACCACGCTGTCCAATTTTTCTACTTTTACTAGAGTCTGCGGCTCGTCATGT

We can discard haplotype number 4 and focus on the others. Two haplotypes have very low number of reads, while haplotype 0 and haplotype 2 account for most of the reads. By inspecting them we observe two variants: at position 12 haplotype 0 has a T, 2 has a C, and at position 65 0 has a C and 2 has a T.