Linking here:

Error formatting macro: incoming-links: java.lang.NullPointerException

no links exist

Recently updated pages

Navigate space

Child pages
  • Local analysis

[ShoRAH 0.5 docs]">[ShoRAH 0.5 docs]

Skip to end of metadata
Go to start of metadata

Jump to the examples: Roche/454 | Illumina

By local structure of the population we mean the distribution of haplotypes on the length of a single window (typically the length of the reads).
In many cases this is an already valuable information, e.g. with the current Roche/454 specifications, one read is long enough to cover the protease portion of the pol gene in HIV, an important target for anti-viral drugs.

ShoRAH is very reliable in detecting local population structure (see References).

Roche/454 data

Creating a multiple sequence alignment (MSA) from a fasta file

The first step for all types of analysis is producing a MSA of reads in our sample. We will use the sample file included in the ShoRAH distribution sample_454.fasta. We need a reference sequence against which align the reads, in fasta format as well: ref_genome.fasta. The program is the tool that aligns all reads to the reference (in both orientation, choosing the best one) and from the set of pairwise alignments builds a MSA, stored in a file with the extension .far.
Let's copy our two files from the shorah-0.4 directory to a new location: experiments/454_test, and build the MSA by invoking the program

Build the alignment

[user@host 454_test]$ cp path_to_sho_bin/shorah-0.4/sample_454.fasta ./
[user@host 454_test]$ cp path_to_sho_bin/shorah-0.4/ref_genome.fasta ./
[user@host 454_test]$ ls
ref_genome.fasta sample_454.fasta
[user@host 454_test]$ path_to_sho_bin/shorah-0.4/ -f sample_454.fasta -r ref_genome.fasta -o sample_454.far
[user@host 454_test]$ ls
ref_genome.fasta s2f.log sample_454.far sample_454.fasta tmp_align_f.needle tmp_align_r.needle

We have called followed by some options to produce sample_454.far, the MSA. Options for and for the other programs can be examined by calling it with -h or --help.

Correcting sequencing errors to locally reconstruct the population takes the MSA of all the reads, cuts it in windows of specific width and passes all the windows to diri_sampler. The output of this procedure is a set of several files that contain information on the local structure.
Depending on your sample, the options and your computer performance, this step might take several hours or more and produces a number of directories as shown below. Here we run over our sample file, choosing a window width equal to 90 base pairs (-w), alpha=1.0 (-a) and 1000 iterations (-j). Configurations are recorded for 100 iterations (-t). For the local analysis it's essential to give the option -k. In the next panel we run and list the directories produced that contain the results.

Error correction

[user@host 454_test]$ path_to_sho_bin/shorah-0.4/ -f sample_454.far -w 90 -a 1.0 -j 1000 -t 100 -k
[user@host 454_test]$ ls -lrhtp
... more files ...
drwxr-xr-x 13 user user 442B Jul 21 16:05 support/
drwxr-xr-x 13 user user 442B Jul 21 16:05 sampling/
drwxr-xr-x 13 user user 442B Jul 21 16:05 raw_reads/
drwxr-xr-x 13 user user 442B Jul 21 16:05 freq/
drwxr-xr-x 13 user user 442B Jul 21 16:05 debug/
drwxr-xr-x 13 user user 442B Jul 21 16:05 corrected/
drwxr-xr-x 10 user user 227K Jul 21 16:05 sample_454.cor.fas
... more files ...

The file sample_454.cor.fas contains the corrected reads, and it is the input for the following steps of Global analysis.
The directory support contains (compressed) files with the most important results for the local analysis. See Intermediate files for a description of the others.

Results are in support files

Files in the support directory are fasta files named according to the beginning and the end of the analysed window in the MSA. So, for example file support/w1-90.reads-support.fas.gz contains the results of the analysis on window from position 1 to position 90 on the MSA.

>hap_0|posterior=1 ave_reads=144.81
>hap_1|posterior=1 ave_reads=11.93
>hap_2|posterior=1 ave_reads=46
>hap_3|posterior=0.045 ave_reads=0.045

The field posterior is the quality of the reconstruction, while ave_reads is the number of reads associated, on average, to the reconstructed haplotype. In the example shown, our confidence in haplotype hap_0 is 100%, and the program detected an average of 144.8 reads associated to it. In a conservative approach, only haplotypes with posterior > 0.9 should be considered. Haplotype hap_3, for example, is only an artefact of the sampling procedure and should not be considered (quality 4.5%).


Rerunning on selected windows

When running with option -k, produces and retains several files in the subdirectory corrected/ of the form w1-99.reads-cor.fas.gz, where 1 and 99 are, respectively, the beginning and the end of the window under consideration. If the user wishes to rerun on selected windows only, one should simply delete the corresponding file in the directory corrected/.
NOTE: The error correction implemented in diri_sampler will not run for any window whose corresponding file is present in the directory corrected/.

Changing parameters on the fly

If the user wishes to change the number of iterations of diri_sampler (because the burn-in phase is shorter or longer than expected) one can write a file w1-99.iter (again 1 and 99 are the boundaries of the window) with a single line that contains the desired number of iterations. Similarly, one can change alpha by writing into a file w1-99.alpha a float with the new value of alpha. While alpha can be increased and decreased by editing the .alpha file without adversely affecting the functioning of the program (but obviously its result), we recommend changing the number of iterations at most once.

Maximum weight of read objects

The new implementation of diri_sampler clubs identical reads into objects defined by the read sequence and the number of reads in the object. This allows for faster computation, but might introduce a bias into the sampling procedure. One can limit the maximum number of identical reads in a single object by editing the variable LIMIT in dpm_src/dpm_sampler.h (default is 10000)

Output files

The program outputs several files. The file sample_cor.fas, which contains the corrected reads, is passed by shorah (or by the user) to the following step of the global analysis, also outputs a file called proposed.dat. It contains the number of proposed new clusters in each window. One should always make sure that a sufficient number of new cluster has been proposed (rule of thumb, at least 0.1 per step or more). When the program is run with the -k option, several directories are created. They contain zipped files that report information on the error correction procedure, as explained below.

support files

In order to make the analysis of the sampling process easier, two additional python programs are added: and In order to use all of their functionality, numpy and matplotlib are required ( and

Illumina data

The shorter reads obtained on Illumina platforms, while making the global analysis more challenging, can be used to analyse a very short region of the genome at very deep coverage. To this end we explain how to run local analysis on Illumina data in the page Illumina-like data.