no links exist
By local structure of the population we mean the distribution of haplotypes on the length of a single window (typically the length of the reads).
In many cases this is an already valuable information, e.g. with the current Roche/454 specifications, one read is long enough to cover the protease portion of the pol gene in HIV, an important target for anti-viral drugs.
ShoRAH is very reliable in detecting local population structure (see References).
The first step for all types of analysis is producing a MSA of reads in our sample. We will use the sample file included in the ShoRAH distribution
sample_454.fasta. We need a reference sequence against which align the reads, in fasta format as well:
ref_genome.fasta. The program
s2f.py is the tool that aligns all reads to the reference (in both orientation, choosing the best one) and from the set of pairwise alignments builds a MSA, stored in a file with the extension
Let's copy our two files from the
shorah-0.4 directory to a new location:
experiments/454_test, and build the MSA by invoking the program
We have called
s2f.py followed by some options to produce
sample_454.far, the MSA. Options for
s2f.py and for the other programs can be examined by calling it with
dec.py takes the MSA of all the reads, cuts it in windows of specific width and passes all the windows to
diri_sampler. The output of this procedure is a set of several files that contain information on the local structure.
Depending on your sample, the options and your computer performance, this step might take several hours or more and produces a number of directories as shown below. Here we run
dec.py over our sample file, choosing a window width equal to 90 base pairs (
-w), alpha=1.0 (
-a) and 1000 iterations (
-j). Configurations are recorded for 100 iterations (
-t). For the local analysis it's essential to give the option
-k. In the next panel we run
dec.py and list the directories produced that contain the results.
sample_454.cor.fas contains the corrected reads, and it is the input for the following steps of Global analysis.
support contains (compressed) files with the most important results for the local analysis. See Intermediate files for a description of the others.
Files in the support directory are fasta files named according to the beginning and the end of the analysed window in the MSA. So, for example file
support/w1-90.reads-support.fas.gz contains the results of the analysis on window from position 1 to position 90 on the MSA.
posterior is the quality of the reconstruction, while
ave_reads is the number of reads associated, on average, to the reconstructed haplotype. In the example shown, our confidence in haplotype
hap_0 is 100%, and the program detected an average of 144.8 reads associated to it. In a conservative approach, only haplotypes with posterior > 0.9 should be considered. Haplotype
hap_3, for example, is only an artefact of the sampling procedure and should not be considered (quality 4.5%).
When running with option
dec.py produces and retains several files in the subdirectory
corrected/ of the form
w1-99.reads-cor.fas.gz, where 1 and 99 are, respectively, the beginning and the end of the window under consideration. If the user wishes to rerun
dec.py on selected windows only, one should simply delete the corresponding file in the directory
NOTE: The error correction implemented in diri_sampler will not run for any window whose corresponding file is present in the directory
If the user wishes to change the number of iterations of
diri_sampler (because the burn-in phase is shorter or longer than expected) one can write a file
w1-99.iter (again 1 and 99 are the boundaries of the window) with a single line that contains the desired number of iterations. Similarly, one can change alpha by writing into a file
w1-99.alpha a float with the new value of alpha. While alpha can be increased and decreased by editing the
.alpha file without adversely affecting the functioning of the program (but obviously its result), we recommend changing the number of iterations at most once.
The new implementation of
diri_sampler clubs identical reads into objects defined by the read sequence and the number of reads in the object. This allows for faster computation, but might introduce a bias into the sampling procedure. One can limit the maximum number of identical reads in a single object by editing the variable
dpm_src/dpm_sampler.h (default is 10000)
dec.py outputs several files. The file
sample_cor.fas, which contains the corrected reads, is passed by shorah (or by the user) to the following step of the global analysis,
dec.py also outputs a file called
proposed.dat. It contains the number of proposed new clusters in each window. One should always make sure that a sufficient number of new cluster has been proposed (rule of thumb, at least 0.1 per step or more). When the program is run with the
-k option, several directories are created. They contain zipped files that report information on the error correction procedure, as explained below.
In order to make the analysis of the sampling process easier, two additional python programs are added:
plot_stat.py. In order to use all of their functionality, numpy and matplotlib are required (http://www.scipy.org and http://matplotlib.sourceforge.net/).
The shorter reads obtained on Illumina platforms, while making the global analysis more challenging, can be used to analyse a very short region of the genome at very deep coverage. To this end we explain how to run local analysis on Illumina data in the page Illumina-like data.