ShoRAH is a software package that allows for inference about the structure of a population from a set of short sequence reads as obtained from ultra-deep sequencing of a mixed sample.
The package contains programs that support mapping of reads to a reference genome, correcting sequencing errors by locally clustering reads in small windows of the alignment, reconstructing a minimal set of global haplotypes that explain the reads, and estimating the frequencies of the inferred haplotypes.
Ideally, the whole program consists of four steps: alignment, error correction, global haplotype reconstruction and frequency estimation.
We are given a set of NGS reads (thousands or even millions of reads) and we have to build a MSA (multiple sequence alignment), necessary for the following steps. Classical off-the-shelf aligner as ClustalW or Muscle are not an option here, so we have to proceed differently. First we align all reads to a reference. Then, using all gaps found in the set of pairwise alignments, we build the MSA. This is performed by the program
Technical errors are corrected by grouping reads according to their similarity. Reads found in the same group are considered as originating from the same haplotype, and their differences removed (interpreted as coming from technical errors only). Their number gives a first estimate of the frequency of the reconstructed haplotype.
In many cases this information is already very valuable (see Local analysis).
mm.py builds a read graph, a representation in which reads are vertices in a graph, and edges connect all pairs of reads that overlap (the end of a read is equal to the beginning of another). In this representation, a global haplotype is a path; an ordered list of reads (vertices) connected by a series of edges. The program, following a parsimony principle, returns the minimal number of haplotypes that explain all reads (that is, the minimal number of paths covering all vertices).
The frequencies of the global haplotypes are estimated by the program
freqEst. It considers all reads corrected by
dec.py and all haplotypes reconstructed by
mm.py and, using the EM algorithm, finds the maximum likelihood estimate for the frequencies.
See also References