Figure 1.
A: Profile HMM architecture used by HMMER3 [4], [5], [31]. Regions homologously aligned to the query are represented by a linear core model consisting of consensus positions (in this example,
), each consisting of a match, a delete, and an insert state (shown as boxes marked M, circles marked D, and diamonds marked I), connected by state transition probabilities (arrows). Match states carry position-specific emission probabilities for scoring residues at each consensus position. Insert states emit residues with emission probabilities identical to a background distribution. Additional flanking states (marked N, C, and J) emit zero or more residues from the background distribution, modeling nonhomologous regions preceding, following, or joining homologous regions aligned to the core model. Start (S), begin (B), end (E) and termination (T) states do not emit. B: The MSV profile is formed by implicitly treating all match-match transition probabilities as 1.0. This corresponds to the virtual removal of the delete and insert states. The rest of the profile parameterization stays the same. This model generates sequences containing one or more ungapped local alignment segments. Note that both models appear to be improperly normalized; for example, each match state in the MSV model has probability 1.0 local exit transition (orange arrows) in addition to the probability 1.0 match-match transition. This is because of a trick used to establish a uniform local fragment length distribution, in which these profiles are collapsed representations of a much larger (and properly normalized) “implicit probability model”, as explained in [31]. C: An example of what an alignment of a larger MSV profile (of length
) to a target sequence (of length
) might look like, as a path through a dynamic programming (DP) matrix. Here, the model identifies two high-scoring ungapped alignment segments (black dots, indicating residues aligned to profile match states), and assigns all other residues to N, J, and C states in the model (orange dots; unfilled indicates a “mute” nonemitting state or state transition). Note that the ungapped diagonals are not enforced to be consistent with a single gapped alignment.
Figure 2.
Illustration of striped indexing for SIMD vector calculations.
The top row (magenta outline) shows one row of the dynamic programming lattice for a model of length . Assuming an example of vectors containing
cells each, the 14 cells
are contained in
vectors numbered
. (Two unused cells, marked x, are set to a sentinel value.) In the dynamic programming recursion, when we calculate each new cell
in a new row
, we access the value in cell
in the previous row
. With striped indexing, vector
contains exactly the four
cells needed to calculate the four cells
in a new vector
on a new row of the dynamic programming matrix (turquoise outline). For example, when we calculate cells
in vector
, we access the previous row's vector
which contains the cells we need in the order we need them,
(dashed lines and box). If instead we indexed cells into vectors in the obvious way, in linear order (
in vector
and so on), there is no such correspondence of
with four
's, and each calculation of a new vector
would require expensive meddling with the order of cells in the previous row's vectors. With striped indexing, only one shift operation is needed per row, outside the innermost loop: the last vector on each finished row is rightshifted (mpv, in grey with red cell
indices) and used to initialize the next row calculation.
Figure 3.
MSV scores follow a predictable distribution.
A: example MSV score distributions for a typical Pfam model, CNP1, on random i.i.d. sequences of varying lengths from 25 to 25,600, with the shortest, typical, and longest lengths highlighted as red, black, and blue lines, respectively. The predicted distribution, following the procedure of [31] including an edge correction on the slope
, is shown in orange (though largely obscured by the data lines right on top of it). B: Histogram of maximum likelihood
values obtained from score distributions of 11,912 Pfam models, showing that most are tolerably close to the conjectured
, albeit with more dispersion for default entropy-weighted models (black line) than high relative entropy models without entropy-weighting (gray line). C: The observed fraction of nonhomologous sequences that pass the filter at a P-value of 0.02 should be 0.02. Histograms of the actual filter fraction for 11,912 different Pfam 24 models are shown, for a range of random sequence lengths from 25 to 25,600, for both default models (black lines) and high relative entropy models with no entropy weighting (gray lines).
Figure 4.
The HMMER3 acceleration pipeline.
Representative calculation speeds are shown in red, in units of millions of dynamic programming cells per second (Mc/s). Default P-value thresholds for MSV, Viterbi, and Forward filtering steps are shown in orange. The bias filter and the domain definition steps are not described in detail in this manuscript, and are shown in gray.
Figure 5.
Each point represents a speed measurement for one search with one query against target sequences (
for the slow HMMER2 and SAM programs,
for FASTA and SSEARCH), on a single CPU core (see Methods for more details). Both axes are logarithmic, for speed in millions of dynamic programming cells per second (Mc/s) on the y-axis and query length in residues on the x-axis. Panel A shows “typical best performance” speed measurements for several different programs including HMMER3, for 76 queries of varying consensus lengths, chosen from Pfam 24, for searches of randomized (shuffled) target sequences. Panel B shows a wider range of more realistic speed measurements for all 11,912 profiles in Pfam 24, on searches of real target protein sequences from UniProt TrEMBL.
Figure 6.
Benchmark of search sensitivity and specificity.
For different programs, searches are performed either by constructing a single profile from the query alignment (HMMER3, HMMER2, SAM, PSI-BLAST), or by using “family pairwise search” [41] in which each individual sequence is used as a query and the best E-value per target sequence is recorded (BLASTP, SSEARCH, FASTA). In each benchmark, true positive subsequences have been selected to be no more than 25% identical to any sequence in the query alignment (see Methods). Panel A shows results where nonhomologous sequence has been synthesized by a simple random model, and each true positive sequence contains a single embedded homologous subsequence (a total of 2,141 query multiple alignments, 11,547 true positive sequences, and 200,000 decoys). Panel B shows results where nonhomologous sequence is synthesized by shuffling randomly chosen subsequences from UniProt, and each true positive contains two embedded homologous subsequences (a total of 2,141 query alignments, 24,040 true positive sequences, and 200,000 decoys). The Y-axis is the fraction of true positives detected with an E-value better than the number of false positives per query specified on the X-axis.