Conceived and designed the experiments: SE. Performed the experiments: SE. Analyzed the data: SE. Contributed reagents/materials/analysis tools: SE. Wrote the paper: SE.

The author has declared that no competing interests exist.

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (

Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the

Sequence similarity searching was advanced by the introduction of probabilistic modeling methods, such as profile hidden Markov models (profile HMMs) and pair-HMMs

More sophisticated scoring models are desirable but not sufficient. It is also necessary to be able to determine the statistical significance of a score efficiently and accurately

The first problem is that Karlin/Altschul statistics only rigorously apply to scores of optimal

The second problem is that in terms of probabilistic inference, an optimal alignment score is not the score we should be calculating in a homology search. The quantity we want to calculate is the total log likelihood ratio for the target sequence(s) given an evolutionary model and a null hypothesis,

Here I test two conjectures about the expected distributions of scores for full probabilistic models: that optimal gapped alignment scores (Viterbi scores) follow Gumbel distributions with a constant

This work was done as part of a reimplementation of the HMMER profile HMM software package

Let us start with a definition of Viterbi and Forward scores in terms of probabilistic inference. We have a _{1}…_{L}

Typically, model

A model might require the complete query and target sequences to be aligned and homologous – a

The simplest random model

The likelihoods of

The

The logarithms may be taken to any base

The names Viterbi and Forward refer to the standard dynamic programming algorithms used to calculate these scores in the specific case of HMMs
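To make the distinction concrete, the following minimal sketch computes both scores for a toy two-state HMM. The model, its parameters, and all function names are illustrative assumptions; this is a generic HMM, not HMMER's profile architecture or code. Viterbi takes the maximum over state paths, Forward sums over them, and both are converted to log-odds bit scores against an i.i.d. null model.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-probabilities."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def hmm_log_scores(seq, start, trans, emit):
    """Return (viterbi, forward) natural-log joint probabilities of seq.

    Viterbi keeps only the single best state path (max);
    Forward sums over all state paths (log-sum-exp).
    """
    states = list(start)
    vit = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    fwd = dict(vit)
    for x in seq[1:]:
        new_vit, new_fwd = {}, {}
        for s in states:
            e = math.log(emit[s][x])
            new_vit[s] = max(vit[r] + math.log(trans[r][s]) for r in states) + e
            new_fwd[s] = logsumexp([fwd[r] + math.log(trans[r][s]) for r in states]) + e
        vit, fwd = new_vit, new_fwd
    return max(vit.values()), logsumexp(list(fwd.values()))

def bit_scores(seq, start, trans, emit, null_freq):
    """Log-odds Viterbi and Forward scores in bits against an i.i.d. null model."""
    v, f = hmm_log_scores(seq, start, trans, emit)
    null = sum(math.log(null_freq[x]) for x in seq)
    return (v - null) / math.log(2), (f - null) / math.log(2)

# Hypothetical toy model: state "H" prefers residue A, state "B" is background-like.
START = {"H": 0.5, "B": 0.5}
TRANS = {"H": {"H": 0.9, "B": 0.1}, "B": {"H": 0.1, "B": 0.9}}
EMIT = {"H": {"A": 0.7, "C": 0.3}, "B": {"A": 0.5, "C": 0.5}}
NULL = {"A": 0.5, "C": 0.5}

viterbi_bits, forward_bits = bit_scores("AACAAA", START, TRANS, EMIT, NULL)
```

Because the Forward score sums over every path, including the Viterbi path, it can never be lower than the Viterbi score for the same sequence.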

Traditional search algorithms report optimal alignment scores, so the Viterbi score is the probabilistic analog of traditional methods. However, from a probabilistic inference standpoint, the Forward score is what we want, because we are after the probability that sequence

Forward scores are not generally used in traditional sequence comparison, because they only make sense if individual alignments have probabilities

Local optimal alignment scores of random sequences (

In contrast to optimal alignment scores, the distribution of Forward scores is unknown. It has appeared “fat-tailed” relative to the high-scoring exponential tail of the Gumbel distribution of Viterbi scores

I made the following two conjectures about

The Gumbel distribution of Viterbi scores has a fixed

The high-scoring tail of Forward scores is exponentially distributed with the same
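The two conjectured forms can be sketched numerically as follows. This sketch assumes the shared slope is λ = log z for scores taken in base z (so λ = log 2 for bit scores), as the e^{−t log z} bounds discussed in this section suggest. The location μ, the offset τ, and the function names are hypothetical placeholders, not values or routines from HMMER.

```python
import math

def gumbel_survival(x, mu, lam):
    """P(S > x) for a Gumbel distribution with location mu and slope lam."""
    # -expm1(-y) evaluates 1 - exp(-y) without cancellation when y is tiny.
    return -math.expm1(-math.exp(-lam * (x - mu)))

def exponential_tail(x, tau, lam):
    """P(S > x) for an exponential tail with offset tau (meaningful for x >= tau)."""
    return math.exp(-lam * (x - tau))

# Assumed slope for scores in bits (log base z = 2).
LAMBDA_BITS = math.log(2)
```

Far above μ, the Gumbel survival function approaches exp(−λ(x − μ)), so under this assumption each additional bit of score roughly halves the P-value in both cases.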

These conjectures are based on three main lines of argument, two of which depend heavily on the work of Bundschuh and his collaborators.

First, for Viterbi scores, Bundschuh's “central conjecture” about the distribution of optimal gapped local alignment scores states that _{ab}_{ab}

Second, for Forward scores, Milosavljević proved in his “algorithmic significance” method that an upper bound for the distribution ^{−t log z}

Third, for Forward scores, Yu, Bundschuh, and Hwa argued by a different approach that the high-scoring tail ^{−t log z}, i.e. again, an exponential tail with

Additionally, one expects the high-scoring tail of Forward scores to approximate the high-scoring tail of Viterbi scores (so Gumbel-distributed Viterbi scores and exponential-tailed Forward scores would have the same

In practice, however, the simulation-calibrated

I modified HMMER's profile HMM architecture in several details, with the main goal of achieving a uniform query entry/exit distribution in local alignments. A uniform query entry/exit distribution means that for a query profile of

Besides HMMER's previous model, several other probabilistic local alignment models in the literature also imply nonuniform entry/exit distributions. For example, simple pair-HMMs for pairwise local sequence alignment imply a non-uniform (geometric) distribution over local alignment length, because they use a single residue alignment state with a self-loop and an exit probability
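The geometric length distribution implied by a single self-looping state can be illustrated with a short sketch. The exit probability values and function names here are hypothetical and are not taken from any particular pair-HMM.

```python
def geometric_length_pmf(p_exit, n):
    """P(length = n) for a state that leaves with probability p_exit per step.

    The state emits n residues by taking the self-loop n - 1 times
    (probability 1 - p_exit each) and then exiting once.
    """
    return (1.0 - p_exit) ** (n - 1) * p_exit

def expected_length(p_exit):
    """Mean of the geometric length distribution."""
    return 1.0 / p_exit
```

Whatever exit probability is chosen, the shortest alignment length is always the most probable one, which is why such a model cannot express a uniform preference over alignment lengths.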

The generative probabilistic model of local alignment that I intend to use in HMMER3 is illustrated in

(A) an example of a core model with five consensus positions. Each consensus position of the query is modeled by a

Therefore, the search profile is not probabilistic per se. It is a dynamic programming construct that calculates correct probabilities for the implicit probabilistic model. It uses entry probabilities of 2/
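As a minimal sketch of what uniformity implies, suppose every (start, end) pair of consensus positions in a query of M positions is equally likely, giving M(M+1)/2 possible local alignment segments. The entry probability at each position then follows by counting how many segments start there. The function name and the choice of exact fractions are illustrative assumptions, not HMMER3 code.

```python
from fractions import Fraction

def uniform_entry_probs(M):
    """Entry probability at positions k = 1..M when all M(M+1)/2
    (start, end) segments of the query are equally likely."""
    n_segments = Fraction(M * (M + 1), 2)
    # Exactly M - k + 1 segments start at position k (one per end j >= k).
    return [Fraction(M - k + 1) / n_segments for k in range(1, M + 1)]

# Five consensus positions, matching the size of the example core model.
entry = uniform_entry_probs(5)
```

Under this assumption, entries are not uniform over positions: early positions start more segments than late ones, and the probabilities still sum to one.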

The N, C, J state transitions, plus the self-loop transition in the null hypothesis HMM

How should the three target length model parameters be set? I will discuss the rationale in more detail in a later section, in the context of illustrative simulation results. For now I will just state that

Traditional sequence similarity search methods distinguish local, global, and glocal alignments, applying different alignment algorithms, while using the same scoring system. (A

In a probabilistic inference framework, these distinctions are not in the algorithm, but in the parameterization and architecture of the model _{1}…_{L}_{i}

Viterbi bit scores are predicted to be Gumbel distributed with parametric ^{5} i.i.d. random sequences of length 400 generated with the same residue frequencies as the null model

(A) A histogram showing the distribution of ^{5} i.i.d random sequences of length ^{8} i.i.d. random sequences of length

As examples, the top right of ^{8} random sequences). As “typical” models, I chose RRM_1 and Caudal_act from Pfam 22.0. The RRM_1 model is the RNA recognition motif, a ∼72 residue domain, chosen because it is one of the Pfam domains I am most familiar with. The Caudal_act domain is the activation domain of the Caudal-like homeobox transcription factors, chosen because it is literally typical for Pfam, being closest to the median of Pfam seed alignments in three different characteristics: number of seed sequences (Pfam 22.0 median = 9; Caudal_act = 9), model length (Pfam median = 147; Caudal_act = 147), and average pairwise identity (Pfam median 36%, Caudal_act = 37%). Both observed distributions show good agreement to the predicted Gumbel of

I examined outliers in ^{8} random

The low outlier DUF851 (and all other low outliers I examined) actually fits better visually to the conjectured

The high outlier Sulfakinin (and all other high outliers I examined) does show a higher

The Forward score distribution is predicted to converge to an exponential with

(A) a graph showing how ^{5} i.i.d random sequences of length ^{8} i.i.d. random sequences of length

The top right of ^{8}) simulations for the “typical” RRM_1 and Caudal_act Pfam models, showing that these fits are visually satisfactory.

In this case, the survey of 9,318 models has limited power to detect significant outliers. Even with

The lower right of ^{5} scores.

Some low outliers exhibit the same high-identity, discretized-scores, stairstepping-distribution artifact observed with the Viterbi low outliers (DUF851 for example; not shown), but this explanation does not seem reasonable for Ribosomal_L12, where the observed score distribution appears smooth. The Ribosomal_L12 discrepancy (

The high outlier XYPPX (and some other high outliers examined) remains a high ^{5} scores in the deeper tail). As with the Viterbi scores, XYPPX and these other high outliers are unusually small models (XYPPX is

So far, all target sequences have been a typical length of

For the old target length model parameterization in HMMER2 (

Log survival plots for multihit local Viterbi scores (left; [A,B,E,F]) and multihit local Forward scores (right; [C,D,G,H]) for the two “typical” models RRM_1 and Caudal_act, for ^{6} i.i.d. random sequences of various lengths, for either old HMMER2 scoring (top; [A–D]) or the new target length model in prototype HMMER3 (bottom; [E–H]). Eleven target sequence lengths are used, ranging from 25 to 25,600 in steps of two-fold, with

However, from a probabilistic inference standpoint, seeing the expected score increase with increasing target sequence length raises a red flag. The posterior probability

This concern becomes a practical issue when multihit local Forward score distributions are examined for models using the HMMER2 target length model, as shown in the top right of

A simple argument about the target length model appears to suffice to explain this behavior. Consider the length distribution generated by models

If we assume the length distribution of

Intuitively, this follows from the fact that the expected number of times that we include a J segment is

We can then approximate the component of the log-odds Forward score that is attributable to target length modeling alone:

In the case of unihit modes (

One way to “fix” this behavior would be to set

How can we set an uninformative target length model? One way to do this is to make the parameterization of models

Under this scheme, according to Equation 1, the length model is predicted to contribute a nearly constant score, independent of target sequence length

Target length independence is an important result. It not only means that single choices of location parameters

For the expected Gumbel distribution of local Viterbi scores, the location parameter
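When the slope λ is treated as known, fitting the Gumbel location reduces to a closed-form maximum likelihood estimate. Setting the derivative of the log likelihood with respect to μ to zero gives exp(−λμ) = (1/n) Σ exp(−λx_i). The sketch below implements that estimator and checks it on simulated scores. The sample size, seed, parameter values, and function names are assumptions made for illustration, not the settings used in HMMER3.

```python
import math
import random

def fit_gumbel_mu(scores, lam):
    """ML estimate of the Gumbel location mu when the slope lam is known."""
    # Solves exp(-lam * mu) = mean(exp(-lam * x)); the minimum score is
    # factored out so that every exponent is <= 0 and cannot overflow.
    shift = min(scores)
    mean_term = sum(math.exp(-lam * (x - shift)) for x in scores) / len(scores)
    return shift - math.log(mean_term) / lam

def sample_gumbel(mu, lam, n, seed=0):
    """Draw n Gumbel(mu, lam) variates by inverting the CDF."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:  # skip the endpoints, where the logs are undefined
            out.append(mu - math.log(-math.log(u)) / lam)
    return out

# Hypothetical check: recover a known location from simulated bit scores.
scores = sample_gumbel(mu=-8.0, lam=math.log(2), n=20000, seed=1)
mu_hat = fit_gumbel_mu(scores, math.log(2))
```

With λ fixed, only one parameter remains to be estimated, so far fewer simulated scores are needed than for a two-parameter Gumbel fit.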

It is more difficult to efficiently determine the location parameter

After unsuccessfully exploring several alternative approaches, I adopted the following

Using

For Karlin/Altschul statistics, the apparent

In most of the results in

Two different approaches have been developed for correcting for edge effect. One approach is to use corrected query and target sequence lengths

I experimented with setting an edge-corrected target length model such that the flanking nonhomology states generate

Applying a correction to

This is only an empirically derived correction. It appears to suffice in practice, but there is clearly more going on here. A more satisfying and theoretically grounded accounting for edge effects in probabilistic local alignment is needed.

In summary, the overall procedure for estimating the expected score distributions is to assume
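The final step of such a procedure, turning a score into an E-value and checking calibration against rank, can be sketched as follows. The sketch assumes a search of N independent target sequences, so that E = N × P(S > x), and uses a Gumbel P-value as for Viterbi scores. The function names and parameters are hypothetical.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(S > score) under a Gumbel distribution with location mu, slope lam."""
    return -math.expm1(-math.exp(-lam * (score - mu)))

def evalue(pvalue, n_targets):
    """Expected number of random targets scoring at least this well."""
    return n_targets * pvalue

def rank_vs_evalue(scores, mu, lam):
    """Pair each score's rank (1 = best) with its predicted E-value.

    For well-calibrated statistics on random sequences, the predicted
    E-value of the hit at rank r should be close to r.
    """
    n = len(scores)
    ranked = sorted(scores, reverse=True)
    return [(r, evalue(gumbel_pvalue(s, mu, lam), n)) for r, s in enumerate(ranked, 1)]
```

Plotting the second element of each pair against the first gives the kind of predicted E-value versus rank comparison used to judge accuracy.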

^{5} random sequences, of lengths ^{−4}. The minimum predicted E-values for each of the six searches (Forward vs. Viterbi, three choices of length) range from 2.2×10^{−4} down to 3.7×10^{−6}, basically within expectation (the 3.7×10^{−6} is significantly low, but just barely so;

Plots of predicted E-value versus actual rank, for multihit local Viterbi scores (A,C) and multihit local Forward scores (B,D), using models with either the standard profile HMM multinomial parameterization used in the rest of the paper (A,B) or “entropy-weighted” models of reduced information content (C,D). Each plotted point (open circles) is the mean of 9,318 profile HMM searches of ^{5} target sequences of three different target lengths:

Though statistically significant errors in E-value accuracy remain, for practical purposes they are tolerably small. Moreover, they are almost invariably in the conservative direction. That is, we would rather slightly underestimate

The most immediate benefit of this work is that, for profile HMM searches, the statistical significance of both Viterbi and Forward scores can be calculated efficiently without expensive simulation. This enables substantial accelerations in the use of Viterbi scores and, more importantly, opens the way to broader use of the more powerful Forward scores.

Although I have done the simulations in the specific context of HMMER, the local alignment model is not specific to HMMER. It is a generalized probabilistic local alignment model with a uniform entry/exit distribution. Because position-independent substitution matrix scores and gap costs are just a special case of position-specific profile scores, the same model can be used to parameterize standard Smith/Waterman local alignments

The same conjectures are also expected to hold for local alignment scores for probability models of more than just linear sequence alignment. For example, our preliminary results indicate that local alignment scores for profile stochastic context-free grammars (SCFGs; models of RNA structure and sequence) obey the same conjectures for both CYK and Inside scores (analogous to local Viterbi and Forward scores) (DL Kolbe and SRE, unpublished results), which should help in efficiently and accurately calculating E-values for profile SCFG searches for structural RNAs

However, at least three important points limit any conclusions I can try to draw about how widely the conjectures might hold.

First, the same conjectures ought to hold for glocal and global alignment models. Nothing in the conjectures' rationale required the probabilistic models ^{−t log z} exponential tails for Forward scores only for the smallest HMMs, the largest target sequences, and the most extreme tails

Second, if any probabilistic local alignment model

Third, it is trivial to produce an example of a probabilistic model

These issues show the main limitation of the simulation-based approach I have taken. Proper understanding of the regimes in which the conjectures break down requires a mathematical analysis, not simulations limited to a particular problem domain. Such analysis would be desirable, and it could lead in fruitful new directions. For example, the fact that HMMER3 glocal score distributions do appear to asymptote towards the conjectures (albeit not for a practical range of tail probability mass nor query and target lengths) seems promising. A general approach for estimating statistical significance of global or glocal gapped alignment scores, under traditional (arbitrary) scoring systems, largely remains elusive, despite significant effort and progress

Another problem that will need more attention is finite length effects. The finite length edge effect described for BLAST scores

This work was partly inspired by the work of Yu and Hwa, who described a “hybrid” (or “semi-probabilistic”) scoring method that gives Gumbel-distributed scores with

I have taken care to distinguish Viterbi from Forward scores, and local from glocal or global alignment modes, all of which are just choices in the same full probabilistic modeling framework. Some prior work has conflated probabilistic modeling and Forward scoring, referring to Forward scores as “probabilistic alignment scores” and arguing that probabilistic alignment scores do not follow Gumbel distributions as opposed to traditional alignment scores

Although most homology search methods are based on local alignment, our previous internal HMMER2 benchmarks and benchmarks of other methods

It is important to distinguish generative probabilistic models of local alignment from other “probabilistic” local alignment methods that apply renormalization and partition functions to interpret traditional arbitrary scores as unnormalized log-odds probabilities

A limitation of this work is that I have only examined scores of independent, identically distributed (i.i.d.) random sequences with a single typical amino acid composition. Real sequences often have biased residue composition, repetitive regions, and other heterogeneities that can produce spurious high-scoring alignments, requiring additional methods to compensate

From a purist Bayesian perspective

The HMMER3 prototype source code (together with Easel, a code library that HMMER depends on) is freely available at

In HMMER3's implementation, the local entry/exit distribution is in fact not completely uniform, for the following reason. Imagine (as an extreme illustration) a profile HMM with a “consensus” match state _{k}_{k}_{−1}→_{k}_{k}_{k}_{k}

It was necessary to implement HMMER3 dynamic programming routines as floating point calculations. In the target length model, a ratio like

All computational times mentioned in the paper were measured for a single execution thread on a 3.2 GHz Intel Xeon (Dempsey) CPU running Red Hat Enterprise Linux AS release 4, using prototype HMMER3 code compiled with the GNU C compiler (gcc) version 3.4.5 at the -O2 optimization level.

This work was initiated by many discussions, and was aided by some key contributions to HMMER's codebase. In particular, Bob Edgar (University of California, Santa Cruz) and Bill Bruno (Los Alamos) both pointed out flaws in the HMMER2 local entry/exit distribution; Steve Johnson (Washington University) implemented entropy weighting methods and benchmarked Forward scores in HMMER as part of his Ph.D. thesis work; Alex Coventry (Howard Hughes Medical Institute Janelia Farm) first suggested that the extreme Forward tail appeared to be exponential; and Jeremy Buhler and Christopher Swope (Washington University) contributed highly optimized implementations of the profile HMM Viterbi and Forward algorithms, which I adapted to the local model described here. I am especially indebted to Elena Rivas for her mathematical insights and forceful arguments.