Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.


Removal of domain overlaps for better FDR estimates.
Our q-values and lFDRs are only correct if they are computed after removing domain overlaps, otherwise noise is underestimated because the number of predictions is overestimated. We remove all overlaps between domains by only keeping those with the smallest p-values; while this worked well in practice, future research should address whether more powerful approaches to overlap removal exist.
Adaptation of q-value and lFDR methods for filtered p-values. Existing implementations of q-value (R package qvalue) and lFDR (R packages qvalue and fdrtool) require a complete list of pvalues (to estimate π 0 ), which HMMER3 does not provide [7]. We reimplemented the algorithms to overcome this problem, defining the necessary parameters in other ways while leaving most of the procedures unchanged. The number of tests N is the number of proteins for sequence p-values, and for domain p-values it is the number of proteins plus the additional number of domains predicted in each sequence when there are more than one. The implied value n(1)=N is added to ensure that q (t)≤1 and l FDR (t )≤1 . The π 0 cannot be estimated because the distribution of p-values near 1 is necessary [1][2][3]. We set π 0 =1 , a reasonable approximation since any given family is not present in the majority of proteins. Lastly, the constraints of eq. 1 are replaced with N t≤n(t ) . Beside these changes, q-values and lFDRs are estimated as described before.
Equal stratified lFDRs are also optimal when controlling the combined E-value. We want to maximize the number of predictions, as in the Results (consult for notation), but here we constrain the combined E-value to a maximum value of E, or ∑ i t i π 0,i N i ≤E , and the Lagrangian multiplier function is then where only Λ and λ differ from those of the FDR problem. A necessary condition for optimality is again showing that the lFDR of each stratum must be equal. Method rankings are preserved between E-value and FDR plots. We present the outline of the proof that method rankings are preserved in plots with the number of predictions on the y-axis and either the FDR or E-value on the x-axis. a) Definition. Given two methods A, B with functions f A , f B in the same domain X, A ranks Method A ranks higher than B with respect to FDRs if and only if A ranks higher than B with respect to E-values. c) Proof. First, observe that if both f A and f B are invertible and monotonically increasing, A ranks higher than B implies that the inverse of A ranks lower than the inverse of B. The FDR and Evalue functions are monotonically increasing, and for simplicity we assume them to be invertible (although real data is discrete, for a given number of predictions, an inverse function can be defined that returns the lowest FDR or E-value when there are ties, and we obtain a continuous function by interpolation). Thus, it suffices to show that rankings are preserved in the inverse functions when we go from FDRs to E-values.
For a given method A, its inverse FDR and E-value curves FDR A (n) and E A (n), which are functions of the number of predictions n, are related by FDR A (n)=E A (n)/ n . It follows that the ranking of the inverses of the two methods A and B is preserved, because FDR A (n)≥FDR B (n) ⇔ E A (n)≥E B (n) , which follows since n>0.
Tiered stratified q-value thresholds. Under a simple independence assumption, the final FDR of tiered q-values is approximately the sum of the per-tier q-value thresholds if both are small.
The final FDR tiered is characterized in terms of Bayesian posterior error rates [8], and the joint distribution of sequence q-values (q seq ) and domain q-values conditional on the sequence threshold (q dom|seq ). A significant domain prediction satisfies both thresholds Q dom and Q dom|seq , it is truly false if the null hypothesis is true for either the sequence-level (H seq =0) or domain-level (H dom =0) tests, and in this model the probability of these events is the same for every significant prediction [8]. Therefore, Above we assume that H dom and H seq are conditionally independent (eq. 2), and we use the fact that H seq is independent of the domain threshold (eq. 3). Thus, FDR tiered ≈Q seq +Q dom|seq for small per-tier thresholds, since Q seq Q dom|seq is negligible. The full equation is an approximation that breaks down in extreme cases when H dom and H seq are not conditionally independent (for example, Q seq =1 should simplify to single-tier domain thresholds and we should have FDR tiered =Q dom|seq , but these equations instead predict FDR tiered =1; also, Q seq =0 and Q dom|seq =0 separately give FDR tiered =Q dom|seq and FDR tiered =Q seq respectively, but FDR tiered should go to zero for both cases, in the limit that fewer predictions are made).
However, our assumption of conditional independence is reasonable for less extreme cases, and it performs well empirically.
Sequence datasets. Here UniProt [9] is version 2010_05 (11,384,036 proteins), which accompanies Pfam 25. UniRef50 is version 2011_04 (11 months newer than UniProt above; 3,865,311 proteins), and is obtained by filtering UniProt 2011_04 to 50% identity with CD-HIT [10]. OrthoMCL5 is the set of proteins of OrthoMCL [11] version 5 with pseudogenes removed, then each orthologous group was filtered to 50% identity with CD-HIT, removing proteins without domains matching across any ortholog with domain p≤1e-4 (see OrthoC in Methods) and singleton groups, leading to 379,010 proteins in 58,129 final orthologous groups.
Pfam. We use the 12,273 HMMs and curated thresholds ("Standard Pfam") of Pfam 25. However, we extend and update the Pfam clan definitions with those of Pfam 26 to improve the quality of our empirical FDR tests (particularly ClanOv). Since the Standard Pfam thresholds correspond to a fixed FDR, they are "extended" by shifting them by constant amounts as described previously [12], allowing us to explore a range of FDRs.
Permissive overlaps. To avoid domain loss due to errors in domain boundaries, we define overlaps "permissively": two domains overlap if the intersection of their ranges is larger than 40 amino acids or larger than 50% of the smallest of the two ranges [13].
Nested families. Two families in the "nesting" network are connected if their domains may overlap without removal of either, and both may be TPs in the ClanOv test. We allow all overlaps between these families, not just strict nesting. Our nested families are those that overlap with the Standard Pfam thresholds on UniProt and UniRef50, extended using clans: if A-B is in the network, and C and D are of the same clan as A and B, respectively, then C-D is added to the network.
Context families. Family pairs observed with the Standard Pfam thresholds without overlaps on UniProt and UniRef50 form our context network L, extended using clans: if A-B is in L, and C and D are in the same clan as A and B, respectively, then C-D is added to L.
Domain information content score based on the Gene Ontology. We reimplemented the MultiPfam2GO [14] procedure in-house to be more efficient. In summary, we computed statistical associations between Gene Ontology (GO) [15] terms version 2011-09-20 on UniProt and domain sets from Pfam 25 on UniProt. GO terms from all three standard ontologies were used. A naive Bayes classifier for each GO term uses a protein's domain content to predict whether the sequence is annotated by the term. The information content [16] of each GO term i is I i GO =−log 2 p i , where p i is the fraction of proteins with GO terms in UniProt that have GO term i. The "GO information content" of a domain architecture is the sum of I i GO of leaf GO terms only.
Ortholog Set Coherence (OrthoC). This test extends our previously-published "Ortholog Coherence" test [12]. Given a set of orthologs from OrthoMCL5 [11], a domain predicted in one sequence (after setting thresholds and removing overlaps) is a TP if any domain of the same clan is present anywhere in at least one other ortholog with p ≤1e-4; it is a FP otherwise (Fig A). To prevent high-identity sequence pairs from producing correlated FPs that supports each other and are scored as TPs, we used CD-HIT on each group of orthologs so that no two sequences had greater than 50% identity [10]. Lastly, proteins with every domain labeled FP at p ≤ 1e-4 are removed from the test (we assume they are false orthologs).
Second-order Markov random sequences (MarkovR). This test extends the standard random sequence null model, and is an improved version of our previously published test [12]. In summary, random sequences are constructed using a second-order Markov model derived from UniRef50, which generate null domains (Fig A). This random sequence model preserves local amino acid correlations and should better model low-complexity regions than the standard random sequence null model, which assumes independence between positions.
The m th order Markov model is parametrized by the equivalent k-mer distribution with k=m+1, which is obtained from sequences with these filters: we ignore the initial M (methionine) when present, ignore k-mers with ambiguous amino acid codes (B,J,X,Z), and replace the rare amino acids U and O with C and K, respectively. Random sequences are initialized with M and a random k-mer, and every following amino acid is chosen from the conditional distribution given the previous (k-1)-mer. The sequence terminates when the desired length is reached. Our RandProt 1.0 software, which generates these random protein sequences, is available at https://github.com/alexviiia/RandProt. Every real sequence is matched to a random sequence of the same length, and their domain predictions are combined with their ranges preserved. Domains from the real sequence are TPs, and from the random sequence are FPs. Each method produces predictions on this set by setting thresholds and removing overlaps.
We assume FPs arise in real sequences with equal frequency as in random sequences. However, every domain from a real sequence will be counted as a TP; we double the measured FDR (and Evalue) to correct for this effect. More formally, we measure quantities TP', FP', which are related to the true quantities by E[FP'] = FP/2 and E[TP'] = TP+FP/2, the second of which follows since the total prediction counts are fixed, that is TP'+FP'=TP+FP. Applying these transformations to the pFDR, treating the denominator as fixed, gives E[pFDR']=pFDR/2, and thus doubling is necessary.
Reverse Sequence (RevSeq). This test is based on the fact that the reverse of a protein sequence is very unlikely itself to be a functional protein, so it is a simple way of generating proteins without any true domains. Moreover, reversed sequences preserve repetitive and low-complexity regions, which are known to be problematic for homology prediction, so they may be used to generate more realistic null statistics than the standard null model (which generates repetitive and lowcomplexity regions more rarely) [17].
RevSeq has one problematic case: some domain families have truly symmetric subregions, which we do not wish to penalize. We address this problem by removing domain predictions from the reverse sequence that overlap (in the strict sense) predictions of the same family from the real sequence. Overlaps were determined by reversing the coordinates of the prediction from the reverse sequence, so that there is a subsequence that is predicting the same domain family in both the forward and reverse orientations. This filter removed 6.2% of the reverse sequence domain predictions. This FDR test proceeds identically to MarkovR, merging the domains from the real sequence with those from the reverse sequence (Fig A). The measured FDR is also adjusted by a factor of 2.

Supplementary results
The Standard Pfam thresholds in terms of stratified q-values and lFDRs. We want q-values and lFDR estimates that are comparable to the Standard Pfam bitscore thresholds. For each family, its p-values on UniRef50 are assigned q-values and lFDRs. The p-value of each family's bitscore threshold is given by the Extreme Value Distribution with parameters from each HMM file. Lastly, the bitscore is assigned a q-value and lFDR by linear interpolation.
It has been reported [18] that Pfam families with thresholds of 25 and 27 bits are the default threshold values ("uncurated"), and the rest of the families ("curated") tend to have higher p-value thresholds. We observed a sharp p-value distribution for these thresholds, which is bimodal for the full data, and unimodal for the "curated" subset (median p=1.3e-8), in agreement with previous results [18] (Fig N). The q-value distribution of these thresholds is less sharp, and changes less when excluding the "uncurated" subset, with a "curated" median q=4.2e-4.
The lFDR distribution has modes at 0 and 1 (Fig N). 16% of Pfam families have lFDR≥0.5 , so their worst predictions should probably be false. However, the Pfam thresholds of these families fell on a region of the p-value distribution where there were few domains identified due to overlap removal (data not shown). Thus, our density estimates for these thresholds were flawed. We believe these Standard Pfam thresholds have reasonable lFDRs but we are unable to compute them accurately. Threshold setting on observed data works well, since density estimation on the observed p-values is more accurate; only interpolation is a problem for thresholds of families with few or no smaller pvalues observed. When these undesirable thresholds are excluded, we find a median "curated" lFDR threshold of 2.3e-2.
Pfam threshold curation somewhat compensates for underestimated noise. Here we investigate to what extent the Standard Pfam curated thresholds compensate for underestimated FDRs. We compare the q-value distributions of the Standard Pfam thresholds for the increased-noise and asexpected noise families separately. We find that increased-noise families have more stringent thresholds than as-expected-noise families under the Standard Pfam, with median q-values differing by a factor of 19. However, there is no clear separation between both q-value distributions (Fig J). In particular, 15% of the increased-noise families have a q-value threshold larger than the median asexpected-noise threshold. Overall, while Pfam has more stringent thresholds for increased-noise families than for families with as-expected noise, many increased-noise family thresholds remain more permissive than they should be.
Evaluation of empirical FDR tests. ClanOv identified the most increased-noise families, and empirical FDRs agree best with q-values in as-expected-noise families, as desired (Fig K). The 27 reduced noise families also had the largest differences between q-values computed before and after overlap removal (data not shown), so this subset may be an artifact: for q-values we remove all overlaps, but in ClanOv only within-clan overlaps are removed, leading to underestimated FDRs.
ContextC also identified increased-noise families, since observed domain pairs contain strong biological information. Increased-noise families are usually correctly assigned, since true domain pairs are almost always observed [19]. However, this test may underestimate the FDR, since it had the most decreased-noise families. One explanation is that true FP domains may often be singletons, which will be labeled erroneously as TPs. Therefore, this test is useful to study increased-noise, but not decreasednoise, families.
OrthoC successfully identified increased-noise families. However, its empirical FDRs had the least agreement with theory (Fig 5), even in families with as-expected noise (Fig K). Inspection suggests the problems are due to false ortholog assignments and orthologs with differing domain architectures, both of which mislabel TPs as FPs. Removal of the families with the most extreme deviations (|LD|>2) in OrthoC did not lead to better agreement between the empirical FDR and qvalues (data not shown), so mislabeled domains did not concentrate in a few families. This test also has the smallest sample size (a third of UniRef50), since we only considered proteins with orthologs and with percent identities smaller than 50%. Therefore, OrthoC could be improved if ortholog predictions improve, if orthologs with differing architectures are identified and handled, and if the number of proteins increases.
RevSeq identifies many increased-noise families (Fig 7), and remarkably few decreased-noise families ( Table A). The simplicity of these "random" sequences is also very appealing. However, we needed to address the case of true symmetry in domains (Supp. Methods) in order to get accurate empirical FDRs.
The MarkovR random sequences are very similar to those that theoretically generate the pvalues: the latter is a 0 th order Markov model. Empirical FDRs and q-values agreed very well (Fig 5). A 4 th order Markov model behaves the same, but it risks including biological information (certain 5mers are predictive of domain families; data not shown). Therefore, higher-order Markov models are unlikely to be useful in the future.    The q-value equivalent of each Pfam sequence threshold was computed by using the relationships between bitscores, p-values and q-values observed on UniRef50. The blue and red q-value distributions (the density is of log 10 q) correspond to the "as-expected-noise" and "increased-noise" subsets of Pfam families, respectively (see Results). The dashed lines correspond to the medians of the distributions with matching colors. For visualization purposes, q-values smaller than 1e-10 were displayed as 1e-10.    Black histograms includes all Pfam families, while red "curated" excludes families with Pfam thresholds of 25 and 27 bits. Full lFDR histograms (note linear x-axis, other panels have log scales) contain values larger than 0.5 due to an artifact we are unable to rectify (Supp. Results), so the second histogram only includes lFDRs < 0.5 to find a stratified threshold that Pfam curators are comfortable with. Densities for panels with log scales are log 10 of each statistic. Histograms were truncated such that every p-value <1e-13 was set to that value, and similarly for q-values and lFDRs <1e-8.