The authors have declared that no competing interests exist.
Conceived and designed the experiments: JM SW JX. Performed the experiments: JM SW ZW. Analyzed the data: JM SW JX. Contributed reagents/materials/analysis tools: JM SW ZW. Wrote the paper: JM JX.
Sequence-based protein homology detection has been extensively studied, and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from the multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model), and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method, MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring the similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm for aligning two MRFs. Compared to an HMM, which can only model very short-range residue correlation, MRFs can model long-range residue interaction patterns and thus encode information about the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection should be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM- or PSSM-based methods in terms of both alignment accuracy and remote homology detection, and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at
Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods employ HMM-HMM comparison, which models a protein family using an HMM (Hidden Markov Model) and then detects homologs using HMM-HMM alignment. An HMM cannot model long-range residue interaction patterns and thus carries very little information regarding the global 3D structure of a protein family. As such, HMM comparison is not sensitive enough for distantly related homologs. In this paper, we present an MRF-MRF comparison method for homology detection. In particular, we model a protein family using Markov Random Fields (MRF) and then detect homologs by MRF-MRF alignment. Compared to HMMs, MRFs are able to model long-range residue interaction patterns and thus contain information about the overall 3D structure of a protein family. Consequently, MRF-MRF comparison is much more sensitive than HMM-HMM comparison. To implement MRF-MRF comparison, we have developed a new scoring function to measure the similarity of two MRFs and also an efficient ADMM algorithm to optimize the scoring function. Experiments confirm that MRF-MRF comparison indeed outperforms HMM-HMM comparison in terms of both alignment accuracy and remote homology detection, especially for mainly beta proteins.
This Methods article is associated with RECOMB 2014.
Sequence-based protein alignment and homology detection have been extensively studied and widely applied to many biological problems such as homology modeling
To significantly advance homology detection, this paper presents a Markov Random Fields (MRF) model of a multiple sequence alignment (MSA). Compared to HMMs, MRFs can model long-range residue interactions and thus encode information about the global 3D structure of a protein family. In particular, an MRF is a graphical model encoding a probability distribution over the MSA by a graph and a set of preset statistical functions. A node in the MRF corresponds to one column in the MSA, and an edge specifies the correlation between two columns. Each node is associated with a function describing the position-specific amino acid mutation pattern. Similarly, each edge is associated with a function describing correlated mutation statistics between two columns. With the MRF representation, alignment of two proteins or protein families becomes alignment of two MRFs. To align two MRFs, a scoring function, or alignment potential, is needed to measure their similarity. We use a scoring function consisting of both a node alignment potential and an edge alignment potential, which measure the node (i.e., amino acid) similarity and edge (i.e., interaction pattern) similarity, respectively.
It is computationally challenging to optimize a scoring function containing an edge alignment potential. To deal with this, we formulate the MRF-MRF alignment problem as an integer programming problem and then develop an ADMM (Alternating Direction Method of Multipliers) algorithm to solve it efficiently to a suboptimal solution. ADMM divides the MRF alignment problem into two tractable sub-problems and then iteratively solves them until they reach consistent solutions.
Experiments show that our MRF-MRF alignment method, denoted as MRFalign, can generate more accurate alignments and is also much more sensitive than others in detecting remote homologs. MRFalign works particularly well on mainly-beta proteins.
Cowen has developed a program SMURFLite for fold recognition based upon the MRF representation of a protein family
Quite a few PSSM-based profile comparison methods for homology detection have been developed, including
To train the node alignment potential, we constructed the training and validation data from SCOP70. The sequence identity of all the training and validation protein pairs is uniformly distributed between 20% and 70%. Further, two proteins in any pair are similar at superfamily or fold level. In total we use a set of 1400 protein pairs as the training and validation data, which covers 458 SCOP folds
The data used to test alignment accuracy has no fold-level overlap with the training and validation data. In particular, we use the following three datasets to test the alignment accuracy, which are subsets of the test data used in
Set3.6K: a set of 3617 non-redundant protein pairs. Two proteins in a pair share <40% sequence identity and have a small length difference. By “non-redundant” we mean that in any two protein pairs, there are at least two proteins (one from each pair) sharing less than 25% sequence identity.
Set2.6K: a set of 2633 non-redundant protein pairs. Two proteins in a pair share <25% sequence identity and have a length difference larger than 30%. This set is mainly used to test how well a method handles domain boundaries.
Set60K: a very large set of 60929 protein pairs, in most of which two proteins share less than 40% sequence identity. Of these, 846, 40902, and 19181 pairs are similar at the SCOP family, superfamily and fold level, respectively, and 151, 2691 and 2218 of these pairs, respectively, consist of only all-beta proteins.
We use the following benchmarks to test remote homology detection success rate.
SCOP20, SCOP40 and SCOP80, which are used by Söding group to study context-specific mutation score
We run PSI-BLAST with 5 iterations to detect sequence homologs and generate MSAs for the first three datasets. The MSA files for the three SCOP benchmarks are downloaded from the HHpred website (
To evaluate alignment accuracy, we compare our method, denoted as MRFalign, with sequence-HMM alignment method HMMER
Three performance metrics are used: reference-dependent alignment precision, alignment recall, and homology detection success rate. Alignment precision is defined as the fraction of aligned positions that are correctly aligned. Alignment recall is the fraction of alignable residues that are correctly aligned. Reference alignments are used to judge whether a residue is correctly aligned or alignable. To reduce bias, we use three very different structure alignment tools to generate reference alignments, including TM-align
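As a concrete reading of these definitions, the sketch below computes precision and recall for one predicted alignment against one reference alignment. It is a minimal illustration, not the evaluation code used in this work. The function name `alignment_precision_recall`, the representation of an alignment as (i, j) residue-index pairs, and the way the offset tolerance is applied (a predicted pair counts as correct if the reference aligns residue i to a position within `offset` of j) are all assumptions.

```python
def alignment_precision_recall(predicted, reference, offset=0):
    """Return (precision, recall) of a predicted alignment.

    predicted, reference: iterables of (i, j) pairs, meaning residue i of the
    first protein is aligned to residue j of the second protein.
    offset: tolerated shift in j (0 = exact match). Assumed convention.
    """
    ref_partner = dict(reference)   # i -> j according to the reference
    pred_pairs = list(predicted)

    # A predicted pair is correct if the reference aligns the same residue i
    # to a position at most `offset` away from the predicted j.
    correct = sum(
        1
        for i, j in pred_pairs
        if i in ref_partner and abs(ref_partner[i] - j) <= offset
    )

    # Precision: correct pairs over all predicted (aligned) pairs.
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    # Recall: correct pairs over all reference (alignable) pairs.
    recall = correct / len(ref_partner) if ref_partner else 0.0
    return precision, recall


predicted = [(0, 0), (1, 1), (2, 3), (3, 9)]
reference = [(0, 0), (1, 1), (2, 2), (4, 4), (5, 5)]
exact = alignment_precision_recall(predicted, reference)
relaxed = alignment_precision_recall(predicted, reference, offset=4)
```

In this toy case the pair (2, 3) is wrong under exact match but accepted once a 4-position shift is tolerated, which is why relaxed scores are never lower than exact ones.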
As shown in
Method | TMalign, exact match | TMalign, 4-offset | Matt, exact match | Matt, 4-offset | DeepAlign, exact match | DeepAlign, 4-offset |
HMMER | 22.9% | 26.5% | 24.1% | 27.4% | 25.5% | 28.1% |
HHalign | 36.3% | 39.1% | 37.0% | 42.1% | 38.4% | 42.8% |
MRFalign |
Three structure alignment tools (TMalign, Matt and DeepAlign) are used to generate reference alignments. “4-offset” means that an aligned position may deviate from the exact match by up to 4 positions. The bold indicates the best results.
Method | TMalign, exact match | TMalign, 4-offset | Matt, exact match | Matt, 4-offset | DeepAlign, exact match | DeepAlign, 4-offset |
HMMER | 36.5% | 42.6% | 38.6% | 44.0% | 40.4% | 45.0% |
HHalign | 62.5% | 66.1% | 63.2% | 66.2% | 64.0% | 66.7% |
MRFalign |
See
On the very large set Set60K, as shown in
Group | TMalign: HMMER | TMalign: HHalign | TMalign: MRFalign | Matt: HMMER | Matt: HHalign | Matt: MRFalign | DeepAlign: HMMER | DeepAlign: HHalign | DeepAlign: MRFalign |
Family | 57.4% | 69.2% | | 59.1% | 70.5% | | 63.2% | 72.6% | |
Superfamily | 31.2% | 42.0% | | 32.3% | 42.4% | | 32.8% | 49.4% | |
Fold | 1.3% | 7.0% | | 1.6% | 8.0% | | 2.0% | 8.7% | |
Family (beta) | 60.9% | 69.9% | | 64.0% | 75.1% | | 68.4% | 79.0% | |
Superfamily (beta) | 35.0% | 47.2% | | 37.0% | 50.2% | | 39.1% | 52.9% | |
Fold (beta) | 2.5% | 8.3% | | 3.0% | 9.1% | | 4.0% | 10.1% | |
The protein pairs are divided into 3 groups based upon the SCOP classification. The bold indicates the best results.
Our method outperforms HHalign and HMMER by ∼3% and ∼12%, respectively, at the family level; ∼7% and ∼19%, respectively, at the superfamily level; and ∼10% and ∼16%, respectively, at the fold level, regardless of reference alignments.
As shown in
Method | TMalign, exact match | TMalign, 4-offset | Matt, exact match | Matt, 4-offset | DeepAlign, exact match | DeepAlign, 4-offset |
HMMER | 29.3% | 34.1% | 29.6% | 34.7% | 31.5% | 35.6% |
HHalign | 35.9% | 39.4% | 36.2% | 39.4% | 37.2% | 41.7% |
MRFalign |
Three structure alignment tools (TMalign, Matt and DeepAlign) are used to generate reference alignments. “4-offset” means that an aligned position may deviate from the exact match by up to 4 positions. The bold indicates the best results.
Method | TMalign, exact match | TMalign, 4-offset | Matt, exact match | Matt, 4-offset | DeepAlign, exact match | DeepAlign, 4-offset |
HMMER | 48.0% | 50.1% | 48.2% | 50.3% | 51.4% | 54.8% |
HHalign | 57.1% | 59.9% | 57.3% | 60.0% | 58.3% | 61.4% |
MRFalign |
See
On the very large set Set60K, as shown in
Group | TMalign: HMMER | TMalign: HHalign | TMalign: MRFalign | Matt: HMMER | Matt: HHalign | Matt: MRFalign | DeepAlign: HMMER | DeepAlign: HHalign | DeepAlign: MRFalign |
Family | 63.1% | 63.9% | | 64.3% | 65.4% | | 68.4% | 69.2% | |
Superfamily | 38.7% | 39.5% | | 40.5% | 41.3% | | 43.2% | 44.3% | |
Fold | 4.2% | 7.4% | | 4.7% | 8.0% | | 5.4% | 8.2% | |
Family (beta) | 66.4% | 65.8% | | 67.4% | 68.1% | | 70.8% | 72.4% | |
Superfamily (beta) | 44.2% | 44.9% | | 45.4% | 46.2% | | 46.6% | 48.4% | |
Fold (beta) | 6.1% | 9.3% | | 6.7% | 9.2% | | 7.9% | 8.6% | |
The protein pairs are divided into 3 groups based upon the SCOP classification. The bold indicates the best results.
To evaluate homology detection rate, we employ three benchmarks SCOP20, SCOP40 and SCOP80 introduced in
As shown in
Method | SCOP20 Top1 | SCOP20 Top5 | SCOP20 Top10 | SCOP40 Top1 | SCOP40 Top5 | SCOP40 Top10 | SCOP80 Top1 | SCOP80 Top5 | SCOP80 Top10 |
hmmscan | 35.2% | 36.5% | 36.5% | 40.2% | 41.7% | 41.8% | 43.9% | 45.2% | 45.3% |
FFAS | 48.6% | 54.4% | 55.6% | 52.1% | 56.3% | 57.1% | 49.8% | 53.0% | 53.7% |
HHsearch | 51.6% | 57.3% | 59.2% | 55.8% | 60.8% | 62.4% | 56.1% | 60.1% | 61.8% |
HHblits | 51.9% | 56.3% | 57.5% | 56.0% | 59.8% | 60.9% | 59.2% | 62.5% | 63.3% |
MRFalign |
Method | SCOP20 Top1 | SCOP20 Top5 | SCOP20 Top10 | SCOP40 Top1 | SCOP40 Top5 | SCOP40 Top10 | SCOP80 Top1 | SCOP80 Top5 | SCOP80 Top10 |
hmmscan | 5.2% | 6.1% | 6.1% | 6.2% | 6.9% | 6.9% | 5.9% | 6.5% | 6.6% |
FFAS | 13.1% | 18.7% | 20.0% | 10.4% | 14.5% | 15.4% | 9.1% | 11.9% | 12.6% |
HHsearch | 16.3% | 24.7% | 28.6% | 17.6% | 25.3% | 29.1% | 15.4% | 21.9% | 25.0% |
HHblits | 17.4% | 25.2% | 27.2% | 19.1% | 26.0% | 28.2% | 18.4% | 25.0% | 27.0% |
MRFalign |
Similar to alignment accuracy, our method for homology detection also has a larger advantage on the beta proteins. In particular, as shown in
Method | SCOP20 Top1 | SCOP20 Top5 | SCOP20 Top10 | SCOP40 Top1 | SCOP40 Top5 | SCOP40 Top10 | SCOP80 Top1 | SCOP80 Top5 | SCOP80 Top10 |
hmmscan | 29.1% | 29.4% | 29.4% | 34.7% | 35.1% | 35.1% | 43.7% | 44.0% | 44.1% |
FFAS | 43.6% | 49.9% | 51.9% | 48.2% | 52.4% | 53.5% | 43.7% | 46.3% | 47.2% |
HHsearch | 48.2% | 54.6% | 56.9% | 52.0% | 56.9% | 59.1% | 47.7% | 51.8% | 53.7% |
HHblits | 47.5% | 52.1% | 53.7% | 51.4% | 54.8% | 56.6% | 52.9% | 54.6% | 57.8% |
MRFalign |
Method | SCOP20 Top1 | SCOP20 Top5 | SCOP20 Top10 | SCOP40 Top1 | SCOP40 Top5 | SCOP40 Top10 | SCOP80 Top1 | SCOP80 Top5 | SCOP80 Top10 |
hmmscan | 6.9% | 7.6% | 7.6% | 8.0% | 8.6% | 8.6% | 7.0% | 7.4% | 7.4% |
FFAS | 22.7% | 30.1% | 31.8% | 15.2% | 20.4% | 21.7% | 11.8% | 15.3% | 16.1% |
HHsearch | 24.4% | 34.7% | 38.8% | 26.8% | 37.7% | 41.6% | 19.1% | 26.8% | 29.5% |
HHblits | 24.1% | 33.3% | 34.8% | 26.9% | 35.3% | 37.1% | 24.7% | 34.1% | 35.5% |
MRFalign |
To evaluate the contribution of our edge alignment potential, we calculate the alignment recall improvement resulting from using edge alignment potential on two benchmarks Set3.6K and Set2.6K. As shown in
Alignment recall for the whole test sets
Potential | Set3.6K, exact match | Set3.6K, 4-position offset | Set2.6K, exact match | Set2.6K, 4-position offset |
Only with node potential | 44.7% | 48.6% | 68.6% | 71.8% |
Node + edge potential, no MI | 48.1% | 52.2% | 72.3% | 75.2% |
Node + edge potential with MI | 49.2% | 53.5% | 74.2% | 77.8% |
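Alignment recall on proteins with at least 256 non-redundant sequence homologs
Potential | 391 pairs in Set3.6K, exact match | 391 pairs in Set3.6K, 4-position offset | 509 pairs in Set2.6K, exact match | 509 pairs in Set2.6K, 4-position offset |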
Only with node potential | 59.5% | 63.4% | 71.3% | 75.8% |
Node + edge potential, no MI | 62.1% | 66.7% | 73.5% | 78.1% |
Node + edge potential with MI | 65.2% | 69.8% | 76.6% | 81.0% |
The structure alignments generated by DeepAlign are used as reference alignments.
The X-axis is the geometric mean of the two protein lengths in a protein pair. The Y-axis is the running time in seconds.
We conducted two experiments to show that our MRFalign is not overtrained. In the first experiment, we used 36 CASP10 hard targets as the test data. Our training set was built before CASP10 started, so there is no redundancy between the CASP10 hard targets and our training data. Using MRFalign and HHpred, respectively, we search each of these 36 test targets against PDB25 to find the best match. Since PDB25 does not contain proteins very similar to many of the test targets, we built a 3D model using MODELLER from the alignment between a test target and its best match and then measure the quality of the model. As shown in
One point represents two models generated by our method (x-axis) and HHpred (y-axis).
In the second experiment, we divide the proteins in SCOP40 into three subsets according to their similarity to the training data. We measure the similarity of a test protein to the training data by its best BLAST E-value. We use two values, 1e-2 and 1e-35, as the E-value cutoffs so that the three subsets have roughly the same size. As shown in
Method | Top1 (E-value<1e-35) | Top5 (E-value<1e-35) | Top10 (E-value<1e-35) | Top1 (1e-35<E-value<1e-2) | Top5 (1e-35<E-value<1e-2) | Top10 (1e-35<E-value<1e-2) | Top1 (E-value>1e-2) | Top5 (E-value>1e-2) | Top10 (E-value>1e-2) |
hmmscan | 5.0% | 5.6% | 5.6% | 7.3% | 7.9% | 7.9% | 6.4% | 7.3% | 7.4% |
FFAS | 10.3% | 14.5% | 15.8% | 9.7% | 12.9% | 13.5% | 11.6% | 16.5% | 17.5% |
HHsearch | 16.0% | 23.2% | 26.5% | 18.5% | 26.2% | 30.3% | 18.9% | 27.2% | 31.7% |
HHblits | 16.9% | 23.1% | 25.5% | 20.8% | 27.4% | 28.9% | 20.2% | 28.3% | 31.1% |
MRFalign |
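The three-way split of SCOP40 by best BLAST E-value used in this second experiment can be sketched as follows. This is only an illustration of the binning rule: the function names, the subset labels, the toy protein identifiers, and the handling of values that fall exactly on a cutoff are assumptions, and the E-values themselves would come from a separate BLAST run against the training data.

```python
from collections import defaultdict


def evalue_subset(best_evalue, strict_cutoff=1e-35, loose_cutoff=1e-2):
    """Assign a test protein to one of three similarity subsets.

    best_evalue: best BLAST E-value of the protein against the training data.
    Values exactly equal to a cutoff go to the less similar subset (assumed).
    """
    if best_evalue < strict_cutoff:
        return "high"      # E-value < 1e-35: very similar to training data
    if best_evalue < loose_cutoff:
        return "medium"    # 1e-35 <= E-value < 1e-2
    return "low"           # E-value >= 1e-2: no detectable similarity


def split_by_evalue(best_evalues):
    """Group protein identifiers by subset, given {protein: best E-value}."""
    subsets = defaultdict(list)
    for protein, evalue in best_evalues.items():
        subsets[evalue_subset(evalue)].append(protein)
    return dict(subsets)


# Hypothetical identifiers and E-values, for illustration only.
groups = split_by_evalue({"p1": 1e-50, "p2": 1e-10, "p3": 0.5, "p4": 3e-40})
```

Success rates are then reported separately per subset, so that performance on the "low" subset reflects proteins with no close relative in the training data.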
This paper has presented a new method for sequence-based protein homology detection that compares two protein sequences or families through alignment of two Markov Random Fields (MRFs), which model the multiple sequence alignment (MSA) of a protein family probabilistically using a general undirected graph. The MRF representation is better than the extensively used PSSM and HMM representations in that it can capture long-range residue interaction patterns, which reflect the overall 3D structure of a protein family. As such, MRF comparison is much more sensitive than HMM comparison in detecting remote homologs. This is validated by our large-scale experimental tests showing that MRF-MRF comparison can greatly improve alignment accuracy and remote homology detection over currently popular sequence-HMM, PSSM-PSSM, and HMM-HMM comparison methods. Our method also has a larger advantage over the others on mainly-beta proteins.
We build our MRF model of a protein family based upon a multiple sequence alignment (MSA) in the absence of native structures. The accuracy of the MRF model depends on the accuracy of the MSA. Currently we rely on the MSA generated by PSI-BLAST. In the future, we may explore better alignment methods for MSA building or even utilize solved structures of one or two protein sequences to improve the MSA. The accuracy of the MRF model parameters usually increases with the number of non-redundant sequence homologs in the MSA. As more and more protein sequences are generated by a variety of sequencing projects, we shall be able to build accurate MRFs for more and more protein families and thus detect their homologous relationships more accurately.
An accurate scoring function is essential to MRF-MRF comparison. Many different methods can be used to measure node and edge similarity of two MRFs, just as many different scoring functions can be used to measure the similarity of two PSSMs or HMMs. This paper presents only one of them. In the future we may explore more possibilities. It is computationally intractable to find the best alignment between two MRFs when edge similarity is taken into consideration. This paper presents an ADMM algorithm that can efficiently solve the MRF-MRF alignment problem to a suboptimal solution. However, this algorithm is currently about 10 times slower than the Viterbi algorithm for PSSM-PSSM alignment. Further tuning of this ADMM algorithm is needed for very large-scale homology detection.
Given a protein primary sequence, we run PSI-BLAST
We use two kinds of information in MRFs for their alignment. One is the occurring probability of 20 amino acids and gap at each node (i.e., each column in MSA), which can also be interpreted as the marginal probability at each node. The other is the correlation between two nodes, which can be interpreted as interaction strength of two MSA columns and calculated by several different ways. For example, we can use a contact prediction program such as PSICOV
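To make the two kinds of information concrete, the sketch below computes per-column occurrence frequencies over the 20 amino acids plus gap, and the mutual information between two MSA columns as one simple correlation measure. It is an illustration under stated assumptions rather than the procedure used in this work: the function names are hypothetical, the MSA is a plain list of equal-length strings, and no sequence weighting or pseudocounts are applied.

```python
import math
from collections import Counter

# 20 standard amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def column_marginals(msa, col):
    """Occurrence frequency of each of the 21 symbols in one MSA column.

    Symbols outside ALPHABET (e.g. 'X') are not reported but still count
    toward the number of sequences.
    """
    counts = Counter(seq[col] for seq in msa)
    n = len(msa)
    return {symbol: counts.get(symbol, 0) / n for symbol in ALPHABET}


def mutual_information(msa, col_a, col_b):
    """Mutual information (in nats) between two MSA columns."""
    n = len(msa)
    count_a = Counter(seq[col_a] for seq in msa)
    count_b = Counter(seq[col_b] for seq in msa)
    count_ab = Counter((seq[col_a], seq[col_b]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


coupled = ["AC", "AC", "GT", "GT"]      # columns vary together
independent = ["AC", "AT", "GC", "GT"]  # columns vary independently
mi_coupled = mutual_information(coupled, 0, 1)
mi_independent = mutual_information(independent, 0, 1)
```

Perfectly co-varying columns give the maximum value for two equally frequent symbols (ln 2), while independent columns give zero. With few sequences the raw estimate is noisy, which is one reason practical pipelines add weighting or corrections.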
Our scoring function for MRF-MRF alignment is a linear combination of node alignment potential and edge alignment potential with equal weight. Let
(A) Represented as a sequence of states. (B) Each alignment is a path in the alignment matrix.
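The equivalence between the two views in this figure can be sketched in a few lines: a sequence of states is walked from the top-left corner of the alignment matrix, and each state moves the path by one cell. The state names used here ("M" for a match, "Is" and "It" for a residue of the first or second protein aligned to a gap) and the 1-based coordinates are assumptions made for illustration.

```python
def states_to_path(states):
    """Convert a sequence of alignment states into a path of matrix cells.

    Returns a list of (i, j, state) triples, where i and j are the 1-based
    numbers of residues consumed from the first and second protein.
    """
    i = j = 0
    path = []
    for state in states:
        if state == "M":       # both residues advance: diagonal move
            i += 1
            j += 1
        elif state == "Is":    # residue of the first protein vs. a gap
            i += 1
        elif state == "It":    # residue of the second protein vs. a gap
            j += 1
        else:
            raise ValueError(f"unknown state: {state!r}")
        path.append((i, j, state))
    return path


def aligned_pairs(path):
    """Residue pairs (i, j) that the path aligns to each other."""
    return [(i, j) for i, j, state in path if state == "M"]


path = states_to_path(["M", "M", "Is", "M", "It"])
pairs = aligned_pairs(path)
```

Any score that decomposes over the vertices of such a path can then be accumulated with a single pass over the returned triples.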
Given an alignment path, its node alignment potential is the accumulative potential of all the vertices in the path. We use a Conditional Neural Fields (CNF)
Let
It calculates the similarity of two edges, one from each MRF, based upon the interaction strength of two ends in one edge. We can derive interaction strength from the parameters of the MRF model, but it is hard to validate if this interaction strength (or mutual information) is accurate or not even in the presence of native structures since we cannot directly measure interaction strength in a protein. Here we use inter-residue Euclidean distance, which can be measured more easily, to reflect interaction strength of two residues. Later in this section we will describe how to derive the distance probability distribution from the information (e.g., interaction strength) encoded in MRFs. Let
Now we explain how to calculate each term in
In
As mentioned before, an alignment can be represented as a path in the alignment matrix, which encodes an exponential number of paths. We can use a set of
It is computationally intractable to find the optimal solution of P1. Below we present an ADMM (Alternating Direction Method of Multipliers) method that can efficiently solve this problem to a suboptimal solution. See
Adding the constraint
Since both
The sub-problem SP1 optimizes the objective function with respect to
In summary, we solve P4 using the following procedure. Initialize Solve (SP1) first and then (SP2) using dynamic programming, each generating a feasible alignment. If the algorithm converges, i.e., the difference between
Due to the quadratic penalty term in P3 this ADMM algorithm usually converges much faster and also yields better solutions than without this term. Empirically, it converges within 10 iterations for most protein pairs. See
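For orientation, the general shape of such an ADMM scheme can be written in generic notation. The block below is a textbook-style sketch and not a reproduction of P1 to P4: the symbols $f$, $z$, $y$, $\lambda$ and $\rho$ are assumed names for the objective, the two copies of the binary alignment variables, the multipliers and the penalty weight.

```latex
% Problem with the alignment variables duplicated and tied by a constraint
\max_{z,\,y} \; f(z, y) \quad \text{s.t.} \quad z_k = y_k \;\; \forall k

% Augmented Lagrangian (maximization form) with quadratic penalty
L_\rho(z, y, \lambda) = f(z, y) + \sum_k \lambda_k (z_k - y_k)
                        - \frac{\rho}{2} \sum_k (z_k - y_k)^2

% One iteration: two sub-problems, then a multiplier update
z^{t+1} = \arg\max_{z} \; L_\rho(z, y^{t}, \lambda^{t})
y^{t+1} = \arg\max_{y} \; L_\rho(z^{t+1}, y, \lambda^{t})
\lambda^{t+1} = \lambda^{t} - \rho \, (z^{t+1} - y^{t+1})

% For binary variables the penalty is linear in each copy when the other is fixed
(z_k - y_k)^2 = z_k + y_k - 2 z_k y_k \qquad z_k, y_k \in \{0, 1\}
```

The last identity is what keeps each sub-problem tractable in this generic setting: with one copy fixed, the penalty only shifts the per-cell scores seen by the other copy, so a path-finding dynamic program can still be applied. The iteration stops once the two copies agree.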
A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5