MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

  • Jianzhu Ma,

    Affiliation Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America

  • Sheng Wang,

    Affiliation Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America

  • Zhiyong Wang,

    Affiliation Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America

  • Jinbo Xu

    j3xu@ttic.edu

    Affiliation Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America

Abstract

Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.

Author Summary

Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods employ HMM-HMM comparison, which models a protein family using an HMM (Hidden Markov Model) and then detects homologs using HMM-HMM alignment. An HMM cannot model long-range residue interaction patterns and thus carries very little information regarding the global 3D structure of a protein family. As such, HMM comparison is not sensitive enough for distantly related homologs. In this paper, we present an MRF-MRF comparison method for homology detection. In particular, we model a protein family using Markov Random Fields (MRF) and then detect homologs by MRF-MRF alignment. Compared to HMMs, MRFs are able to model long-range residue interaction patterns and thus contain information about the overall 3D structure of a protein family. Consequently, MRF-MRF comparison is much more sensitive than HMM-HMM comparison. To implement MRF-MRF comparison, we have developed a new scoring function to measure the similarity of two MRFs and also an efficient ADMM algorithm to optimize the scoring function. Experiments confirm that MRF-MRF comparison indeed outperforms HMM-HMM comparison in terms of both alignment accuracy and remote homology detection, especially for mainly beta proteins.

This Methods article is associated with RECOMB 2014.

Introduction

Sequence-based protein alignment and homology detection have been extensively studied and widely applied to many biological problems such as homology modeling [1]–[4], phylogeny inference [5]–[7] and protein function prediction [8]–[10]. Nevertheless, remote homology detection remains very challenging, especially for homologs with divergent sequences. So far the most sensitive method for homology detection is based upon comparison of protein sequence profiles, which are usually derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. That is, instead of aligning two primary sequences, homologs can be detected by aligning protein sequence profiles. To facilitate comparison and alignment, an MSA is usually represented as a position-specific scoring matrix (PSSM) [11] or an HMM (Hidden Markov Model) [12], [13]. An HMM is more sensitive than a PSSM because 1) an HMM contains position-specific gap information; and 2) an HMM also takes into account correlation among sequentially adjacent residues. Sequence signature libraries [14] and intermediate sequence based methods [15], [16] have also been developed to make use of the evolutionary information of a protein. All these methods are sensitive to close homologs, but not good enough for remote homologs. The main issue with existing profile-based methods is that they make use of only position-specific amino acid mutation patterns and very short-range residue correlation, but not long-range residue interaction. However, remote homologs may have very divergent sequences and be similar only at the level of (long-range) residue interaction patterns, which are not encoded in the currently popular PSSM or HMM models.

To significantly advance homology detection, this paper presents a Markov Random Fields (MRFs) model of a multiple sequence alignment (MSA). Compared to HMMs, MRFs can model long-range residue interactions and thus encode information about the global 3D structure of a protein family. In particular, an MRF is a graphical model encoding a probability distribution over the MSA by a graph and a set of preset statistical functions. A node in the MRF corresponds to one column in the MSA and one edge specifies correlation between two columns. Each node is associated with a function describing the position-specific amino acid mutation pattern. Similarly, each edge is associated with a function describing correlated mutation statistics between two columns. With the MRF representation, alignment of two proteins or protein families becomes that of two MRFs. To align two MRFs, a scoring function or alignment potential is needed to measure the similarity of two MRFs. We use a scoring function consisting of both a node alignment potential and an edge alignment potential, which measure the node (i.e., amino acid) similarity and edge (i.e., interaction pattern) similarity, respectively.

It is computationally challenging to optimize a scoring function containing an edge alignment potential. To deal with this, we formulate the MRF-MRF alignment problem as an integer programming problem and then develop an ADMM (Alternating Direction Method of Multipliers) algorithm to solve it efficiently to a suboptimal solution. ADMM divides the MRF alignment problem into two tractable sub-problems and then iteratively solves them until they reach consistent solutions.

Experiments show that our MRF-MRF alignment method, denoted as MRFalign, can generate more accurate alignments and is also much more sensitive than others in detecting remote homologs. MRFalign works particularly well on mainly-beta proteins.

Related work

Cowen has developed a program SMURFLite for fold recognition based upon the MRF representation of a protein family [17]. Nevertheless, our MRFalign method is significantly different from SMURFLite in two aspects: 1) SMURFLite builds an MRF based upon multiple structure alignment instead of multiple sequence alignment (MSA). As such, it cannot be applied to sequence-based homology detection in the absence of native structures. In contrast, our method builds MRFs purely based upon MSA and thus applies to sequence-based protein alignment and homology detection; and 2) SMURFLite can only align a single primary sequence to an MRF, while our method aligns two MRFs to yield higher sensitivity. This difference requires us to develop totally new methods to build MRFs from MSA, measure the similarity of two MRFs, and optimize the MRF-MRF alignment potential.

Quite a few PSSM-based profile comparison methods for homology detection have been developed, including [11], [18]–[23]. Some studies such as [20] also combine phylogeny information with PSSM-based profile comparison. Homology detection can also be done without aligning proteins. For example, we can represent a protein sequence or profile as a feature vector and then search for homologs by comparing feature vectors. Early methods such as [24] usually conduct straightforward comparison of feature vectors, but are not very sensitive [25]. Improvement in these alignment-free methods results from the application of discriminative learning approaches such as SVM–Fisher [26], SVM-pairwise [27], SVM with the spectrum kernel [28] and SVM with the mismatch kernel [29]. These SVM-based methods are reported to outperform the simple feature comparison methods. Compared to alignment-based homology detection, alignment-free methods are usually faster but less sensitive.

Results

Training and validation data

To train the node alignment potential, we constructed the training and validation data from SCOP70. The sequence identity of all the training and validation protein pairs is uniformly distributed between 20% and 70%. Further, two proteins in any pair are similar at superfamily or fold level. In total we use a set of 1400 protein pairs as the training and validation data, which covers 458 SCOP folds [30]. Five-fold cross validation is used to choose the hyper-parameter in our machine learning model. In particular, every time we choose 1000 out of the 1400 protein pairs as the training data and the remaining 400 pairs as the validation data such that there is no fold-level redundancy between the training and validation data. A training or validation protein has less than 400 residues and contains less than 10% of residues without 3D coordinates. The reference alignment for a protein pair is generated by a structure alignment tool DeepAlign [31]. Each reference alignment has fewer than 50 gap positions in the middle and the number of terminal gaps is less than 20% of the alignment length.

Test data

The data used to test alignment accuracy has no fold-level overlap with the training and validation data. In particular, we use the following three datasets to test the alignment accuracy, which are subsets of the test data used in [4] to benchmark protein modeling methods.

  1. Set3.6K: a set of 3617 non-redundant protein pairs. Two proteins in a pair share <40% sequence identity and have small length difference. By “non-redundant” we mean that in any two protein pairs, there are at least two proteins (one from each pair) sharing less than 25% sequence identity.
  2. Set2.6K: a set of 2633 non-redundant protein pairs. Two proteins in a pair share <25% sequence identity and have length difference larger than 30%. This set is mainly used to test how well a method handles domain boundaries.
  3. Set60K: a very large set of 60929 protein pairs, in most of which two proteins share less than 40% sequence identity. Meanwhile, 846, 40902, and 19181 pairs are similar at the SCOP family, superfamily and fold level, respectively, and 151, 2691 and 2218 pairs consist of only all-beta proteins, respectively.
We use the following benchmarks to test remote homology detection success rate.

  1. SCOP20, SCOP40 and SCOP80, which are used by the Söding group to study context-specific mutation scores [32]. They are constructed by filtering the SCOP database with a maximum sequence identity of 20%, 40% and 80%, respectively. In total they have 4884, 7088, and 9867 proteins, respectively, and 1281, 1806, and 2734 beta proteins, respectively.

We run PSI-BLAST with 5 iterations to detect sequence homologs and generate MSAs for the first three datasets. The MSA files for the three SCOP benchmarks are downloaded from the HHpred website (ftp://toolkit.genzentrum.lmu.de/pub/). Pseudocounts are used in building sequence profiles. Real secondary structure information is not used since this paper focuses on sequence-based homology detection.

Programs to compare

To evaluate alignment accuracy, we compare our method, denoted as MRFalign, with the sequence-HMM alignment method HMMER [12] and the HMM-HMM alignment method HHalign [13]. HMMER is run with the default E-value threshold (10.0). HHalign is run with the option “-mact 0.1”. To evaluate the performance of homology detection, we compare MRFalign with FFAS [11] (PSSM-PSSM comparison), hmmscan (sequence-HMM comparison), and HHsearch and HHblits [33] (HMM-HMM comparison). HHsearch and hmmscan use HHalign and HMMER, respectively, for protein alignment.

Evaluation criteria

Three performance metrics are used including reference-dependent alignment precision, alignment recall and homology detection success rate. Alignment precision is defined as the fraction of aligned positions that are correctly aligned. Alignment recall is the fraction of alignable residues that are correctly aligned. Reference alignments are used to judge if one residue is correctly aligned or alignable. To reduce bias, we use three very different structure alignment tools to generate reference alignments, including TM-align [34], Matt [35], and DeepAlign [31].
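The two alignment metrics can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper: it assumes an alignment is given as a list of `(i, j)` residue index pairs, treats every residue aligned in the reference as "alignable", and exposes a hypothetical `tolerance` argument for the relaxed variant in which a small offset from the exact match still counts as correct.

```python
def alignment_precision_recall(predicted, reference, tolerance=0):
    """Return (precision, recall) of a predicted alignment.

    predicted, reference: iterables of (i, j) aligned residue index pairs.
    tolerance: allowed offset in j from the reference partner (0 = exact match).
    """
    ref_partner = {i: j for i, j in reference}
    predicted = list(predicted)
    # A predicted pair is correct if residue i is aligned in the reference
    # and its predicted partner is within `tolerance` of the reference partner.
    correct = sum(
        1
        for i, j in predicted
        if i in ref_partner and abs(ref_partner[i] - j) <= tolerance
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ref_partner) if ref_partner else 0.0
    return precision, recall


# Toy example (made-up alignments).
reference = [(0, 0), (1, 1), (2, 2), (3, 4)]
predicted = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)]
exact = alignment_precision_recall(predicted, reference)
relaxed = alignment_precision_recall(predicted, reference, tolerance=4)
```

In the toy example three of the five predicted pairs match the reference exactly, so the exact-match precision is 3/5 and the recall is 3/4; with a tolerance of 4 the pair `(2, 3)` also counts.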

Reference-dependent alignment recall

As shown in Tables 1 and 2, our method MRFalign exceeds all the others regardless of the reference alignments on both datasets Set3.6K and Set2.6K. MRFalign outperforms HHalign by ∼10% on both datasets, and HMMER by ∼23% and ∼24%, respectively. If an offset of up to 4 positions from the exact match is allowed in calculating alignment recall, MRFalign outperforms HHalign by ∼11% on both datasets, and HMMER by ∼25% and ∼33%, respectively.

Table 1. Reference-dependent alignment recall on Set3.6K.

https://doi.org/10.1371/journal.pcbi.1003500.t001

Table 2. Reference-dependent alignment recall on Set2.6K.

https://doi.org/10.1371/journal.pcbi.1003500.t002

On the very large set Set60K, as shown in Table 3, our method outperforms the other two in each SCOP classification regardless of the reference alignments used. MRFalign is only slightly better than HHalign at the family level, which is not surprising since it is easy to align two closely related proteins. At the superfamily level, our method outperforms HHalign and HMMER by ∼6% and ∼18%, respectively. At the fold level, our method outperforms HHalign and HMMER by ∼7% and ∼14%, respectively.

Table 3. Reference-dependent alignment recall (exact match) on the large benchmark Set60K.

https://doi.org/10.1371/journal.pcbi.1003500.t003

Alignment recall for beta proteins.

Our method outperforms HHalign and HMMER by ∼3% and ∼12%, respectively, at the family level; ∼7% and ∼19%, respectively, at the superfamily level; and ∼10% and ∼16%, respectively, at the fold level, regardless of reference alignments.

Reference-dependent alignment precision

As shown in Tables 4 and 5, our method MRFalign exceeds all the others regardless of the reference alignments on both datasets Set3.6K and Set2.6K. MRFalign outperforms HHalign by ∼8% and ∼5%, respectively, and HMMER by ∼15% and ∼13%, respectively. If an offset of up to 4 positions from the exact match is allowed in calculating alignment precision, MRFalign outperforms HHalign by ∼8% and ∼9%, and HMMER by ∼14% and ∼18% on Set3.6K and Set2.6K, respectively.

Table 4. Reference-dependent alignment precision on Set3.6K.

https://doi.org/10.1371/journal.pcbi.1003500.t004

Table 5. Reference-dependent alignment precision on Set2.6K.

https://doi.org/10.1371/journal.pcbi.1003500.t005

On the very large set Set60K, as shown in Table 6, our method outperforms the other two in each SCOP classification regardless of the reference alignments used. At the family level, our method outperforms HHalign and HMMER by ∼3% and ∼4%, respectively. At the superfamily level, our method outperforms HHalign and HMMER by ∼4% and ∼5%, respectively. At the fold level, our method outperforms HHalign and HMMER by ∼5% and ∼8%, respectively.

Table 6. Reference-dependent alignment precision (exact match) on the large benchmark Set60K.

https://doi.org/10.1371/journal.pcbi.1003500.t006

Homology detection success rate

To evaluate homology detection rate, we employ three benchmarks SCOP20, SCOP40 and SCOP80 introduced in [32]. For each protein sequence in one benchmark, we treat it as a query, align it to all the other proteins in the same benchmark and then examine if those with the best alignment scores are similar to the query or not. We also conducted homology detection experiments using hmmscan, FFAS, HHsearch and HHblits with default options. The success rate is measured at the superfamily and fold levels, respectively. When evaluating the success rate at the superfamily (fold) level, we exclude those proteins similar to the query at least at the family (superfamily) level. For each query protein, we examine the top 1-, 5- and 10-ranked proteins, respectively.
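The evaluation protocol above can be sketched in a few lines. This is an illustrative reading, not the authors' script: the label layout `(fold, superfamily, family)`, the function name, and the choice to divide by all queries are assumptions made for the example.

```python
def top_k_success_rate(ranked_hits, labels, level, k):
    """Fraction of queries with a true homolog among the top-k ranked subjects.

    ranked_hits: dict mapping query id -> subject ids sorted by decreasing score.
    labels: dict mapping protein id -> (fold, superfamily, family) identifiers.
    level: "superfamily" or "fold".
    """
    depth = {"fold": 1, "superfamily": 2}[level]
    successes = 0
    for query, hits in ranked_hits.items():
        q = labels[query]
        # Exclude subjects that are similar to the query at the next closer
        # level (family when scoring superfamily, superfamily when scoring fold).
        kept = [h for h in hits if labels[h][: depth + 1] != q[: depth + 1]]
        # Success if any of the top-k remaining subjects shares the level.
        if any(labels[h][:depth] == q[:depth] for h in kept[:k]):
            successes += 1
    return successes / len(ranked_hits)


# Toy example with made-up identifiers.
labels = {
    "q": ("f1", "s1", "a"),
    "x": ("f1", "s1", "a"),  # same family as q
    "y": ("f1", "s1", "b"),  # same superfamily, different family
    "z": ("f1", "s2", "c"),  # same fold, different superfamily
    "w": ("f2", "s3", "d"),  # unrelated
}
ranked_hits = {"q": ["x", "w", "y", "z"]}
sf_top1 = top_k_success_rate(ranked_hits, labels, "superfamily", 1)
sf_top5 = top_k_success_rate(ranked_hits, labels, "superfamily", 5)
fold_top1 = top_k_success_rate(ranked_hits, labels, "fold", 1)
fold_top5 = top_k_success_rate(ranked_hits, labels, "fold", 5)
```

In the toy ranking the family-level hit `x` is removed before scoring at the superfamily level, so the unrelated protein `w` becomes the first-ranked subject and the query only succeeds once more than one top hit is examined.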

As shown in Table 7, tested on SCOP20, SCOP40 and SCOP80 at the superfamily level, our method MRFalign succeeds on ∼6%, ∼4% and ∼4% more query proteins than HHsearch, respectively, when only the first-ranked proteins are considered. As shown in Table 8, at the fold level, MRFalign succeeds on ∼11%, ∼11% and ∼12% more proteins than HHsearch, respectively, when only the first-ranked proteins are evaluated. At the superfamily level, SCOP20 is more challenging than the other two benchmarks because it contains fewer proteins similar at this level. Nevertheless, at the fold level, SCOP80 is slightly more challenging than the other two benchmarks, perhaps because it contains many more irrelevant proteins and thus the chance of ranking false positives at the top is higher.

Table 7. Homology detection performance at the superfamily level.

https://doi.org/10.1371/journal.pcbi.1003500.t007

Table 8. Homology detection performance at the fold level.

https://doi.org/10.1371/journal.pcbi.1003500.t008

Similar to alignment accuracy, our method for homology detection also has a larger advantage on the beta proteins. In particular, as shown in Table 9, tested on SCOP20, SCOP40 and SCOP80 at the superfamily level, MRFalign succeeds on ∼7%, ∼5% and ∼7% more proteins than HHsearch, respectively, when only the first-ranked proteins are evaluated. As shown in Table 10, at the fold level, MRFalign succeeds on ∼13%, ∼16% and ∼17% more proteins than HHsearch, respectively, when only the first-ranked proteins are evaluated. Note that in this experiment only the query proteins are mainly-beta proteins; the subject proteins can be of any type. If we restrict the subject proteins to only beta proteins, the success rate increases further due to the reduction of false positives.

Table 9. Homology detection performance for mainly beta proteins at the superfamily level.

https://doi.org/10.1371/journal.pcbi.1003500.t009

Table 10. Homology detection performance for mainly beta proteins at the fold level.

https://doi.org/10.1371/journal.pcbi.1003500.t010

Contribution of edge alignment potential and mutual information

To evaluate the contribution of our edge alignment potential, we calculate the alignment recall improvement resulting from using edge alignment potential on two benchmarks Set3.6K and Set2.6K. As shown in Table 11, our edge alignment potential can improve alignment recall by 3.4% and 3.7%, respectively. When mutual information is used, we can further improve alignment recall by 1.1% and 1.9% on these two sets, respectively. Mutual information is mainly useful for proteins with many sequence homologs since it is close to 0 when there are few sequence homologs. As shown in Table 11, if only those proteins with at least 256 non-redundant sequence homologs are considered, the improvement resulting from mutual information is ∼3%.

Table 11. Contribution of edge alignment potential and mutual information, measured by alignment recall improvement on two benchmarks Set3.6K and Set2.6K.

https://doi.org/10.1371/journal.pcbi.1003500.t011

Running time

Figure 1 shows the running time of MRFalign with respect to protein length. As a control, we also show the running time of the Viterbi algorithm, which is used by our ADMM algorithm to generate an alignment at each iteration. As shown in this figure, MRFalign is no more than 10 times slower than the Viterbi algorithm. To speed up homology detection, we first use the Viterbi algorithm to perform an initial search without considering edge alignment potential and keep only the top 200 proteins, which are then realigned and reranked by our MRFalign method. Therefore, although MRFalign may be much slower than the Viterbi algorithm on a single alignment, empirically we can do homology search only slightly slower than with the Viterbi algorithm alone.
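The filter-then-rerank strategy can be sketched generically as below. The two scoring callables are placeholders (hypothetical names) standing in for the fast node-only Viterbi score and the full score with edge potential; only the shortlist size of 200 comes from the text.

```python
def two_stage_search(query, database, fast_score, full_score, keep=200):
    """Rank database entries against a query using a cheap filter first.

    fast_score, full_score: callables (query, subject) -> float, higher is better.
    keep: number of top candidates from the fast stage passed to the slow stage.
    """
    # Stage 1: score every subject with the cheap function and keep the best.
    shortlist = sorted(
        database, key=lambda s: fast_score(query, s), reverse=True
    )[:keep]
    # Stage 2: rescore only the shortlist with the expensive function.
    return sorted(shortlist, key=lambda s: full_score(query, s), reverse=True)


# Toy example with integers standing in for proteins.
database = list(range(10))
ranking = two_stage_search(
    5,
    database,
    fast_score=lambda q, s: -abs(q - s),  # placeholder for the fast score
    full_score=lambda q, s: s,            # placeholder for the full score
    keep=3,
)
```

With this design the expensive scorer is called at most `keep` times per query, so the total cost is dominated by the fast stage when the database is much larger than the shortlist.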

Figure 1. Running time of the Viterbi algorithm and our ADMM algorithm.

The X-axis is the geometric mean of the two protein lengths in a protein pair. The Y-axis is the running time in seconds.

https://doi.org/10.1371/journal.pcbi.1003500.g001

Is our MRFalign method overtrained?

We conducted two experiments to show that our MRFalign is not overtrained. In the first experiment, we used 36 CASP10 hard targets as the test data. Our training set was built before CASP10 started, so there is no redundancy between the CASP10 hard targets and our training data. Using MRFalign and HHpred, respectively, we search each of these 36 test targets against PDB25 to find the best match. Since PDB25 does not contain proteins very similar to many of the test targets, we built a 3D model using MODELLER from the alignment between a test target and its best match and then measure the quality of the model. As shown in Figure 2, MRFalign can yield much better 3D models than HHsearch for most of the targets. This implies that our method can generalize well to the test data not similar to the training data.

Figure 2. The model quality, measured by TM-score, of our method and HHpred for the 36 CASP10 hard targets.

One point represents two models generated by our method (x-axis) and HHpred (y-axis).

https://doi.org/10.1371/journal.pcbi.1003500.g002

In the second experiment, we divide the proteins in SCOP40 into three subsets according to their similarity with all the training data. We measure the similarity of one test protein with all the training data by its best BLAST E-value. We used two values, 1e-2 and 1e-35, as the E-value cutoffs so that the three subsets have roughly the same size. As shown in Table 12, the advantage of our method in remote homology detection over HHpred is roughly the same across the three subsets. Since HHpred is an unsupervised algorithm, this implies that the performance of our method is not correlated with the test-training similarity. Therefore, it is unlikely that our method is overfitted to the training data.

Table 12. Fold recognition rate of our method on SCOP40, with respect to the similarity (measured by E-value) between the test data and the training data.

https://doi.org/10.1371/journal.pcbi.1003500.t012

Discussion

This paper has presented a new method for sequence-based protein homology detection that compares two protein sequences or families through alignment of two Markov Random Fields (MRFs), which model the multiple sequence alignment (MSA) of a protein family using an undirected general graph in a probabilistic way. The MRF representation is better than the extensively-used PSSM and HMM representations in that the former can capture long-range residue interaction pattern, which reflects the overall 3D structure of a protein family. As such, MRF comparison is much more sensitive than HMM comparison in detecting remote homologs. This is validated by our large-scale experimental tests showing that MRF-MRF comparison can greatly improve alignment accuracy and remote homology detection over currently popular sequence-HMM, PSSM-PSSM, and HMM-HMM comparison methods. Our method also has a larger advantage over the others on mainly-beta proteins.

We build our MRF model of a protein family based upon multiple sequence alignment (MSA) in the absence of native structures. The accuracy of the MRF model depends on the accuracy of the MSA. Currently we rely on the MSA generated by PSI-BLAST. In the future, we may explore better alignment methods for MSA building or even utilize solved structures of one or two protein sequences to improve the MSA. The accuracy of the MRF model parameters usually increases with the number of non-redundant sequence homologs in the MSA. As more and more protein sequences are generated by a variety of sequencing projects, we shall be able to build accurate MRFs for more and more protein families and thus detect their homologous relationships more accurately.

An accurate scoring function is essential to MRF-MRF comparison. Many different methods can be used to measure node and edge similarity of two MRFs, just as many different scoring functions can be used to measure the similarity of two PSSMs or HMMs. This paper presents only one of them. In the future we may explore more possibilities. It is computationally intractable to find the best alignment between two MRFs when edge similarity is taken into consideration. This paper presents an ADMM algorithm that can efficiently solve the MRF-MRF alignment problem to a suboptimal solution. However, this algorithm is currently about 10 times slower than the Viterbi algorithm for PSSM-PSSM alignment. Further tuning of this ADMM algorithm is needed for very large-scale homology detection.

Methods

Modeling Multiple Sequence Alignment (MSA) using Markov Random Fields (MRF)

Given a protein primary sequence, we run PSI-BLAST [36] with 5 iterations and E-value cutoff 0.001 to find its sequence homologs. PSI-BLAST also generates an MSA of the sequence homologs. Let $X_i$ be a finite discrete random variable representing the amino acid at column i in the MSA, taking values from 1 to 21, corresponding to 20 amino acids and gap. Then we can use a multivariate random variable $X = (X_1, X_2, \ldots, X_N)$, where N is the number of columns, to model the MSA. We use an MRF to define the probability distribution of X. An MRF is an undirected graph that can be used to model a set of correlated random variables. As shown in Fig. 3, an MRF node represents one column in the MSA and an edge represents the correlation between two columns i and k when their sequence separation $|i-k|$ is sufficiently large. We ignore very short-range correlation (i.e., small $|i-k|$) since it is not very informative. The MRF consists of two types of functions: $\phi(X_i)$ and $\psi(X_i, X_k)$, where $\phi(X_i)$ is an amino acid preference function for node i and $\psi(X_i, X_k)$ is a pairwise amino acid preference function for edge (i, k) that reflects interaction between two nodes. Then, the probability of observing a particular protein sequence X can be calculated as follows.

$$P(X) = \frac{1}{Z} \prod_{i=1}^{N} \phi(X_i) \prod_{(i,k)} \psi(X_i, X_k) \qquad (1)$$

where Z is the normalization factor.
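As a rough illustration of this factorization, the sketch below evaluates the unnormalized log-probability of one sequence, i.e. the sum of log node functions and log edge functions without the normalization factor Z. The array layout, the function name and the 0-based state encoding (0 to 20 instead of 1 to 21) are assumptions made for the example.

```python
import numpy as np


def unnormalized_log_prob(seq, node_logphi, edge_logpsi):
    """Log of the product of node and edge functions for one sequence.

    seq: length-N integer array, one state (0..20) per MSA column.
    node_logphi: (N, 21) array, log node function per column and state.
    edge_logpsi: dict mapping an edge (i, k) to a (21, 21) array of
        log edge function values for the states at columns i and k.
    """
    total = sum(node_logphi[i, a] for i, a in enumerate(seq))
    total += sum(m[seq[i], seq[k]] for (i, k), m in edge_logpsi.items())
    return float(total)


# Toy model: 3 columns, uniform node functions, one edge between columns 0 and 2.
node_logphi = np.log(np.full((3, 21), 1.0 / 21))
edge = np.zeros((21, 21))
edge[1, 2] = 0.5  # this state pair is favoured on edge (0, 2)
edge_logpsi = {(0, 2): edge}
seq = np.array([1, 0, 2])
score = unnormalized_log_prob(seq, node_logphi, edge_logpsi)
```

Computing Z exactly requires summing over all 21^N sequences, which is why the sketch stops at the unnormalized score.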

Figure 3. Model a multiple sequence alignment (left) by a Markov Random Fields (right).

https://doi.org/10.1371/journal.pcbi.1003500.g003

We use two kinds of information in MRFs for their alignment. One is the occurrence probability of the 20 amino acids and gap at each node (i.e., each column in the MSA), which can also be interpreted as the marginal probability at each node. The other is the correlation between two nodes, which can be interpreted as the interaction strength of two MSA columns and can be calculated in several different ways. For example, we can use a contact prediction program such as PSICOV [37] or PhyCMAP [38] for this purpose. PSICOV assumes that $P(X)$ is a Gaussian distribution function and calculates the correlation between two columns from the inverse covariance matrix. PhyCMAP takes sequence information (including mutual information) as input and predicts the probability of two residues forming a contact, which can be used to indicate the interaction strength of two columns. However, since it takes time to run these programs, in the current implementation we calculate the mutual information (MI) of two columns and its power series as the interaction strength. That is, we use $MI, MI^2, \ldots, MI^{11}$ to quantify all the pairwise interaction strengths, where MI is the mutual information matrix. The MI power series is much more informative than the MI alone, as tested in our contact prediction program PhyCMAP.
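A minimal sketch of the mutual information features follows. It is not the authors' implementation: it uses raw column frequencies without pseudocounts or sequence weighting, zeroes the diagonal, and assumes that the powers of MI are matrix powers of the column-by-column MI matrix. The dense four-dimensional joint table is only practical for small inputs.

```python
import numpy as np


def mutual_information_matrix(msa, n_states=21):
    """Pairwise mutual information between MSA columns.

    msa: (n_sequences, n_columns) integer array with values 0..n_states-1.
    Returns an (n_columns, n_columns) symmetric matrix with zero diagonal.
    """
    n_seq = msa.shape[0]
    onehot = np.eye(n_states)[msa]                 # (n_seq, n_col, n_states)
    freq = onehot.mean(axis=0)                     # single-column frequencies
    joint = np.einsum("sia,skb->ikab", onehot, onehot) / n_seq
    indep = np.einsum("ia,kb->ikab", freq, freq)   # product of marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(joint > 0, joint * np.log(joint / indep), 0.0)
    mi = terms.sum(axis=(2, 3))
    np.fill_diagonal(mi, 0.0)                      # ignore self-information
    return mi


def mi_power_series(mi, max_power=11):
    """Return [MI, MI^2, ..., MI^max_power] using matrix multiplication."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ mi)
    return powers


# Toy MSA: columns 0 and 1 co-vary perfectly, column 2 is constant.
msa = np.array([[0, 0, 5], [1, 1, 5], [0, 0, 5], [1, 1, 5]])
mi = mutual_information_matrix(msa)
features = mi_power_series(mi)
```

In the toy MSA the first two columns share log 2 nats of information and the constant column shares none, which matches the remark that MI is close to 0 when there is little sequence variation to learn from.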

Scoring function for the alignment of two Markov Random Fields (MRFs)

Our scoring function for MRF-MRF alignment is a linear combination of node alignment potential and edge alignment potential with equal weight. Let T and S denote two MRFs for the two proteins under consideration. There are three possible alignment states $M$, $I_T$ and $I_S$, where $M$ represents two nodes being aligned, $I_T$ denotes an insertion in T (i.e., one node in T is not aligned), and $I_S$ denotes an insertion in S (i.e., one node in S is not aligned). As shown in Fig. 4, each alignment can be represented as a path in an alignment matrix, in which each vertex can be exactly determined by its position in the matrix and its state. For example, the first vertex in the path can be written as (0, 0, dummy), and each later vertex is written in the same way from its position and state. Therefore, we can write an alignment as a set of triples, each of which has the form $(i, j, u)$, where $(i, j)$ represents the position and $u$ the state.
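The triple representation can be illustrated with a small converter from a sequence of states to path vertices. The state labels `"M"`, `"IT"` and `"IS"` and the convention that a vertex records the position reached after consuming the state are assumptions made for this sketch; only the starting vertex (0, 0, dummy) is taken from the text.

```python
def path_to_triples(states):
    """Convert a sequence of alignment states into (i, j, u) path vertices.

    states: iterable of "M" (match), "IT" (node of T unaligned) or
        "IS" (node of S unaligned).
    """
    i = j = 0
    triples = [(0, 0, "dummy")]
    for u in states:
        if u == "M":       # consume one node from each MRF
            i, j = i + 1, j + 1
        elif u == "IT":    # consume one node from T only
            i += 1
        elif u == "IS":    # consume one node from S only
            j += 1
        else:
            raise ValueError(f"unknown state: {u!r}")
        triples.append((i, j, u))
    return triples


triples = path_to_triples(["M", "IT", "M", "IS"])
```

Under these conventions the final vertex gives the number of nodes consumed from T and S, so a complete alignment of MRFs with N and L nodes ends at position (N, L).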

Figure 4. Representation of protein alignment.

(A) Represented as a sequence of states. (B) Each alignment is a path in the alignment matrix.

https://doi.org/10.1371/journal.pcbi.1003500.g004

Node alignment potential.

Given an alignment path, its node alignment potential is the accumulated potential of all the vertices in the path. We use a Conditional Neural Fields (CNF) [39] method, which is very similar to what is described in the protein threading paper [40], to estimate the occurrence probability of an alignment and then derive the node alignment potential from this CNF. Briefly speaking, we estimate the occurrence probability of an alignment A between T and S as follows.

$$P(A \mid T, S) = \frac{\exp\left(\sum_{(i,j,u) \in A} E_u(T_i, S_j)\right)}{Z(T, S)} \qquad (2)$$

where $Z(T, S)$ is a normalization factor summing over all the possible alignments between T and S, and $E_u$ is a neural network with one hidden layer that calculates the log-likelihood of a vertex $(i, j, u)$ in the alignment path, where i is a node in T, j a node in S, and u a state. When u is a match state, $E_u$ takes as input the sequence profile context of two nodes i and j, denoted as $T_i$ and $S_j$, respectively, and yields the log-likelihood of these two nodes being matched. When u is an insertion state, it takes as input the sequence profile context of one node and yields the log-likelihood of this node being an insertion. The sequence profile context of node i is a $21 \times (2w+1)$ matrix, where $w$ is the half-width of the context window, consisting of the marginal probability of 20 amino acids and gap at the nodes indexed by $i-w, \ldots, i+w$. In case one column does not exist (when its index falls before the first or after the last column), zero is used. We train the parameters in $E_u$ by maximizing the occurrence probability of a set of reference alignments, which are generated by a structure alignment tool DeepAlign [31]. That is, we optimize the model parameters so that the structure alignment of one training protein pair has the largest probability among all possible alignments. A regularization factor on the norm of the model parameters, whose weight is determined by 5-fold cross validation, is used to restrict the search space of model parameters to avoid over-fitting. See the paper [40] for more technical details.
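The sequence profile context, a window of profile columns centered on a node with zeros where a column does not exist, can be extracted as in the sketch below. The window half-width `w` is a free parameter here and the function name is hypothetical; no particular window size is implied.

```python
import numpy as np


def profile_context(profile, i, w):
    """Return the zero-padded profile window centered at column i.

    profile: (N, 21) array of marginal probabilities (20 amino acids + gap).
    i: 0-based column index of the node.
    w: half-width of the window, so the result has 2*w + 1 columns.
    """
    n_cols, n_states = profile.shape
    context = np.zeros((n_states, 2 * w + 1))
    for offset in range(-w, w + 1):
        col = i + offset
        if 0 <= col < n_cols:      # columns outside the MSA stay zero
            context[:, offset + w] = profile[col]
    return context


# Toy profile with 3 columns; values are only placeholders.
profile = np.arange(3 * 21, dtype=float).reshape(3, 21)
ctx = profile_context(profile, 0, 1)
```

For a match state the network input would combine the contexts of both nodes, while for an insertion state a single context is used, as described above.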

Let $\theta^u_{i,j}$ denote the potential of a vertex $(i, j, u)$ in the alignment path. We calculate $\theta^u_{i,j}$ from the output of the neural network $E_u$ on the sequence profile contexts $T_i$ and $S_j$ as follows.

$$\theta^u_{i,j} = E_u(T_i, S_j) - \mathrm{Exp}(E_u) \qquad (3)$$

where $\mathrm{Exp}(E_u)$ is the expected value of $E_u$. It is used to offset the effect of the background, which is the log-likelihood yielded by $E_u$ for any randomly chosen node pairs (or nodes). Once $E_u$ is determined, we can approximate its expected value by sampling. That is, we sample tens of thousands of node pairs (or nodes) from the training data, feed their sequence information into $E_u$ and then calculate the average output of $E_u$ as its expected value.
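The background correction amounts to estimating a mean by sampling and subtracting it from each raw score. The sketch below shows this with a generic scoring callable standing in for the trained network; the function names, the number of draws and the fixed seed are illustrative assumptions.

```python
import random


def estimate_background(score_fn, samples, n_draws=10000, seed=0):
    """Approximate the expected score by averaging over random samples.

    score_fn: callable mapping the features of a node pair (or node) to a score.
    samples: list of feature objects drawn from the training data.
    """
    rng = random.Random(seed)
    draws = [score_fn(rng.choice(samples)) for _ in range(n_draws)]
    return sum(draws) / len(draws)


def node_potential(score_fn, features, background):
    """Score of one vertex after subtracting the background expectation."""
    return score_fn(features) - background


# Toy example: the "network" is the identity and all samples score 4.0.
background = estimate_background(lambda x: x, [4.0], n_draws=100)
potential = node_potential(lambda x: x, 5.0, background)
```

After the correction, a vertex whose raw score equals the average over random node pairs contributes zero to the alignment score.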

Edge alignment potential.

The edge alignment potential measures the similarity of two edges, one from each MRF, based upon the interaction strength of the two ends of each edge. We could derive interaction strength from the parameters of the MRF model, but it is hard to validate whether this interaction strength (or mutual information) is accurate even in the presence of native structures, since interaction strength cannot be measured directly in a protein. Here we use inter-residue Euclidean distance, which can be measured more easily, to reflect the interaction strength of two residues. Later in this section we describe how to derive the distance probability distribution from the information (e.g., interaction strength) encoded in MRFs. Let $d^T_{ik}$ denote the Euclidean distance between the two residues at $i$ and $k$ in $T$; $d^S_{jl}$ is defined similarly for $S$. Note that $d^T_{ik}$ and $d^S_{jl}$ are unknown since this paper studies sequence-based homology detection in the absence of native structures. Let $\theta_{i,j,k,l}$ denote the alignment potential between edge $(i,k)$ in $T$ and edge $(j,l)$ in $S$, i.e., when $i$ is aligned to $j$ and $k$ to $l$. We calculate $\theta_{i,j,k,l}$ as follows.

$$\theta_{i,j,k,l} = \sum_{d^T_{ik},\, d^S_{jl}} p(d^T_{ik} \mid c_i, c_k, m_{ik})\; p(d^S_{jl} \mid c_j, c_l, m_{jl})\; \log \frac{P(d^T_{ik}, d^S_{jl})}{P_{ref}(d^T_{ik})\, P_{ref}(d^S_{jl})} \qquad (4)$$

where $p(d^T_{ik} \mid c_i, c_k, m_{ik})$ is the probability of two nodes $i$ and $k$ in $T$ interacting at distance $d^T_{ik}$; $p(d^S_{jl} \mid c_j, c_l, m_{jl})$ is the probability of two nodes $j$ and $l$ in $S$ interacting at distance $d^S_{jl}$; $P(d^T_{ik}, d^S_{jl})$ is the probability of one distance $d^T_{ik}$ being aligned to another distance $d^S_{jl}$ in reference alignments; and $P_{ref}(d^T_{ik})$ ($P_{ref}(d^S_{jl})$) is the background probability of observing $d^T_{ik}$ ($d^S_{jl}$) in a protein structure.
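One way to read the four terms of Eq. (4) is as an expected log-odds score over discretized distances: the two predicted distance distributions weight the log-ratio between the aligned-distance probability and the product of the background probabilities. The sketch below assumes exactly this form and assumes all inputs are already discretized into the same bins; the function and argument names are hypothetical.

```python
import math


def edge_potential(p_t, p_s, p_aligned, p_ref):
    """Expected log-odds of aligning an edge of T with an edge of S.

    p_t[a]          : predicted probability that the T edge falls in bin a
    p_s[b]          : predicted probability that the S edge falls in bin b
    p_aligned[a][b] : probability of bin a being aligned to bin b
    p_ref[a]        : background probability of bin a
    """
    total = 0.0
    for a, pa in enumerate(p_t):
        for b, pb in enumerate(p_s):
            log_odds = math.log(p_aligned[a][b] / (p_ref[a] * p_ref[b]))
            total += pa * pb * log_odds
    return total
```

Under this reading the potential is positive when the two edges are likely to sit at distances that co-occur in reference alignments more often than chance, and zero when aligned distances are statistically independent.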

Now we explain how to calculate each term in Eq. (4). $P_{ref}(d)$ can be calculated by simple counting on a set of non-redundant protein structures, e.g., PDB25. Similar to $P_{ref}(d)$, $P(d^T_{ik}, d^S_{jl})$ can also be calculated by simple counting on a set of non-redundant reference alignments. That is, we randomly choose a set of protein pairs such that the two proteins in each pair are similar at least at the fold level. Then we generate their reference alignments (i.e., structure alignments) using the structure alignment tool DeepAlign [31] and finally do simple counting to estimate $P(d^T_{ik}, d^S_{jl})$. In order to use simple counting, we discretize inter-residue distance into 12 intervals: <4 Å, 4–5 Å, 5–6 Å, …, 14–15 Å, and >15 Å.
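The counting step can be sketched as follows. The bin boundaries are taken literally from the list above (4 Å, 5 Å, …, 15 Å); taken literally, these boundaries yield 13 bin labels, so the exact binning used by the authors may differ slightly. The handling of distances that fall exactly on a boundary is also an assumption, as are all names.

```python
import bisect

# Assumed boundaries in angstroms: 4, 5, ..., 15 (read off the listed intervals).
EDGES = [4.0 + k for k in range(12)]


def distance_bin(d):
    """Map a distance to a bin index: 0 for <4, 1 for 4-5, ..., last for >15."""
    return bisect.bisect_right(EDGES, d)


def count_frequencies(distances):
    """Estimate a background distribution over bins by simple counting."""
    counts = [0] * (len(EDGES) + 1)
    for d in distances:
        counts[distance_bin(d)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

The same counting applies to the aligned-distance table: each aligned residue-pair contributes one count to the cell indexed by the two bin labels.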

In Eq. (4), $p(d^T_{ik} \mid c_i, c_k, m_{ik})$ is the probability of two nodes $i$ and $k$ interacting at distance $d^T_{ik}$, conditioned on information derived from the MRF. $p(d^S_{jl} \mid c_j, c_l, m_{jl})$ is defined similarly. Meanwhile, $c_i$ and $c_k$ are the sequence profile contexts of the two nodes $i$ and $k$, respectively, and $m_{ik}$ represents the mutual information and its power series (or interaction strength) between these two nodes. The sequence profile context of node $i$ is a $21 \times (2w+1)$ matrix, where $w$ is the half-width of the context window, consisting of the probabilities of the 20 amino acids and the gap at the $2w+1$ nodes indexed by $i-w, \ldots, i, \ldots, i+w$. If a column does not exist (when $i \le w$ or $i+w$ exceeds the number of nodes), zeros are used. We predict $p(d^T_{ik} \mid c_i, c_k, m_{ik})$ using a probabilistic neural network (PNN) implemented in our context-specific distance-dependent statistical potential package EPAD [41]. EPAD takes as input sequence contexts and mutual information and then yields the inter-residue distance probability distribution. Compared to contact information, interaction at a given distance gives a higher-resolution description of the residue interaction pattern. Therefore, our scoring function contains more information and thus may yield better alignment accuracy and homology detection rate.

Aligning two MRFs by ADMM (Alternating Direction Method of Multipliers)

As mentioned before, an alignment can be represented as a path in the alignment matrix, which encodes an exponential number of paths. We can use a set of binary variables $z^u_{i,j}$ to indicate which path is chosen, where $1 \le i \le N_1$ and $1 \le j \le N_2$, $N_1$ and $N_2$ are the lengths of the two MSAs, $(i,j)$ is an entry in the alignment matrix and $u$ is the associated state. $z^u_{i,j}$ is equal to 1 if the alignment path passes $(i,j)$ with state $u$. Therefore, the problem of finding the best alignment between two MRFs can be formulated as the following quadratic optimization problem.

$$\max_{z}\; \sum_{i,j,u} \theta^u_{i,j}\, z^u_{i,j} + \frac{1}{L} \sum_{i,j,k,l,u,v} \theta^{uv}_{i,j,k,l}\, z^u_{i,j}\, z^v_{k,l} \qquad (\mathrm{P1})$$

where $\theta^u_{i,j}$ and $\theta^{uv}_{i,j,k,l}$ are the node and edge alignment potentials described in the previous section. Meanwhile, $\theta^{uv}_{i,j,k,l}$ is equal to 0 if either $u$ or $v$ is not a match state. $L$ is the alignment length and $1/L$ is used to bring the accumulated node and edge potentials to a similar scale. Note that $L$ is unknown; we describe how to determine it later in this section. Finally, the solution of P1 is subject to the constraint that all the $z^u_{i,j}$ with value 1 form a valid alignment path. This constraint is also enforced in all the optimization problems described in this section.

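It is computationally intractable to find the optimal solution of P1. Below we present an ADMM (Alternating Direction Method of Multipliers) method that efficiently finds a suboptimal solution. See [42] for a tutorial of the ADMM method. To use ADMM, we rewrite P1 as follows by making a copy $y$ of $z$, without changing the solution space.

$$\max_{z,y}\; \sum_{i,j,u} \theta^u_{i,j}\, z^u_{i,j} + \frac{1}{L} \sum_{i,j,k,l,u,v} \theta^{uv}_{i,j,k,l}\, z^u_{i,j}\, y^v_{k,l} \quad \text{s.t.}\;\; z^u_{i,j} = y^u_{i,j}\;\; \forall i,j,u \qquad (\mathrm{P2})$$

Problem P2 can be augmented by adding a term that penalizes the difference between $z$ and $y$.

$$\max_{z,y}\; \sum_{i,j,u} \theta^u_{i,j}\, z^u_{i,j} + \frac{1}{L} \sum_{i,j,k,l,u,v} \theta^{uv}_{i,j,k,l}\, z^u_{i,j}\, y^v_{k,l} - \frac{\rho}{2} \sum_{i,j,u} \big(z^u_{i,j} - y^u_{i,j}\big)^2 \quad \text{s.t.}\;\; z^u_{i,j} = y^u_{i,j}\;\; \forall i,j,u \qquad (\mathrm{P3})$$

P3 is equivalent to P2 and P1, but converges faster due to the penalty term. Here $\rho$ is a hyper-parameter influencing the convergence rate of the algorithm. Some heuristic algorithms have been proposed for choosing $\rho$ at each iteration, such as [43], [44]. Empirically, setting $\rho$ to a constant ($\rho = 0.5$) enables our algorithm to converge within 10 iterations for most protein pairs.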

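Adding the constraint $z = y$ to the objective of P3 using a Lagrange multiplier $\lambda$, we have the following Lagrangian dual problem:

$$\min_{\lambda} \max_{z,y}\; \sum_{i,j,u} \theta^u_{i,j}\, z^u_{i,j} + \frac{1}{L} \sum_{i,j,k,l,u,v} \theta^{uv}_{i,j,k,l}\, z^u_{i,j}\, y^v_{k,l} + \sum_{i,j,u} \lambda^u_{i,j} \big(z^u_{i,j} - y^u_{i,j}\big) - \frac{\rho}{2} \sum_{i,j,u} \big(z^u_{i,j} - y^u_{i,j}\big)^2 \qquad (\mathrm{P4})$$

It is easy to prove that P3 is upper bounded by P4. Now we will solve P4 and use its solution to approximate P3 and thus P1.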

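Since both $z$ and $y$ are binary variables, the last term in (P4) can be expanded as follows.

$$\frac{\rho}{2} \sum_{i,j,u} \big(z^u_{i,j} - y^u_{i,j}\big)^2 = \frac{\rho}{2} \sum_{i,j,u} \big(z^u_{i,j} + y^u_{i,j} - 2\, z^u_{i,j}\, y^u_{i,j}\big) \qquad (5)$$

For a fixed $\lambda$, we can split P4 into the following two sub-problems.

$$y^* = \arg\max_{y} \sum_{k,l,v} C^v_{k,l}\, y^v_{k,l} \qquad (\mathrm{SP1})$$

where $C^v_{k,l} = \frac{1}{L} \sum_{i,j,u} \theta^{uv}_{i,j,k,l}\, z^u_{i,j} - \lambda^v_{k,l} - \frac{\rho}{2}\big(1 - 2 z^v_{k,l}\big)$.

$$z^* = \arg\max_{z} \sum_{i,j,u} D^u_{i,j}\, z^u_{i,j} \qquad (\mathrm{SP2})$$

where $D^u_{i,j} = \theta^u_{i,j} + \frac{1}{L} \sum_{k,l,v} \theta^{uv}_{i,j,k,l}\, y^v_{k,l} + \lambda^u_{i,j} - \frac{\rho}{2}\big(1 - 2 y^u_{i,j}\big)$.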

The sub-problem SP1 optimizes the objective function with respect to y while fixing z, and the sub-problem SP2 optimizes the objective function with respect to z while fixing y. SP1 and SP2 do not contain any quadratic term, so they can be efficiently solved using the classical dynamic programming algorithm for sequence or HMM-HMM alignment.
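Because each sub-problem is a sum of per-vertex scores over a valid alignment path, a standard global-alignment recursion suffices. The sketch below is a simplified illustration, not the authors' implementation: it assumes three states (match, insertion in T, insertion in S), position-specific scores for every state, and no restrictions on state transitions. The array layouts and names are hypothetical.

```python
def align_linear(score_m, score_it, score_is):
    """Maximize the sum of per-vertex scores over all global alignment paths.

    score_m[i][j]  : T node i matched to S node j          (n1 x n2)
    score_it[i][j] : T node i unaligned, j S nodes used    (n1 x (n2+1))
    score_is[i][j] : S node j unaligned, i T nodes used    ((n1+1) x n2)
    Returns (best_score, path) with path entries (i, j, state).
    """
    n1, n2 = len(score_m), len(score_m[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * (n2 + 1) for _ in range(n1 + 1)]
    back = [[None] * (n2 + 1) for _ in range(n1 + 1)]
    best[0][0] = 0.0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                cands.append((best[i - 1][j - 1] + score_m[i - 1][j - 1], "M"))
            if i > 0:
                cands.append((best[i - 1][j] + score_it[i - 1][j], "I_T"))
            if j > 0:
                cands.append((best[i][j - 1] + score_is[i][j - 1], "I_S"))
            # Ties are broken in favor of the first candidate listed.
            best[i][j], back[i][j] = max(cands, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the path.
    path, i, j = [], n1, n2
    while i > 0 or j > 0:
        state = back[i][j]
        if state == "M":
            path.append((i - 1, j - 1, state))
            i, j = i - 1, j - 1
        elif state == "I_T":
            path.append((i - 1, j, state))
            i -= 1
        else:
            path.append((i, j - 1, state))
            j -= 1
    return best[n1][n2], path[::-1]
```

Plugging in the SP1 or SP2 coefficients as the per-vertex scores would give the corresponding sub-problem solution; the cost is proportional to the product of the two MRF lengths.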

In summary, we solve P4 using the following procedure.

  1. Initialize z by aligning the two MRFs without the edge alignment potential, which can be done by dynamic programming. Accordingly, initialize L as the length of the initial alignment.
  2. Solve (SP1) first and then (SP2) using dynamic programming, each generating a feasible alignment.
  3. If the algorithm converges, i.e., the difference between z and y is very small or zero, stop here. Otherwise, update the alignment length L to the length of the alignment just generated, update the Lagrange multiplier $\lambda$ by subgradient descent as in Eq. (6), and then go back to Step 2.

     $$\lambda^{(n+1)} = \lambda^{(n)} - \rho\,\big(z^* - y^*\big) \qquad (6)$$

Due to the quadratic penalty term in P3, this ADMM algorithm usually converges much faster and also yields better solutions than it would without this term. Empirically, it converges within 10 iterations for most protein pairs. See [42] for the convergence proof of a general ADMM algorithm.
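The three-step procedure can be sketched on flattened variables. The sketch assumes the sub-problem coefficients take the form "edge term scaled by 1/L, plus or minus the multiplier, minus (rho/2)(1 - 2 x)" that follows from expanding the penalty for binary variables, and it takes the linear solver as a callback. The `one_per_pair` oracle is only a toy stand-in for the alignment dynamic program so that the loop can be run; all names are hypothetical.

```python
def one_per_pair(scores):
    """Toy feasibility oracle (NOT an alignment DP): keep the better of each pair."""
    chosen = [0] * len(scores)
    for start in range(0, len(scores), 2):
        pick = start if scores[start] >= scores[start + 1] else start + 1
        chosen[pick] = 1
    return chosen


def admm_align(theta_node, theta_edge, solve_linear, rho=0.5, max_iter=10):
    """Alternate between the two linear sub-problems until z and y agree."""
    n = len(theta_node)
    lam = [0.0] * n
    z = solve_linear(theta_node)          # Step 1: ignore edge potentials
    length = max(1, sum(z))               # alignment length L
    for _ in range(max_iter):
        # Step 2a (SP1): optimize y with z fixed.
        c = [sum(theta_edge[a][b] * z[a] for a in range(n)) / length
             - lam[b] - 0.5 * rho * (1 - 2 * z[b]) for b in range(n)]
        y = solve_linear(c)
        # Step 2b (SP2): optimize z with y fixed.
        d = [theta_node[a]
             + sum(theta_edge[a][b] * y[b] for b in range(n)) / length
             + lam[a] - 0.5 * rho * (1 - 2 * y[a]) for a in range(n)]
        z = solve_linear(d)
        if z == y:                        # Step 3: converged
            break
        lam = [lam[a] - rho * (z[a] - y[a]) for a in range(n)]
        length = max(1, sum(z))
    return z
```

In the toy setting, an edge potential that rewards choosing variables 0 and 3 together pulls the solution away from the node-only choice, which mirrors how edge potentials can revise the initial alignment.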

A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5 [45].

Author Contributions

Conceived and designed the experiments: JM SW JX. Performed the experiments: JM SW ZW. Analyzed the data: JM SW JX. Contributed reagents/materials/analysis tools: JM SW ZW. Wrote the paper: JM JX.

References

  1. Forrest LR, Tang CL, Honig B (2006) On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophysical journal 91: 508–517.
  2. Schwede T, Kopp J, Guex N, Peitsch MC (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic acids research 31: 3381–3385.
  3. Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856.
  4. Ma J, Wang S, Zhao F, Xu J (2013) Protein threading using context-specific alignment potential. Bioinformatics 29: i257–i265.
  5. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular biology and evolution 18: 691–699.
  6. Feng D-F, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of molecular evolution 25: 351–360.
  7. Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular biology and evolution 27: 221–224.
  8. Sokolov A, Ben-Hur A. Multi-view prediction of protein function; 2011. ACM. pp. 135–142.
  9. Engelhardt BE, Jordan MI, Muratore KE, Brenner SE (2005) Protein molecular function prediction by Bayesian phylogenomics. PLoS computational biology 1: e45.
  10. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, et al. (2013) A large-scale evaluation of computational protein function prediction. Nature methods 10: 221–7.
  11. Rychlewski L, Li W, Jaroszewski L, Godzik A (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science 9: 232–241.
  12. Eddy SR (2011) Accelerated profile HMM searches. PLoS computational biology 7: e1002195.
  13. Söding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21: 951–960.
  14. Bairoch A (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 19: 2241.
  15. Park J, Teichmann SA, Hubbard T, Chothia C (1997) Intermediate sequences increase the detection of homology between sequences. Journal of molecular biology 273: 349–354.
  16. Lu Y, Sze S-H (2008) Multiple sequence alignment based on profile alignment of intermediate sequences. Journal of Computational Biology 15: 767–777.
  17. Daniels NM, Hosur R, Berger B, Cowen LJ (2012) SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone. Bioinformatics 28: 1216–1222.
  18. Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Research 24: 3836–3845.
  19. Lyngso R, Pedersen CN, Nielsen H (1999) Metrics and similarity measures for hidden Markov models. Proc Int Conf Intell Syst Mol Biol 178–86.
  20. Edgar RC, Sjölander K (2003) SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics 19: 1404–1411.
  21. Yona G, Levitt M (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. Journal of molecular biology 315: 1257–1275.
  22. von Öhsen N, Sommer I, Zimmer R (2003) Profile-profile alignments: a powerful tool for protein structure prediction. Pac Symp Biocomput 252–63.
  23. Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. Journal of molecular biology 326: 317–336.
  24. Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Research 28: 254–256.
  25. Park J, Karplus K, Barrett C, Hughey R, Haussler D, et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of molecular biology 284: 1201–1210.
  26. Jaakkola T, Diekhans M, Haussler D (1999) Using the Fisher kernel method to detect remote protein homologies. Proc Int Conf Intell Syst Mol Biol 149–58.
  27. Liao L, Noble WS (2003) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of computational biology 10: 857–868.
  28. Leslie CS, Eskin E, Noble WS (2002) The spectrum kernel: A string kernel for SVM protein classification. World Scientific. pp. 566–575.
  29. Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20: 467–476.
  30. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology 247: 536–540.
  31. Wang S, Ma J, Peng J, Xu J (2013) Protein structure alignment beyond spatial proximity. Scientific reports 3.
  32. Angermüller C, Biegert A, Söding J (2012) Discriminative modelling of context-specific amino acid substitution probabilities. Bioinformatics 28: 3240–3247.
  33. Remmert M, Biegert A, Hauser A, Söding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods 9: 173–175.
  34. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic acids research 33: 2302–2309.
  35. Menke M, Berger B, Cowen L (2008) Matt: local flexibility aids protein multiple structure alignment. PLoS computational biology 4: e10.
  36. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25: 3389–3402.
  37. Jones DT, Buchan DW, Cozzetto D, Pontil M (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28: 184–190.
  38. Wang Z, Xu J (2013) Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics 29: i266–i273.
  39. Peng J, Bo L, Xu J (2009) Conditional neural fields. pp. 1419–1427.
  40. Ma J, Peng J, Wang S, Xu J (2012) A conditional neural fields model for protein threading. Bioinformatics 28: i59–i66.
  41. Zhao F, Xu J (2012) A Position-Specific Distance-Dependent Statistical Potential for Protein Structure and Functional Study. Structure 20: 1118–1126.
  42. Hestenes MR (1969) Multiplier and gradient methods. Journal of optimization theory and applications 4: 303–320.
  43. Wahlberg B, Boyd S, Annergren M, Wang Y (2012) An ADMM algorithm for a class of total variation regularized estimation problems. arXiv preprint arXiv 12031828.
  44. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3: 1–122.
  45. Ma J, Wang S, Wang Z, Xu J (2014) MRFalign: Protein Homology Detection through Alignment of Markov Random Fields. In: Research in Computational Molecular Biology, Springer. pp 173–174.