MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.


Introduction
Sequence-based protein alignment and homology detection has been extensively studied and widely applied to many biological problems such as homology modeling [1][2][3][4], phylogeny inference [5][6][7] and protein function prediction [8][9][10]. Although extensively studied, remote homology detection still remains very challenging, especially for homologs with divergent sequences. So far the most sensitive method for homology detection is based upon comparison of protein sequence profiles, which are usually derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. That is, instead of aligning two primary sequences, homologs can be detected by aligning protein sequence profiles. To facilitate comparison and alignment, an MSA is usually represented as a position-specific scoring matrix (PSSM) [11] or an HMM (Hidden Markov Model) [12,13]. HMM is more sensitive than PSSM because 1) HMM contains positionspecific gap information; and 2) HMM also takes into account correlation among sequentially adjacent residues. Sequence signature libraries [14] and intermediate sequence based methods [15,16] are also developed to make use of evolutionary information of a protein. All these methods are sensitive to close homologs, but not good enough for remote homologs. The main issue of existing profile-based methods lies in that they make use of only position-specific amino acid mutation patterns and very shortrange residue correlation, but not long-range residue interaction. However, remote homologs may have very divergent sequences and are only similar at the level of (long-range) residue interaction pattern, which is not encoded in current popular PSSM or HMM models.
To significantly advance homology detection, this paper presents a Markov Random Fields (MRFs) modeling of a multiple sequence alignment (MSA). Compared to HMM, MRFs can model long-range residue interactions and thus, encodes information for the global 3D structure of a protein family. In particular, MRF is a graphical model encoding a probability distribution over the MSA by a graph and a set of preset statistical functions. A node in the MRF corresponds to one column in the MSA and one edge specifies correlation between two columns. Each node is associated with a function describing position-specific amino acid mutation pattern. Similarly, each edge is associated with a function describing correlated mutation statistics between two columns. With MRF representation, alignment of two proteins or protein families becomes that of two MRFs. To align two MRFs, a scoring summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2 5.
-This Methods article is associated with RECOMB 2014 function or alignment potential is needed to measure the similarity of two MRFs. We use a scoring function consists of both node alignment potential and edge alignment potential, which measure the node (i.e., amino acid) similarity and edge (i.e., interaction pattern) similarity, respectively. It is computationally challenging to optimize a scoring function containing edge alignment potential. To deal with this, we formulate the MRF-MRF alignment problem as an integer programming problem and then develop an ADMM (Alternative Direction Method of Multipliers) algorithm to solve it efficiently to a suboptimal solution. ADMM divides the MRF alignment problem into two tractable sub-problems and then iteratively solve them until they reach consistent solutions.
Experiments show that our MRF-MRF alignment method, denoted as MRFalign, can generate more accurate alignments and is also much more sensitive than others in detecting remote homologs. MRFalign works particularly well on mainly-beta proteins.

Related work
Cowen has developed a program SMURFLite for fold recognition based upon the MRF representation of a protein family [17]. Nevertheless, our MRFalign method is significantly different from SMURFLite in a couple of aspects: 1) SMURLite builds an MRF based upon multiple structure alignment instead of multiple sequence alignment (MSA). As such, it cannot apply to sequence-based homology detection in the absence of native structures. In contrast, our method builds MRFs purely based upon MSA and thus, applies to sequence-based protein alignment and homology detection; and 2) SMURLite can only align a single primary sequence to an MRF, while our method aligns two MRFs to yield higher sensitivity. This difference requires us to develop totally new methods to build MRFs from MSA, measure similarity of two MRFs, and optimize the MRF-MRF alignment potential.
Some studies such as [20] also combine phylogeny information with PSSM-based profile comparison. Homology detection can also be done without aligning proteins. For example, we can represent a protein sequence or profile as a feature vector and then search for homologs by comparing feature vectors. Early methods such as [24] usually conduct straightforward comparison of feature vectors, but are not very sensitive [25]. Improvement in these alignment-free methods results from the application of discriminative learning approaches such as SVM-Fisher [26], SVM-pairwise [27], SVM with the spectrum kernel [28] and SVM with the mismatch kernel [29]. These SVMbased methods are reported to outperform the simple feature comparison methods. Comparing to alignment-based homology detection, alignment-free methods are usually faster but less sensitive.

Training and validation data
To train the node alignment potential, we constructed the training and validation data from SCOP70. The sequence identity of all the training and validation protein pairs is uniformly distributed between 20% and 70%. Further, two proteins in any pair are similar at superfamily or fold level. In total we use a set of 1400 protein pairs as the training and validation data, which covers 458 SCOP folds [30]. Five-fold cross validation is used to choose the hyper-parameter in our machine learning model. In particular, every time we choose 1000 out of the 1400 protein pairs as the training data and the remaining 400 pairs as the validation data such that there is no fold-level redundancy between the training and validation data. A training or validation protein has less than 400 residues and contains less than 10% of residues without 3D coordinates. The reference alignment for a protein pair is generated by a structure alignment tool DeepAlign [31]. Each reference alignment has fewer than 50 gap positions in the middle and the number of terminal gaps is less than 20% of the alignment length.

Test data
The data used to test alignment accuracy has no fold-level overlap with the training and validation data. In particular, we use the following three datasets to test the alignment accuracy, which are subsets of the test data used in [4] to benchmark protein modeling methods.
1. Set3.6K: a set of 3617 non-redundant protein pairs. Two proteins in a pair share ,40% sequence identity and have small length difference. By ''non-redundant'' we mean that in any two protein pairs, there are at least two proteins (one from each pair) sharing less than 25% sequence identity. 2. Set2.6K: a set of 2633 non-redundant protein pairs. Two proteins in a pair share ,25% sequence identity and have length difference larger than 30%. This set is mainly used to test the performance of one method in handling with domain boundary. 3. Set60K: a very large set of 60929 protein pairs, in most of which two proteins share less than 40% sequence identity. Meanwhile, 846, 40902, and 19181 pairs are similar at the SCOP family, superfamily and fold level, respectively, and 151, 2691 and 2218 pairs consist of only all-beta proteins, respectively.
We use the following benchmarks to test remote homology detection success rate.

Author Summary
Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods employ HMM-HMM comparison, which models a protein family using HMM (Hidden Markov Model) and then detects homologs using HMM-HMM alignment. HMM cannot model long-range residue interaction patterns and thus, carries very little information regarding the global 3D structure of a protein family. As such, HMM comparison is not sensitive enough for distantly-related homologs. In this paper, we present an MRF-MRF comparison method for homology detection. In particular, we model a protein family using Markov Random Fields (MRF) and then detect homologs by MRF-MRF alignment. Compared to HMM, MRFs are able to model long-range residue interaction pattern and thus, contains information for the overall 3D structure of a protein family. Consequently, MRF-MRF comparison is much more sensitive than HMM-HMM comparison. To implement MRF-MRF comparison, we have developed a new scoring function to measure the similarity of two MRFs and also an efficient ADMM algorithm to optimize the scoring function. Experiments confirm that MRF-MRF comparison indeed outperforms HMM-HMM comparison in terms of both alignment accuracy and remote homology detection, especially for mainly beta proteins.
1. SCOP20, SCOP40 and SCOP80, which are used by Söding group to study context-specific mutation score [32]. They are constructed by filtering the SCOP database with a maximum sequence identity of 20%, 40% and 80%, respectively. In total they have 4884, 7088, and 9867 proteins, respectively, and 1281, 1806, and 2734 beta proteins, respectively.
We run PSI-BLAST with 5 iterations to detect sequence homologs and generate MSAs for the first three datasets. The MSA files for the three SCOP benchmarks are downloaded from the HHpred website (ftp://toolkit.genzentrum.lmu.de/pub/). Pseudocounts are used in building sequence profiles. Real secondary structure information is not used since this paper focuses on sequence-based homology detection.

Evaluation criteria
Three performance metrics are used including referencedependent alignment precision, alignment recall and homology detection success rate. Alignment precision is defined as the fraction of aligned positions that are correctly aligned. Alignment recall is the fraction of alignable residues that are correctly aligned. Reference alignments are used to judge if one residue is correctly aligned or alignable. To reduce bias, we use three very different structure alignment tools to generate reference alignments, including TM-align [34], Matt [35], and DeepAlign [31].

Reference-dependent alignment recall
As shown in Tables 1 and 2, our method MRFalign exceeds all the others regardless of the reference alignments on both dataset Set3.6K and Set2.6K. MRFalign outperforms HHalign by ,10% on both datasets, and HHMER by ,23% and ,24%, respectively. If 4-position off the exact match is allowed in calculating alignment recall, MRFalign outperforms HHalign by ,11% on both datasets, and HHMER by ,25% and ,33%, respectively.
On the very large set Set60K, as shown in Table 3, our method outperforms the other two in each SCOP classification regardless of the reference alignments used. MRFalign is only slightly better than HHalign at the family level, which is not surprising since it is easy to align two closely-related proteins. At the superfamily level, our method outperforms HHalign and HMMER by ,6% and ,18%, respectively. At the fold level, our method outperforms HHalign and HHMER by ,7% and ,14%, respectively.
On the very large set Set60K, as shown in Table 6, our method outperforms the other two in each SCOP classification regardless of the reference alignments used. At the family level, our method outperforms HHalign and HMMER by ,3% and ,4%, respectively. At the superfamily level, our method outperforms HHalign and HMMER by ,4% and ,5%, respectively. At the fold level, our method outperforms HHalign and HHMER by ,5% and ,8%, respectively.

Homology detection success rate
To evaluate homology detection rate, we employ three benchmarks SCOP20, SCOP40 and SCOP80 introduced in [32]. For each protein sequence in one benchmark, we treat it as a query, align it to all the other proteins in the same benchmark and then examine if those with the best alignment scores are similar to the query or not. We also conducted homology detection experiments using hmmscan, FFAS, HHsearch and HHblits with default options. The success rate is measured at the superfamily and fold levels, respectively. When evaluating the success rate at the superfamily (fold) level, we exclude those proteins similar to the query at least at the family (superfamily) level. For each query protein, we examine the top 1-, 5-and 10-ranked proteins, respectively.
As shown in Table 7, tested on SCOP20, SCOP40 and SCOP80 at the superfamily level, our method MRFalign succeeds on ,6%, ,4% and ,4% more query proteins than HHsearch, respectively, when only the first-ranked proteins are considered. As shown in Table 8, at the fold level, MRFalign succeeds on ,11%, ,11% and ,12% more proteins than HHsearch, respectively, when only the first-ranked proteins are evaluated. At the superfamily level, SCOP20 is more challenging than the other two benchmarks because it contains fewer proteins similar at this level. Nevertheless, at the fold level, SCOP80 is slightly more challenging than the other two benchmarks maybe because it contains many more irrelevant proteins and thus, the chance of ranking false positives at top is higher. Similar to alignment accuracy, our method for homology detection also has a larger advantage on the beta proteins. In particular, as shown in Table 9, tested on SCOP20, SCOP40 and SCOP80 at the superfamily level, MRFalign succeeds on ,7%, ,5% and ,7% more proteins than HHsearch, respectively, when only the first-ranked proteins are evaluated. As shown in Table 10, at the fold level, MRFalign succeeds on ,13%, ,16% and ,17% more proteins than HHsearch, respectively, when only the firstranked proteins are evaluated. Note that in this experiment, only the query proteins are mainly-beta proteins, the subject proteins can be of any types. If we restrict the subject proteins to only beta proteins, the success rate increases further due to the reduction of false positives.

Contribution of edge alignment potential and mutual information
To evaluate the contribution of our edge alignment potential, we calculate the alignment recall improvement resulting from using edge alignment potential on two benchmarks Set3.6K and Set2.6K. As shown in Table 11, our edge alignment potential can improve alignment recall by 3.4% and 3.7%, respectively. When mutual information is used, we can further improve alignment recall by 1.1% and 1.9% on these two sets, respectively. Mutual information is mainly useful for proteins with many sequence homologs since it is close to 0 when there are few sequence homologs. As shown in Table 11, if only those proteins with at least 256 non-redundant sequence homologs are considered, the improvement resulting from mutual information is ,3%. Figure 1 shows the running time of MRFalign with respect to protein length. As a control, we also show the running time of the Viterbi algorithm, which is used by our ADMM algorithm to generate alignment at each iteration. As shown in this figure, MRFalign is no more than 10 times slower than the Viterbi algorithm. To speed up homology detection, we first use the Viterbi algorithm to perform an initial search without considering edge alignment potential and keep only top 200 proteins, which are then subject to realignment and rerank by our MRFalign method. Therefore, although MRFalign may be very slow compared to the Viterbi algorithm, empirically we can do homology search only slightly slower than the Viterbi algorithm.

Is our MRFalign method overtrained?
We conducted two experiments to show that our MRFalign is not overtrained. In the first experiment, we used 36 CASP10 hard targets as the test data. Our training set was built before CASP10 started, so there is no redundancy between the CASP10 hard targets and our training data. Using MRFalign and HHpred, respectively, we search each of these 36 test targets against PDB25 to find the best match. Since PDB25 does not contain proteins very similar to many of the test targets, we built a 3D model using MODELLER from the alignment between a test target and its best match and then measure the quality of the model. As shown in Figure 2, MRFalign can yield much better 3D models than  HHsearch for most of the targets. This implies that our method can generalize well to the test data not similar to the training data.
In the second experiment, we divide the proteins in SCOP40 into three subsets according their similarity with all the training data. We measure the similarity of one test protein with all the training data by its best BLAST E-value. We used two values 1e-2 and 1e-35 as the E-value cutoff so that the three subsets have roughly the same size. As shown in Table 12, the advantage of our method in remote homology detection over HHpred is roughly same across the three subsets. Since HHpred is an unsupervised algorithm, this implies that the performance of our method is not correlated to the test-training similarity. Therefore, it is unlikely that our method is overfit by the training data.

Discussion
This paper has presented a new method for sequence-based protein homology detection that compares two protein sequences or families through alignment of two Markov Random Fields (MRFs), which model the multiple sequence alignment (MSA) of a protein family using an undirected general graph in a probabilistic way. The MRF representation is better than the extensively-used PSSM and HMM representations in that the former can capture long-range residue interaction pattern, which reflects the overall 3D structure of a protein family. As such, MRF comparison is much more sensitive than HMM comparison in detecting remote homologs. This is validated by our large-scale experimental tests showing that MRF-MRF comparison can greatly improve alignment accuracy and remote homology detection over currently popular sequence-HMM, PSSM-PSSM, and HMM-HMM comparison methods. Our method also has a larger advantage over the others on mainlybeta proteins.
We build our MRF model of a protein family based upon multiple sequence alignment (MSA) in the absence of native structures. The accuracy of the MRF model depends on the accuracy of an MSA. Currently we rely on the MSA generated by PSI-BLAST. In the future, we may explore better alignment methods for MSA building or even utilize solved structures of one or two protein sequences to improve MSA. The accuracy of the MRF model parameter usually increases with respect to the number of non-redundant sequence homologs in the MSA. Along with more and more protein sequences are generated by a variety of sequencing projects, we shall be able to build accurate MRFs for more and more protein families and thus, detect their homologous relationship more accurately.
An accurate scoring function is essential to MRF-MRF comparison. Many different methods can be used to measure node and edge similarity of two MRFs, just like many different scoring functions can be used to measure the similarity of two PSSMs or HMMs. This paper presents only one of them. In the future we may explore more possibilities. It is computationally intractable to find the best alignment between two MRFs when edge similarity is taken into consideration. This paper presents an ADMM algorithm that can efficiently solve the MRF-MRF alignment problem to suboptimal. However, this algorithm currently is about 10 times slower than the Viterbi algorithm for PSSM-PSSM alignment. Further tuning of this ADMM algorithm is needed for very large-scale homology detection. Given a protein primary sequence, we run PSI-BLAST [36] with 5 iterations and E-value cutoff 0.001 to find its sequence homologs. PSI-BLAST also generates an MSA of the sequence homologs. Let X i be a finite discrete random variable representing the amino acid at column i in the MSA, taking values from 1 to 21, corresponding to 20 amino acids and gap. Then we can use a multivariate random variable X~fX 1 ,X 2 , . . . ,X N g, where N is the number of columns, to model the MSA. We use an MRF to define the probability distribution of X. MRF is an undirected graph that can be used to model a set of correlated random variables. As shown in Fig. 3, an MRF node represents one column in the MSA and an edge represents the correlation between two columns i and k when Di{jD §6. We ignore very short-range correlation (i.e., Di{jDv6) since it is not very informative. The MRF consists of two types of functions: w(X i ) and y(X i ,X k ), where w(X i ) is an amino acid preference function for node i and y(X i ,X k ) is a pairwise amino acid preference function for edge (i, k) that reflects interaction between two nodes. Then, the probability of observing a particular protein sequence X can be calculated as follows.
where Z is the normalization factor. We use two kinds of information in MRFs for their alignment. One is the occurring probability of 20 amino acids and gap at each node (i.e., each column in MSA), which can also be interpreted as the marginal probability at each node. The other is the correlation between two nodes, which can be interpreted as interaction strength of two MSA columns and calculated by several different ways. For example, we can use a contact prediction program such as PSICOV [37] and PhyCMAP [38] for this purpose. PSICOV assumes that P(X ) is a Gaussian distribution function and calculates the correlation between two columns by inverse covariance matrix. PhyCMAP takes sequence information (including mutual information) as input and predicts the probability of two residues forming a contact, which can be used to indicate the interaction strength of two columns. However, it takes time to run these programs, in current implementation we calculate the mutual information (MI) and its power series of two columns as interaction strength. That is, we use MI, MI 2 , …, MI 11 to quantify all the pairwise interaction strength where MI is the mutual information matrix. The MI power series are much more informative than the MI alone, as tested in our contact prediction program PhyCMAP.

Scoring function for the alignment of two Markov Random Fields (MRFs)
Our scoring function for MRF-MRF alignment is a linear combination of node alignment potential and edge alignment potential with equal weight. Let T and S denote two MRFs for the two proteins under consideration. There are three possible alignment states M, I t and I s where M represents two nodes being aligned, I t denotes an insertion in T (i.e., one node in T is not aligned), and I s denotes an insertion in S (i.e., one node in S is not aligned). As shown in Fig. 4, each alignment can be represented as a path in an alignment matrix, in which each vertex can be exactly determined by its position in the matrix and its state. For example, the first vertex in the path can be written as (0, 0, dummy), the 2 nd vertex as (1,1,M) and the 3 rd vertex as (2,1,I s ). Therefore, we can write an alignment as a set of triples, each of which has a form like (i,j,u) where (i,j) represents the position and u the state.  Table 9. Homology detection performance for mainly beta proteins at the superfamily level. Node alignment potential. Given an alignment path, its node alignment potential is the accumulative potential of all the vertices in the path. We use a Conditional Neural Fields (CNF) [39] method, which is very similar to what is described in the protein threading paper [40], to estimate the occurring probability of an alignment and then derive node alignment potential from this CNF. Briefly speaking, we estimate the occurring probability of an alignment A between T and S as follows.
where Z(T,S) is a normalization factor summarizing all the possible alignments between T and S, and E u (T i ,S j ) is a neural network with one hidden layer that calculates the log-likelihood of a vertex (i,j,u) in the alignment path, where i is a node in T, j a node in S, and u a state. When u is a match state, E u takes as input the sequence profile context of two nodes i and j, denoted as T i and S j , respectively, and yields the log-likelihood of these two nodes being matched. When u is an insertion state, it takes as input at the sequence profile context of one node and yields the log-likelihood of this node being an insertion. The sequence profile context of node i is a 21|(2wz1) matrix where w~5, consisting of the marginal probability of 20 amino acids and gap at 2wz1 nodes indexed by i{w, i{wz1, . . . ,i,iz1, . . . izw In case that one column does not exist (when iƒw or izwwN), zero is used. We train the parameters in E u by maximizing the occurring probability of a set of reference alignments, which are generated by a structure alignment tool DeepAlign [31]. That is, we optimize the model parameters so that the structure alignment of one training protein pair has the largest probability among all possible alignments. A L 2 -norm regularization factor, which is determined by 5-fold cross validation, is used to restrict the search space of model parameters to avoid over-fitting. See the paper [40] for more technical details. Let h u i,j denote the potential of a vertex (i,j,u) in the alignment path. We calculate h u i,j from E u as follows.
where Exp(E u ) is the expected value of E u . It is used to offset the effect of the background, which is the log-likelihood yielded by E u for any randomly-chosen node pairs (or nodes). Once E u is determined, we can approximate its expected value by sampling. That is, we sample ten thousands of node pairs (or nodes) from the training data, feed their sequence information into E u and then calculate the average output of E u as its expected value. Edge alignment potential. It calculates the similarity of two edges, one from each MRF, based upon the interaction strength of two ends in one edge. We can derive interaction strength from the parameters of the MRF model, but it is hard to validate if this interaction strength (or mutual information) is accurate or not even in the presence of native structures since we cannot directly measure interaction strength in a protein. Here we use inter-residue Euclidean distance, which can be measured more easily, to reflect interaction strength of two residues. Later in this section we will describe how to derive the distance probability distribution from the  information (e.g., interaction strength) encoded in MRFs. Let d T ik denote the Euclidean distance between two residues at i and k and d S jl is defined similarly. Note that d T ik and d S jl are unknown since this paper studies sequence-based homology detection in the absence of native structures. Let h i,k,j,l denote the alignment potential between edge (i,k) in T and edge (j,l) in S. We calculate h i,k,j,l as follows. and c k are the sequence profile contexts of two nodes i and k, respectively, and m ik represents the mutual information and its power series (or interaction strength) between these two nodes.
The sequence profile context of node i is a 21|(2wz1) matrix where w~7, consisting of the occurring probability of 20 amino acids and gap at 2wz1 nodes indexed by i{w,i{wz1, . . . ,i,iz1, . . . izw. In case that one column does not exist (when iƒw or izwwN), zero is used. We predict p(d T ik Dc i ,c k ,m ik ) using a probabilistic neural network (PNN) implemented in our context-specific distance-dependent statistical potential package EPAD [41]. EPAD takes as input sequence contexts and mutual information and then yields inter-residue Table 12. Fold recognition rate of our method on SCOP40, with respect to the similarity (measured by E-value) between the test data and the training data.  distance probability distribution. Compared to contact information, here we use interaction at a given distance to obtain a higher-resolution description of the residue interaction pattern. Therefore, our scoring function contains more information and thus, may yield better alignment accuracy and homology detection rate.

Aligning two MRFs by ADMM (Alternating Direction Method of Multipliers)
1 L X i,j,k,l,u,v h uv i,j,k,l z u i,j y v k,l z X It is easy to prove that P3 is upper bounded by P4. Now we will solve P4 and use its solution to approximate P3 and thus, P1. Since both z and y are binary variables, the last term in (P4) can be expanded as follows.
r 2 X i,j,u (z u i,j {y u i,j ) 2~r 2 X i,j,u (z u i,j zy u i,j {2z u i,j y u i,j ) 2 ð5Þ For a fixed l, we can split P4 into the following two sub-problems.
y Ã~a rg maxf X k,l,v y v k,l C v k,l g, ðSP1Þ where D u i,j~h u i,j z 1 L X k,l,v h uv i,j,k,l y vÃ k,l zl u i,j { r 2 (1{y uÃ i,j ) The sub-problem SP1 optimizes the objective function with respect to y while fixing z, and the sub-problem SP2 optimizes the objective function with respect to z while fixing y. SP1 and SP2 do not contain any quadratic term, so they can be efficiently solved using the classical dynamic programming algorithm for sequence or HMM-HMM alignment.
In summary, we solve P4 using the following procedure. 1) Initialize z by aligning the two MRFs without the edge alignment potential, which can be done by dynamic programming. Accordingly, initialize L as the length of the initial alignment. 2) Solve (SP1) first and then (SP2) using dynamic programming, each generating a feasible alignment. 3) If the algorithm converges, i.e., the difference between z and y is very small or zero, stop here. Otherwise, we update the alignment length L as the length of the alignment just generated and the Lagrange multiplier l using subgradient descent as in Eq. (6), and then go back to Step 2).
Due to the quadratic penalty term in P3 this ADMM algorithm usually converges much faster and also yields better solutions than without this term. Empirically, it converges within 10 iterations for most protein pairs. See [42] for the convergence proof of a general ADMM algorithm.