Generalized Baum-Welch Algorithm Based on the Similarity between Sequences

The profile hidden Markov model (PHMM) is widely used to assign the protein sequences to their respective families. A major limitation of a PHMM is the assumption that given states the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in a multiple sequence alignment (MSA) which is the representative of a PHMM can be appended to the PHMM. Due to the fact that with a MSA, the sequences of amino acids are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and its corresponding amino acid located above can be combined with the PHMM. For this purpose, the new emission probability matrix which considers the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types; transition and emission probabilities which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We have generalized the Baum-Welch algorithm using similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. Then, the performance of similarity emission is discussed by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.


Introduction
Structure and function determination of newly discovered proteins, using the information contained in their amino acid sequences, is one of the most important problems in genomics [1]. Often, but certainly not always, as the homologous proteins have similar sequences and structures, they have similar functions [2]. The profile hidden Markov model (PHMM) can be applied to determine the related proteins by sequence comparison [3]. The parameters of a PHMM are of two types; transition and emission probabilities. Under a PHMM, there are two assumptions made for transition and emission probabilities as follows: 1. The t th hidden state, given the (t{1) th hidden state, is independent of previous states. 2. The t th observation depends only on the t th state.
The PHMM is specified as a triplet l~(A,B,P) where A is the transition probability matrix, B is the emission probability matrix and P is the vector of initial probabilities. An important task in assigning a new protein sequence to a protein family is to estimate the parameters of the PHMM. The Parameters of a PHMM (transition probability matrix and emission probability matrix) can be estimated in two ways: they can be estimated either from the aligned sequences or unaligned sequences using the Baum-Welch algorithm [4].
The Baum-Welch algorithm works by guessing initial parameter values, then estimating the likelihood of the observation under the current parameters. This likelihood then will be used to reestimate the parameters iteratively until a local maximum is reached. The Baum-Welch algorithm finds Max l P(observation D l) by considering only the information on the previous state of a hidden state. In other words, in the process of the Baum-Welch algorithm, it is assumed that given states the observations are independent and only the dependency between hidden states is considered. So, the dependency between observations can be combined with the PHMM. For this purpose, the multiple sequence alignment (MSA) which is a representative of a PHMM can be considered. In this paper the ClustalW program which is the current implementation of MSA is used for consideration of the dependency between observations. Based on the MSA, one-by-one dependencies between corresponding amino acids of two current sequences that model the similarity between them can be appended to the PHMM. This approach in spirit is similar to the works proposed by Holmes [5], Qian and Goldstein [6] and Siepel [7] where a PHMM is augmented with phylogenetic trees. In their approach, the evolutionary information is appended to the PHMM. They considered the dependency between sequences based on the fact that all the current sequences (external nodes in the guide tree or phylogenetic tree) are dependent upon their ancestral sequences.
Based on their idea, there is no dependency between two current sequences.
But in our approach, the dependency between two current sequences based on the similarity between them can be appended to the PHMM. Based on the fact that with a MSA, the sequences are biologically related, we can use the MSA to find the areas of similarity between two current sequences. So, the MSA is used for consideration of the one-by-one dependency between observations. In other words, the dependency between corresponding amino acid located above the residue and the residue can be combined with the PHMM. Therefore the new parameters of PHMM called similarity emission (SE) probabilities are created and should be estimated.
It should be noted that the similarity emission probabilities are estimated from the MSA and then combined with the common emission probabilities estimated from Baum-Welch algorithm to generalize the Baum-Welch algorithm. In other words, both aligned and unaligned sequences are used to generalize the Baum-Welch algorithm: aligned sequences for estimation of the similarity emission probabilities and unaligned sequences for estimation of the common emission and transition probabilities.
In this paper, we first construct a PHMM. Then using a MSA, we model the similarity emission (SE) matrix for consideration of the similarity information and generalize the Baum-Welch algorithm. We finally compare the results of applying the similarity emission to the Baum-Welch algorithm with the results of the commonly used emission for sequence alignment. For this purpose we use real data from the top twenty protein families in the Pfam database [8].

The PHMM
The profile hidden Markov model (PHMM) is a useful method to determine distantly related proteins by sequence comparison [3]. The PHMM is a linear structure of three states named; Match (M), Delete (D), and Insert (I). Therefore we need to decide how many states exist in a PHMM. In other words, how many match states do we have in a family? Here we assume that K is the number of match states in the PHMM. A commonly used rule is to set K equal to the number of columns of the MSA including more than half of the amino acid characters. Note that the number of match states is related to the length of the MSA [9]. So, the total number of M, D and I states is 3K. Begin and End states which emit no output symbols are introduced as dummy states [9]. Since there is an Insert state for each transition, there should be a transition from Begin called I 0 . Therefore the total number of states is 3K+3. Twenty amino acids are observed from Match and Insert states. Delete, Begin and End states are silent states because they do not emit any symbols.
Following Durbin [10], we estimate the transition probabilities, A, and the emission probabilities, B, using the plan7 construction ( Figure 1). Unlike the original Krogh/Haussler and HMMER model architecture, Plan 7 has no DRI or IRD transitions. This reduction from 9 to 7 transitions per node in the main model is the origin of the codename Plan 7. Note that the transition probability a ij is the probability of moving from state i to state j i.e. a ij~P (entering state q j at time tz1Dthe process is in state q i at time t) and emission probability b j (k) is the probability of observation o k being emitted from state s j i.e. b j (k)~P(producing o k at time tDthe process is in state q j at time t): Parameters of a PHMM (transition probability matrix A (3Kz3)|(3Kz3) , and emission probability matrix B (3Kz3)|20 ) can be estimated using the Baum-Welch algorithm.

Considering The Similarity Between Sequences in the Baum-Welch Algorithm
The sequences appearing in the final multiple sequence alignment are written based on their similarity [10]. So, in a PHMM, the one-by-one dependency between corresponding amino acids of two current sequences can be considered. Therefore, we propose a model which considers the effect of the similarity information (the dependency between observation) as well as the effect of the hidden state on the previous state of an amino acid in a PHMM. For consideration of the similarity information, we introduce a similarity emission probability matrix based on the multiple sequence alignment. This matrix illustrates the similarity dependencies among the observations. Following the MSA, we assume that protein sequences consisting of 21 observations (20 amino acids and one gap) have been placed on a regular lattice. In other words, each observation is arranged as a site and a matrix with R rows and L columns (length of sequences) is obtained. This matrix is called the MSA matrix, in which the site position above the s = (r, c) is denoted by (r -1, c). Hence, we assume each site on the lattice has a dependency with the corresponding residue located at the above position. This scheme is a special case of the discrete state hidden Markov random field (HMRF) with 2-point cliques ( Table 1). Note that the adjective 'hidden' refers to the states. The ingredients of this model are as follows: where, s{1 is a lattice point at previous state of s and N is the total number of hidden states. 7. Emission probabilities on the lattice: a matrix B with the following entries: where M is a set of symbols that may be observed. 8. Emission probabilities on the lattice based on the above position: a matrix E with the following entries representing the probabilities of the above position of an observation on the lattice: where O Ls is an observation (amino acid or gap) at the above position. 9. Initial value: the probability of starting state at s~(r,1), Vr §1: The likelihood of the parameters (l) given the observations is: where n is a constant equal to  (1), we wish to find l Ã~( A,B,E,P), where l Ã = argmax l L(lDO). The entries of matrices A, B, and the vector P will be estimated through the following steps [11]: 1. Define auxiliary forward variable a s (i) which is the probability of the partial observation sequence O 1 , Á Á Á ,O s at lattice points 1, Á Á Á ,s when it terminates at the state i: 2. Define backward variable b s (i) as the probability of the partial observation sequence O sz1 , Á Á Á ,O T , given that the current state is i: 3. Calculate j s (i,j) as the probability of being in state i at lattice point s and in state j at lattice point sz1, given observations and model: This is the same as, Using forward and backward variables this can be expressed as, 4. Define variable c s (i) as the probability of being in state i at lattice point s, given the observations and model: In forward and backward variables this can be expressed by, Now it is possible to use the Baum-Welch algorithm to maximize the quantity, P(ODl). The estimation of parameters based on iterative calculation can be obtained by the following expressions:â p p s (j)~c s (j),s~(r,1): Hence, we have the matrices of parameter estimation B B (3Kz3)|20 ,Â A (3Kz3)|(3Kz3) , and vectorP P.
Since the matrix E is a type of emission probability matrix, it should have the same size as the matrix B (3Kz3)|20 . The estimation method for the matrix E (3Kz3)|20 is as follows: In the MSA matrix, the frequencies of ordered pairs of 20 amino acids and the gap i.e. (O Ls ,O s ) in each column are determined. It should be noted that O s represents the amino acids (20 types) at lattice point s = (r,c) and O Ls is the amino acids or one gap (21 types) located above O s . In other words, for a given amino acid O s , the position (r{1, c) can be filled with any of 20 types of amino acids or the gap. hence, we can imagine of having a 420 (20|21) by L frequency matrix. After dividing these frequencies by the sum of frequencies in each column, the probabilities are estimated as follows: for which i and k are amino acids or the gap andÊ E s (i,k) is the conditional probability of i given k at lattice point s. This procedure produces the matrixÊ E 420|L .
In each column of matrixÊ E 420|L , for every set of 21 probabilities, the highest probability is chosen. In other words, the highest probability for a given amino acid O s in the position (r, c) should be chosen. Then the matrixÊ E 420|L is reduced to a new matrix with 20 rows and L columns (Ê E 20|L ). After transposing the matrixÊ E 20|L , the matrixÊ E L|20 is obtained.
We assume that the L rows consist of Match and Insert states in which each Insert state can be repeated on its own several times. Using this assumption, we determine the Match states inÊ E L|20 corresponding to Match states inB B (3Kz3)|20 . Note that there are K Match states. In addition, the average values of the rows between each of two Match states inÊ E L|20 are considered as Insert states. So, the matrixÊ E L|20 is changed to the matrix E E (2Kz1)|20 with 2Kz1 Match and Insert states in a row. The Delete states are included by adding zeros to the rows ofÊ E, so that 3Kz1 states are obtained.
Since the Begin and the End states are silent and do not emit any symbols, the two rows with zero number can be added at the beginning and the end of matrixÊ E (3Kz1)|20 . Consequently the matrixÊ E (3Kz3)|20 is obtained. This matrix is the estimation of the matrix {P(O Ls DO s )} (3Kz3)|20 .

Similarity Emission Matrix
The Baum-Welch algorithm defines an iterative procedure for estimating the parameters. It computes maximum likelihood estimators for the unknown parameters given observation [11]. Since the Baum-Welch algorithm finds local optima, it is important to choose initial parameters carefully. In this paper we perform the algorithm with different initial parameters in such a way that the transition probabilities into Match states are larger than transition probabilities into other states. In order to improve the prediction accuracy of assigning sequences to protein families, we consider both emission probability matricesÊ E (3Kz3)|20 and B B 3Kz3|20 . We generalize the Baum-Welch algorithm by integrating the both emission probability matricesÊ E (3Kz3)|20 and B B 3Kz3|20 called similarity emission matrix (SE). In what follows, we give the details: 3. Choose the highest probability for each set of twenty one probabilities of each column of matrixÊ E 420|L , to obtain the matrixÊ E 20|L 4. Transpose the matrixÊ E 20|L to obtain the matrixÊ E L|20 5. Write directly the values of Match states of L rows and the average values of Insert states between two Match states of the matrixÊ E L|20 to obtain the matrixÊ E (2Kz1)|20 . It should be noted the Match and Insert States will be obtained by using the multiple sequence alignment. 6. Add zero rows after each Match and Insert states to thê E E (2Kz1)|20 and also two zero rows as Begin and End states to obtain the matrixÊ E (3Kz3)|20 7. Use Hadamard product that is the entry-wise product of E E (3Kz3)|20 andB B (3Kz3)|20 and then divide the entries by 0.047 to get the estimated similarity emissionŜ SE (3Kz3)|20 with the following entries:

Data Preparation
The Pfam is a well known database of protein families [8]. It is widely used to align new protein sequences to the known proteins of a given family. There are two components in Pfam: Pfam-A and Pfam-B. The entries of Pfam-A have high quality. As shown in Table 2, we use twenty families of Pfam-A for assigning the protein sequences to these families. In this paper due to computational challenges and round-off errors in estimating parameters, we selected just twenty protein families from Pfam database which called top twenty HMM.

Results and Discussion
To assess the performance of our method, ten sequences from each of the top twenty families are randomly removed. These ten removed sequences in each family are used as test sequences, while the other sequences form the training set. We repeat this procedure ten times. Since some of the protein families contain few proteins (likeGP120 and Oxidored q1), we choose just ten samples. Therefore, each time we have selected 200 sequences. In total 2000 sequences are randomly removed. Then we estimate the transition matrix A (3Kz3)|(3Kz3) , emission matrices B (3Kz3)|20 and E (3Kz3)|20 for each protein family. Given top twenty protein families, the score of each removed sequence belonging to each family are computed and compared. To score a sequence and assign it to one of the top twenty families, we use the logarithm of the probability score. It is defined by where prob is the probability of sequence based on parameter estimation and null-prob is equal to (0:05) T where T is the length of sequence. Since there are twenty amino acids, the probability of random occurrence of each of them is 0.05. Hence, for a sequence of L amino acids, the probability of random occurrence is (0:05) L . In this paper, due to computational challenges and round-off errors in estimating probabilities of B (3Kz3)|20 and E (3Kz3)|20 , we have employed logarithm transformation instead of the direct multiplication of these probabilities: log 2Ŝ SE~log 2b bzlog 2Ê E{log 2 0:047: The mean and standard error of the numbers of correctly assigned proteins to the top twenty protein families are shown in Table 3. Based on the results shown in Table 3, the assignment of sequences to the protein families using theŜ SE (3Kz3)|20 is considerably improved. For all protein families, more than half of the sequences are assigned correctly. In the task of assigning protein sequences, measuring the specificity is also important to prevent false positive prediction. Specificity is a statistical measure of the performance of a classification test, also known in statistics as classification function. Specificity measures the proportion of negatives which are correctly identified. This measure is closely related to the concepts of type II errors in testing a statistical hypothesis . Specificity relates to the ability of the test to identify negative results. This can also be written as: where, TN is the number of True Negative and FP is the number of False Positive. In other words, specificity means how many of the true negatives are detected? Ideally, suitable method should have high specificity or a perfect predictor would be described as 100% specificity. The specificity on average is about 75% and 62% using similarity emission and common emission model respectively. In addition to the correct assignment, the mean of the standard assigning scores, based on the matrixŜ SE 3Kz3|20 in most families are more than those obtained by the matrixB B 3Kz3|20 ( Table 4).
The results presented in this paper show that considering a model which incorporates the similarity information of the corresponding amino acid located above a residue in a protein family will result in a notable improvement in assignment task. It should be noted that based on the MSA implemented by ClustalW, one-by-one dependencies between corresponding amino acids of two current sequences that model the similarity between them can be appended to the PHMM. In other words, we combine the similarity emission matrix obtained form the aligned sequences and common emission matrix obtained from the unaligned sequences to generalize the Baum-Welch algorithm. Table 4. The mean and standard error of the standard scores of assigning sequences to each protein family based on the emission matrixB B (3Kz3)|20 and similarity emission matrixŜ SE (3Kz3)|20 .