The authors have declared that no competing interests exist.
Conceived and designed the experiments: VR HP HPS. Performed the experiments: VR HP HPS. Analyzed the data: VR HP HPS. Contributed reagents/materials/analysis tools: VR HP HPS. Wrote the paper: VR HP HPS.
The profile hidden Markov model (PHMM) is widely used to assign protein sequences to their respective families. A major limitation of a PHMM is the assumption that, given the states, the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in a multiple sequence alignment (MSA), which is the representative of a PHMM, can be appended to the PHMM. Because the sequences in an MSA are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and the corresponding amino acid located above it can be combined with the PHMM. For this purpose, a new emission probability matrix that accounts for the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types, transition and emission probabilities, which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We generalize the Baum-Welch algorithm using a similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. The performance of the similarity emission is then assessed by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.
Structure and function determination of newly discovered proteins, using the information contained in their amino acid sequences, is one of the most important problems in genomics
The
The
The PHMM is specified as a triplet
The Baum-Welch algorithm works by guessing initial parameter values and then computing the likelihood of the observations under the current parameters. This likelihood is then used to re-estimate the parameters iteratively until a local maximum is reached. The Baum-Welch algorithm finds
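The iteration described above (initial guess, likelihood under the current parameters, re-estimation until a local maximum) can be illustrated with a minimal sketch. It assumes a plain discrete HMM with a single observation sequence rather than the full profile architecture with Match, Insert and Delete states, and the names `forward_backward`, `baum_welch_step` and `baum_welch` are hypothetical helpers introduced only for illustration. Scaling of the forward and backward variables is used here to avoid numerical underflow; this is a common device and not necessarily the one used in this paper.

```python
import math

def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes for a discrete HMM.

    obs: list of symbol indices; pi: initial probabilities;
    A[i][j]: transition probabilities; B[i][k]: emission probabilities.
    Returns scaled alpha, scaled beta, the scaling factors and the log-likelihood.
    """
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)                      # scaling factor for position t
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    # The log-likelihood is the sum of the logs of the scaling factors.
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; returns new parameters and the log-likelihood
    of the observations under the parameters passed in."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, scale, loglik = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of being in state i at position t.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            # Expected number of i -> j transitions.
            num = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                      / scale[t + 1] for t in range(T - 1))
            new_A[i][j] = num / denom
    new_B = [[0.0] * m for _ in range(n)]
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T))
        for k in range(m):
            new_B[i][k] = sum(gamma[t][i] for t in range(T) if obs[t] == k) / denom
    return new_pi, new_A, new_B, loglik

def baum_welch(obs, pi, A, B, max_iter=50, tol=1e-8):
    """Iterate until the log-likelihood gain falls below tol or max_iter is hit."""
    history = []
    for _ in range(max_iter):
        pi, A, B, loglik = baum_welch_step(obs, pi, A, B)
        if history and loglik - history[-1] < tol:
            history.append(loglik)
            break
        history.append(loglik)
    return pi, A, B, history
```

In this sketch the recorded log-likelihoods should never decrease from one iteration to the next, which reflects the local-maximum behaviour mentioned above; different initial guesses may converge to different local maxima.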
Based on the MSA, one-by-one dependencies between corresponding amino acids of two current sequences, which model the similarity between them, can be appended to the PHMM. This approach is similar in spirit to the works proposed by Holmes
In our approach, however, the dependency between two current sequences, based on the similarity between them, is appended to the PHMM. Because the sequences in an MSA are biologically related, we can use the MSA to find the areas of similarity between two current sequences. The MSA is therefore used to account for the one-by-one dependency between observations. In other words, the dependency between a residue and the corresponding amino acid located above it can be combined with the PHMM. The resulting new parameters of the PHMM, called similarity emission (SE) probabilities, must then be estimated.
Note that the similarity emission probabilities are estimated from the MSA and then combined with the common emission probabilities estimated by the Baum-Welch algorithm, which generalizes the Baum-Welch algorithm. In other words, both aligned and unaligned sequences are used: aligned sequences for estimating the similarity emission probabilities, and unaligned sequences for estimating the common emission and transition probabilities.
In this paper, we first construct a PHMM. Then, using an MSA, we build the similarity emission (SE) matrix to capture the similarity information and generalize the Baum-Welch algorithm. Finally, we compare the results of applying the similarity emission in the Baum-Welch algorithm with the results of the commonly used emission for sequence alignment. For this purpose we use real data from the top twenty protein families in the Pfam database
The profile hidden Markov model (PHMM) is a useful method to determine distantly related proteins by sequence comparison
Following Durbin
The sequences appearing in the final multiple sequence alignment are written based on their similarity
S: a set of lattice points
s: a lattice point,
Emissions
Hidden states
Transition probabilities on the lattice: a matrix
Emission probabilities on the lattice: a matrix
Emission probabilities on the lattice based on the above position: a matrix
Initial value: the probability of starting state at
The likelihood of the parameters (
Define auxiliary forward variable
Define backward variable
Calculate
Define variable
Since the matrix
In the MSA matrix, the frequencies of ordered pairs of 20 amino acids and the gap, i.e.,
In each column of matrix
We assume that the
Since the Begin and the End states are silent and do not emit any symbols, two rows of zeros can be added at the beginning and the end of matrix
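The counting of ordered pairs in the MSA can be sketched as follows. This is only an illustration under stated assumptions: the pairs are taken between each row of the alignment and the row directly above it, column by column, and the normalization shown (conditioning on the symbol above) is an assumption, since the exact definition of the probability matrix is not reproduced here. The names `ALPHABET`, `vertical_pair_counts` and `conditional_probs` are hypothetical.

```python
from collections import Counter

# 20 amino acids plus the gap symbol (21 symbols in total).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def vertical_pair_counts(msa):
    """For each alignment column, count ordered pairs (symbol above, symbol below)
    over all pairs of adjacent rows in the MSA."""
    n_cols = len(msa[0])
    for row in msa:
        if len(row) != n_cols:
            raise ValueError("all aligned sequences must have the same length")
        if any(ch not in ALPHABET for ch in row):
            raise ValueError("unexpected symbol in alignment")
    counts = []
    for c in range(n_cols):
        col_counts = Counter()
        for above, below in zip(msa, msa[1:]):
            col_counts[(above[c], below[c])] += 1
        counts.append(col_counts)
    return counts

def conditional_probs(col_counts):
    """Estimate P(symbol below | symbol above) within one column.
    This particular normalization is an assumption made for the sketch."""
    totals = Counter()
    for (above, _below), k in col_counts.items():
        totals[above] += k
    return {pair: k / totals[pair[0]] for pair, k in col_counts.items()}

# Toy alignment with four sequences and four columns (illustrative data only).
toy_msa = ["AC-D", "ACGD", "AC-D", "GCGE"]
toy_counts = vertical_pair_counts(toy_msa)
toy_probs_col0 = conditional_probs(toy_counts[0])
```

In the toy alignment, the first column contains the adjacent pairs (A, A), (A, A) and (A, G), so the estimated probability of G below A is 1/3 in that column.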
The Baum-Welch algorithm defines an iterative procedure for estimating the parameters. It computes maximum likelihood estimators for the unknown parameters given observation
Count the frequencies of ordered pairs of 20 amino acids and the gap, i.e.,
Calculate the probability matrix
Choose the highest probability for each set of twenty-one probabilities of each column of matrix
Transpose the matrix
Write directly the values of Match states of
Add zero rows after each Match and Insert states to the
Use the Hadamard product, that is, the entrywise product of
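The last step above relies on the Hadamard (entrywise) product of two matrices of the same shape. A minimal sketch is given below, with rows standing for states and columns for symbols; the matrices `common_emission` and `similarity` are small made-up examples, and the row renormalization in `normalize_rows` is an assumption added for illustration, not a step stated in the procedure.

```python
def hadamard(X, Y):
    """Entrywise product of two matrices given as lists of rows."""
    if len(X) != len(Y) or any(len(a) != len(b) for a, b in zip(X, Y)):
        raise ValueError("matrices must have the same shape")
    return [[a * b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def normalize_rows(M):
    """Rescale each non-zero row to sum to one (an assumption of this sketch);
    all-zero rows, such as those of silent states, are left unchanged."""
    out = []
    for row in M:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else row[:])
    return out

# Illustrative 3-state, 3-symbol matrices; the middle row plays the role
# of a zero row added for a state that emits nothing.
common_emission = [[0.5, 0.3, 0.2], [0.0, 0.0, 0.0], [0.1, 0.6, 0.3]]
similarity = [[0.8, 0.1, 0.1], [0.0, 0.0, 0.0], [0.2, 0.2, 0.6]]

combined = hadamard(common_emission, similarity)
combined_normalized = normalize_rows(combined)
```

Because the product is taken entry by entry, a zero row in either matrix stays a zero row in the result, which is why zero rows can be used as placeholders for non-emitting states.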
Pfam is a well-known database of protein families
profile  Number of sequences
  Seed  Full
ABC_tran  60  163029
RVT_1  155  126258
COX1  94  118265
GP120  24  105452
WD40  1842  101999
RVP  50  93675
zf-C2H2  195  88330
Response_reg  57  75322
Cytochrom_B_N  92  70463
HATPase_c  662  70410
BPD_transp_1  81  70027
MFS_1  196  69503
Oxidored_q1  33  60333
Pkinase  54  56691
Cytochrom  114  51006
RVT_thumb  41  50191
Adh_short  230  50144
Acetyltransf_1  243  46279
Helicase_C  491  42435
HTH_1  1556  41545
To assess the performance of our method, ten sequences are randomly removed from each of the top twenty families. These ten removed sequences in each family are used as test sequences, while the remaining sequences form the training set. We repeat this procedure ten times. Since some of the protein families contain few proteins (like GP120 and Oxidored_q1), we choose just ten samples. Therefore, each time we select 200 sequences, and in total 2000 sequences are randomly removed. Then we estimate the transition matrix
In this paper, due to computational challenges and round-off errors in estimating probabilities of
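The hold-out protocol described above can be sketched as follows, assuming each family is available as a list of sequences. The function name `holdout_splits`, the fixed random seed and the placeholder sequence labels are hypothetical and serve only to make the example reproducible.

```python
import random

def holdout_splits(sequences, n_test=10, n_repeats=10, seed=0):
    """Repeat a random hold-out split: n_test sequences are removed as the
    test set and the remaining sequences form the training set."""
    if n_test >= len(sequences):
        raise ValueError("family is too small for the requested test size")
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        test_idx = set(rng.sample(range(len(sequences)), n_test))
        test = [s for i, s in enumerate(sequences) if i in test_idx]
        train = [s for i, s in enumerate(sequences) if i not in test_idx]
        splits.append((train, test))
    return splits

# Placeholder family with 24 sequences, the seed size listed for GP120.
family = [f"seq{i}" for i in range(24)]
splits = holdout_splits(family)

# 20 families x 10 test sequences x 10 repetitions.
total_removed = 20 * 10 * 10
```

With ten test sequences per family, twenty families and ten repetitions, the total number of removed sequences is 20 × 10 × 10 = 2000, in agreement with the counts given above.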
profile  Mean  Standard Error
  Using  Using  Using  Using
ABC_tran  6.200  9.100  0.805  0.482
RVT_1  9.102  9.723  0.588  0.531
COX1  5.529  9.34  0.534  0.482
GP120  9.034  9.980  0.460  0.405
WD40  7.515  8.601  0.672  0.520
RVP  6.129  8.802  0.801  0.672
zf-C2H2  1.980  9.001  0.534  0.578
Response_reg  8.456  8.991  0.555  0.612
Cytochrom_B_N  7.800  8.901  0.850  0.601
HATPase_c  7.098  9.992  0.640  0.504
BPD_transp_1  7.091  8.002  0.605  0.604
MFS_1  8.409  8.997  0.583  0.538
Oxidored_q1  8.001  8.973  0.593  0.471
Pkinase  2.009  8.623  0.981  0.812
Cytochrom  8.032  9.010  0.524  0.503
RVT_thumb  6.839  8.902  0.835  0.561
Adh_short  6.998  8.572  0.984  0.607
Acetyltransf_1  6.504  9.760  0.551  0.504
Helicase_C  7.228  8.423  0.682  0.634
HTH_1  1.734  7.991  0.609  0.684
profile  Mean  Standard Error
  Using  Using  Using  Using
ABC_tran  −0.834  −0.503  0.054  0.043
RVT_1  −0.546  −0.504  0.213  0.113
COX1  0.789  0.881  0.085  0.054
GP120  0.115  0.234  0.085  0.079
WD40  0.356  0.487  0.076  0.065
RVP  0.244  0.307  0.082  0.058
zf-C2H2  −0.567  −0.523  0.048  0.043
Response_reg  −0.775  −0.709  0.061  0.062
Cytochrom_B_N  2.143  3.4452  0.233  0.231
HATPase_c  1.814  3.651  0.202  0.200
BPD_transp_1  0.807  0.718  0.069  0.058
MFS_1  −0.213  −0.035  0.082  0.044
Oxidored_q1  −0.403  −0.352  0.050  0.078
Pkinase  −0.046  0.567  0.070  0.065
Cytochrom  −0.749  −0.757  0.089  0.055
RVT_thumb  0.005  0.142  0.057  0.021
Adh_short  −0.550  −0.523  0.079  0.078
Acetyltransf_1  0.453  0.501  0.053  0.059
Helicase_C  0.478  0.501  0.078  0.076
HTH_1  0.640  0.703  0.070  0.052
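The tables above report a mean and a standard error for each family. As a hedged illustration of how such summaries are commonly computed, the sketch below takes a list of scores and returns their mean and the standard error of the mean (sample standard deviation divided by the square root of the sample size). Whether the reported standard errors were computed in exactly this way is an assumption, and both the function name `mean_and_standard_error` and the example scores are made up for illustration.

```python
import math

def mean_and_standard_error(scores):
    """Return the sample mean and the standard error of the mean."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed")
    mean = sum(scores) / n
    # Unbiased sample variance (divisor n - 1).
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return mean, math.sqrt(variance / n)

# Made-up scores, not taken from the paper.
example_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
example_mean, example_se = mean_and_standard_error(example_scores)
```

A smaller standard error for one method than for another, at a similar or higher mean, suggests that its scores are more stable across the test sequences.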
Vahid Rezaei is grateful to the School of Computer Science at IPM (No. CS1391421) and to the Department of Statistics and Actuarial Science, University of Waterloo. This work was completed while he was a visiting scholar at the University of Waterloo. Hamid Pezeshk would like to thank the Department of Research Affairs of the University of Tehran and the Biomath research group of IPM.