Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them focus on proteins with only one location. In recent years, researchers have begun to pay attention to subcellular localization prediction for proteins with multiple sites. However, almost all existing approaches fail to take into account the correlations among locations induced by multi-site proteins, which may provide important information for improving prediction accuracy on such proteins. In this paper, a new algorithm that can effectively exploit the correlations among locations is proposed based on a Gaussian process model. In addition, the algorithm realizes an optimal linear combination of various feature extraction technologies and is robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and achieves better performance than existing approaches.


Appendix: Solving $\bar{F}$ and $A$
Since the marginal likelihood $p(Y\,|\,D, C, a)$ does not depend on $F$, when maximizing the posterior distribution (9) with respect to $F$ we only need to consider the un-normalized posterior; that is, $\bar{F}$ may be obtained by maximizing $\psi(F)$ in (14). Differentiating (14) with respect to $F$ and setting the result to zero yields equation (17), where $\nabla \log p(Y\,|\,F) = \mathbf{1} \otimes d$, $\mathbf{1}$ is an $m$-dimensional column vector of ones, and $d = [d_{11}, d_{12}, \cdots, d_{1Q}, d_{21}, \cdots]^{\top}$. Newton's method can be used to solve equation (17), with a step length $\lambda$ chosen at each iteration by maximizing $\psi$ along the Newton direction; here, the dichotomy (bisection) method is used to solve for $\lambda$. By substituting $\bar{F}$ into the negative Hessian matrix (16), the matrix $A$ is obtained.
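For concreteness, the following is a minimal Python sketch of this damped Newton ascent with the step length found by bisection. The callables `grad_psi` and `neg_hess_psi` are hypothetical stand-ins for the paper's gradient of $\psi(F)$ and negative Hessian (16); this is an illustration of the procedure, not the exact implementation used in the paper.

```python
import numpy as np

def newton_with_line_search(F0, grad_psi, neg_hess_psi, tol=1e-6, max_iter=100):
    """Damped Newton ascent for the un-normalized log-posterior psi(F).

    grad_psi(F)     -> gradient vector of psi at F       (stand-in)
    neg_hess_psi(F) -> negative Hessian matrix of psi    (stand-in, eq. (16))
    The step length lam is found by bisection ("dichotomy method") on the
    directional derivative along the Newton direction.
    """
    F = F0.copy()
    for _ in range(max_iter):
        g = grad_psi(F)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(neg_hess_psi(F), g)   # Newton ascent direction
        lo, hi = 0.0, 1.0
        for _ in range(50):
            lam = 0.5 * (lo + hi)
            d = grad_psi(F + lam * step) @ step      # directional derivative
            if d > 0:
                lo = lam    # psi still increasing: move right
            else:
                hi = lam    # past the maximum along step: move left
        F = F + 0.5 * (lo + hi) * step
    return F
```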

Learning the matrix $C$ and the coefficients $a$
In the GP model, the hyperparameters can usually be obtained by maximizing the marginal likelihood (10). Since the integral in the marginal likelihood is intractable, one way to proceed is to derive a lower bound on the marginal likelihood and then solve for the parameters by maximizing that bound. In this paper, the lower bound $Z$ obtained by Kim and Ghahramani [Kim HC, Ghahramani Z (2006) Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 28: 1948-1959] is used. Because the parameters $\bar{F}$ and $A$ of $q(F\,|\,D, Y, C, a)$ are themselves functions of $C$ and $a$, it is difficult to maximize $\log Z$ directly. Here, an EM-like algorithm is used to solve this problem. In the E-step, we compute the values of $\bar{F}$ and $A$ by using (18) and (20) given the current parameters $C$ and $a$. In the M-step, $C$ and $a$ are updated by maximizing the lower bound $\log Z$ with $\bar{F}$ and $A$ fixed at the values obtained in the E-step. The E-step and M-step are alternated until convergence. Since the terms in (21) depend only on $\bar{F}$ and $A$, in the M-step we only need to maximize $\int q(F\,|\,D, Y, C, a) \log p(F\,|\,D, C, a)\, dF$. Substituting (2) and (11) into this integral yields the objective $\bar{Z}(C, a\,|\,\bar{F}, A)$, where $C^{o}$ and $a_j^{o}$ denote the values of $C$ and $a_j$ obtained in the last M-step, and $\{\bar{A}_{ls}\,|\,l, s = 1, 2, \cdots, m\}$ are square matrices of order $nQ$ forming the block representation of $A^{-1}$, i.e., $A^{-1} = (\bar{A}_{ls})_{m \times m}$. Differentiating $\bar{Z}(C, a\,|\,\bar{F}, A)$ with respect to $C$ and setting the derivative to zero gives a closed-form expression (24) for $C$ at the maximum of $\bar{Z}(C, a\,|\,\bar{F}, A)$. Substituting (24) back into (22) yields $\bar{Z}(a\,|\,\bar{F}, A)$. Therefore, we can first obtain $a$ by maximizing $\bar{Z}(a\,|\,\bar{F}, A)$ and then obtain $C$ from equation (24). In this paper, the conjugate gradient method is used to solve for $a$; in addition, to make the solution unique, $a_1$ is fixed to 1.
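The overall alternation can be summarized by the following sketch. The callables `compute_F_A`, `neg_log_Zbar_a`, and `C_from_a` are hypothetical wrappers for equations (18), (20), $\bar{Z}(a\,|\,\bar{F}, A)$, and (24); the sketch shows the E/M alternation, the conjugate-gradient step, and the $a_1 = 1$ constraint, under those assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def em_like_fit(C, a, compute_F_A, neg_log_Zbar_a, C_from_a, n_iter=50):
    """EM-like alternation for learning C and a (a sketch).

    compute_F_A(C, a)            -> (F_bar, A)   # E-step, eqs. (18), (20)
    neg_log_Zbar_a(a, F_bar, A)  -> float        # -Zbar(a | F_bar, A)
    C_from_a(a, F_bar, A)        -> C            # closed form, eq. (24)
    """
    for _ in range(n_iter):
        # E-step: posterior mean and covariance parameters for current C, a.
        F_bar, A = compute_F_A(C, a)
        # M-step: conjugate-gradient ascent in a, with a[0] pinned to 1
        # so that the scales of C and a are identifiable.
        free = minimize(lambda v: neg_log_Zbar_a(np.r_[1.0, v], F_bar, A),
                        a[1:], method='CG').x
        a = np.r_[1.0, free]
        # Recover C in closed form from the stationarity condition (24).
        C = C_from_a(a, F_bar, A)
    return C, a
```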

Computing $r_{ik}$
It can be seen from the preceding section that the posterior mean $\bar{F}$ can be obtained by minimizing the function $\tilde{\psi}$ in (26). From the viewpoint of regularization, the first term of (26) represents a smoothness assumption on $F$ as encoded by a suitable reproducing kernel Hilbert space, and the second term is a data-fit term assessing the quality of the prediction $F$ for the observed data $Y$. In the case of imbalanced data, the posterior mean $\bar{F}$ obtained from (26) tends to be dominated by the majority classes. An intuitive way to deal with this problem is to make fitting errors on minority-class data costlier than errors on majority-class data by assigning different weighting coefficients $r_{ik}$ to the data of different classes, i.e., by reweighting the second term of (26). Since the second term is the logarithm of the likelihood (5), this makes clear why the imbalance of the data is handled through the weighted likelihood (6). In this paper, $r_{ik}$ is computed from the per-location class sizes, where $n_i^{+} = |\{y_{ik}\,|\,y_{ik} = 1, k = 1, 2, \cdots, n\}|$ and $n_i^{-} = |\{y_{ik}\,|\,y_{ik} = -1, k = 1, 2, \cdots, n\}|$ denote the numbers of positive and negative samples for the $i$th location, respectively, and $|\cdot|$ denotes the cardinality of a set.
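The sketch below illustrates one standard inverse-frequency weighting consistent with this description; the particular scaling $n / (2 n_i^{+})$ for positives and $n / (2 n_i^{-})$ for negatives is an assumption chosen for illustration, not necessarily the paper's exact formula.

```python
import numpy as np

def class_weights(Y):
    """Per-label weighting coefficients r_ik (a sketch).

    Y is an m x n matrix with entries in {+1, -1}: m locations, n proteins.
    Minority-class samples receive larger weights so that fitting errors on
    them are costlier, as motivated in the text; the n/(2*count) scaling is
    an illustrative inverse-frequency assumption.
    """
    m, n = Y.shape
    n_pos = (Y == 1).sum(axis=1)       # n_i^+ for each location i
    n_neg = (Y == -1).sum(axis=1)      # n_i^- for each location i
    R = np.where(Y == 1,
                 (n / (2.0 * n_pos))[:, None],   # weight for positive samples
                 (n / (2.0 * n_neg))[:, None])   # weight for negative samples
    return R
```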

Reducing the computational complexity of IMMMLGP
Because of the need to invert a $Qn \times Qn$ matrix $B$, the computational complexity of training the IMMMLGP algorithm is about $O((Qn)^3)$, which is prohibitive for problems with large data sets. For problems with moderate $n$, we can reduce the cost of inverting $B$ by approximating $\sum_{j=1}^{m} a_j K_j \approx P P^{\top}$, where $P$ is an $n \times n_0$ matrix with $n_0 \ll n$. With this representation of $\sum_{j=1}^{m} a_j K_j$, $B^{-1}$ can be expressed in terms of the inverse of the much smaller matrix $B_1 = I + (P^{\top} \otimes L^{\top}) W_0 (P \otimes L)$, where $C = L L^{\top}$; thus, the problem is transformed into the inversion of a $Qn_0 \times Qn_0$ matrix. In this paper, the optimal reduced-rank approximation $U_{n_0} \Lambda_{n_0} U_{n_0}^{\top}$ of $\sum_{j=1}^{m} a_j K_j$ with respect to the Frobenius norm is used to obtain $P$, where $\Lambda_{n_0}$ is the diagonal matrix of the leading $n_0$ eigenvalues of $\sum_{j=1}^{m} a_j K_j$ and $U_{n_0}$ is the matrix of the corresponding eigenvectors. Thus, $P = U_{n_0} \Lambda_{n_0}^{1/2}$.
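The construction of $P$ follows directly from the eigendecomposition, as the short sketch below shows; the function name and the small numerical guard against negative round-off eigenvalues are the only additions beyond the text.

```python
import numpy as np

def low_rank_factor(K_list, a, n0):
    """Rank-n0 factor P with sum_j a_j K_j ~= P @ P.T (Frobenius-optimal).

    Eigendecompose the combined kernel, keep the n0 leading eigenpairs,
    and set P = U_{n0} Lambda_{n0}^{1/2}, as described in the text.
    """
    K = sum(aj * Kj for aj, Kj in zip(a, K_list))   # n x n combined kernel
    w, U = np.linalg.eigh(K)                        # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n0]                  # n0 leading eigenpairs
    lam = np.clip(w[idx], 0.0, None)                # guard round-off negatives
    return U[:, idx] * np.sqrt(lam)                 # P = U_{n0} Lambda^{1/2}
```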
Unfortunately, this approach is itself limited for problems with large $n$, because computing the eigendecomposition is an $O(n^3)$ operation. The Bayesian committee machine (BCM) [Tresp V (2000) A Bayesian committee machine. Neural Computation 12: 2719-2741] can be used to extend the IMMMLGP model to problems with large $n$. Instead of training one classifier on the whole training data set, the idea of the BCM is to split the training data into several subsets, train one sub-classifier on each of them, and then combine the predictions of the individual sub-classifiers using a weighting scheme. Let $\{S_1, S_2, \cdots, S_L\}$ be a partition of the training data set $S$, and let $C^{\alpha}$ and $a^{\alpha}$ denote the values of $C$ and $a$ obtained on the training subset $\cup_{l \in \alpha} S_l$, $\alpha \subset \{1, 2, \cdots, L\}$. Based on the Bayesian committee machine, we obtain the approximation (29) of the distribution (12). Thus, for problems with a large training set, we can compute the predictive probability (13) by using (29) instead of (12).
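As a sketch of the combination step, the following implements the standard BCM rule for Gaussian predictive distributions from Tresp (2000): sub-model precisions add, with the prior precision subtracted $L - 1$ times so that the prior is counted only once. Applying it to the approximation (29) of this paper's model is an illustration under that assumption, not the paper's exact derivation.

```python
import numpy as np

def bcm_combine(means, covs, prior_cov):
    """Bayesian committee machine combination of L Gaussian predictions.

    means, covs : predictive means and covariances from the L sub-models
                  trained on the subsets S_1, ..., S_L.
    prior_cov   : prior covariance at the test inputs.
    Returns the combined predictive mean and covariance.
    """
    L = len(means)
    prior_prec = np.linalg.inv(prior_cov)
    # Combined precision: sum of sub-model precisions, minus the prior
    # precision (L - 1) times so it enters the product only once.
    prec = sum(np.linalg.inv(S) for S in covs) - (L - 1) * prior_prec
    cov = np.linalg.inv(prec)
    mean = cov @ sum(np.linalg.inv(S) @ mu for S, mu in zip(covs, means))
    return mean, cov
```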