Searching Remote Homology with Spectral Clustering with Symmetry in Neighborhood Cluster Kernels

Remote homology detection among proteins utilizing only the unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles and the Markov clustering algorithms are currently the most popular methods for protein family recognition. The deviation from random walks with inflation or dependency on hard threshold in similarity measure in those methods requires an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity for enhancing sensitivity in detecting homology independent of “recent” paralogs. The spectral clustering approach with new combined local alignment kernels more effectively exploits the unsupervised protein sequences globally, reducing inter-cluster walks. When combined with the corrections based on a modified symmetry based proximity norm that deemphasizes outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also shows the superior performance scores provided by the proposed kernels. A similar performance improvement is also found over an existing large dataset. Therefore the proposed spectral clustering framework over combined local alignment kernels with modified symmetry based correction achieves superior performance for unsupervised remote homolog detection, even in multi-domain and promiscuous domain proteins from Genolevures database families, with better biological relevance. Source code available upon request. Contact: sarkar@labri.fr.


Introduction
Remote homology detection from available protein sequences is a fundamental problem in comparative genomics. When sequence similarity is high, a panoply of methods can detect homologs accurately. However, detecting remote homologs with subtle sequence similarity remains a challenging problem.
Recently, discriminative kernel methods with SVMs, such as mismatch string kernels [6,9], string alignment kernels [10] and profile-based direct kernels [11], exhibited better homology detection. These methods require extensively annotated proteins for training to yield good performance. The protein-structure kernel on the MAMMOTH score in [12] and the combined approach of sequence and secondary-structure similarity scores in [13] also proved to be efficient. Incremental kernels [14], multi-instance kernels [13] and gapped Markov-feature pairs [15] are recent approaches for homology detection.
To compute sequence distances, some groups utilized Connected Component Analysis (CCA) [16] on fully-connected graphs, as in GeneRAGE [17]. To improve on them, the Markov cluster algorithm (MCL) [18] utilizes random walks on a Markov transition matrix to analyse the emergence of clusters in the graph encoded by this matrix. The most successful methods for homology detection utilizing the MCL algorithm [18] are OrthoMCL [19] and TribeMCL [20], which bias the random walks with an 'inflation' parameter to promote cluster emergence. The earlier non-kernel approach of [21] makes significant use of spectral clustering on protein sequences.
Semi-supervised protein clustering was made efficient earlier by introducing the neighborhood vector over profiles in cluster kernels [22], [23]. The combined kernel approach using a bagging method over mismatch-string kernels [22] utilized the strength of combined clustering for remote homology. Protein-function prediction with kernels on yeast genomes [24] introduced one kernel matrix for combining heterogeneous data.
Symmetry is an inherent feature that enhances recognition and reconstruction of shapes and objects. It proves to be powerful for recognizing homolog protein clusters in kernel space. In [25] a symmetry based distance measure is proposed, yet it fails to detect clusters with inherent symmetry relative to some intermediate point. Subsequently, the distance norm was corrected in [26], leading to a modified proximity norm that is able to handle overlapping symmetrical clusters with multiclass points.
In this work, we first develop new valid Mercer kernels based explicitly on similarities from local alignment methods like BLAST and PSI-BLAST. We present two positive semidefinite local-alignment kernels based on the singular value decompositions of, respectively, MCL similarity scoring and position-specific scoring matrices (profiles). The Markov cluster similarity kernel is further enhanced with the neighborhood feature vectors. Furthermore, incorporating the mismatches with profiles reduces the diagonal dominance problem. This enables more accurate detection of remote homologs, boosted by similarity deemphasizing multi-domain proteins. To reduce promiscuous domain problems, we further apply the spectral clustering approach over the kernel matrices to alleviate inter-cluster edges, implicitly selecting the leading eigenvectors from 'global' distances without using any hard threshold. Finally, we introduce the modified-symmetry based correction over the homolog distributions in Hilbert space. This reduces the number of singletons (represented as outliers) and classifies multi-domain proteins into more biologically significant clusters with the closest nearest-neighbor homologs from different domains. In contrast with earlier discriminative approaches, this approach detects remote homology among unlabelled multi-domain proteins without any prior annotation. Local-alignment kernels or Markov similarities are combined in cascade with neighborhoods in spectral clustering, and are further enhanced by the modified-symmetry based correction.
We evaluate all our kernel frameworks on the multi-domain proteins from the Genolevures yeast database [27]. The performance of our combined spectral kernels with modified symmetry is compared to other state-of-the-art combined cluster kernel methods. The experimental outcomes also demonstrate the superiority of introducing modified symmetry over kernel space with spectral clustering to detect remote homologs more accurately, even for multi-domain and promiscuous domain proteins. Moreover, we perform statistical and quantitative performance evaluations with five validity measures to demonstrate the significance of our proposed approaches. We also compare the results of the existing string [28] and mismatch [22] kernels with those of our proposed kernels on our chosen dataset. To experiment on a large dataset, we compare the clustering solutions of our proposed kernels with those of the existing string [28] and mismatch [22] kernels over the sequences of the 54 target families from SCOP version 1.59 [22]. The scores provided by all algorithms also show the superiority of our proposed kernels with higher values.

Background
In this section, we briefly describe existing state-of-the-art cluster kernel methods for remote homolog protein detection and the modified symmetry based distance measure for clustering.
Spectral clustering. In semi-supervised learning, [23] introduced cluster kernels that modify the eigenspectrum of a kernel matrix. The spectral clustering kernel boils down to spectral graph partitioning in the subspace of the $k$ largest eigenvectors of a normalized affinity/kernel matrix [29]. Let us assume an undirected graph $G = (V, E)$ with vertices $v_i \in V$, for $i = 1, \ldots, n$, and edges $e_{i,j} \in E$ with non-negative weights $s_{i,j}$ expressing the similarity between vertices $v_i$ and $v_j$. Then the eigenvectors $(v_1, \ldots, v_k)$ are those of the normalized matrix built from $K$, where $K$ is the RBF kernel interpreted as the transition matrix of a random walk on the graph.
The spectral clustering approach produced qualified clusters from protein sequences earlier [21], [23], following the work of Weiss [29] and Meila and Shi [30] to simultaneously analyse $k$ eigenvectors before normalizing.
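As a rough illustration of the eigenvector step described above, the following Python sketch embeds the vertices of a small affinity matrix using the eigenvectors of its $k$ largest eigenvalues. The symmetric normalization $D^{-1/2} K D^{-1/2}$, the function name `spectral_embedding` and the toy matrix are assumptions made for illustration; this is not the kernlab implementation used in the experiments.

```python
import numpy as np

def spectral_embedding(K, k):
    """Return row-normalized top-k eigenvectors of D^{-1/2} K D^{-1/2}.

    K is a symmetric, non-negative affinity/kernel matrix (assumed to
    have strictly positive row sums); k is the number of clusters.
    """
    K = np.asarray(K, dtype=float)
    degree = K.sum(axis=1)                    # D_ii = sum_j K_ij
    inv_sqrt = 1.0 / np.sqrt(degree)
    normalized = K * inv_sqrt[:, None] * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(normalized)   # eigenvalues in ascending order
    top = eigvecs[:, -k:]                     # eigenvectors of the k largest
    return top / np.linalg.norm(top, axis=1, keepdims=True)

# Hypothetical toy affinity matrix with two obvious groups: {0, 1} and {2, 3}.
K_toy = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.9, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.8],
                  [0.0, 0.0, 0.8, 1.0]])
embedding = spectral_embedding(K_toy, k=2)
```

In this toy case the rows of `embedding` that belong to the same group coincide, so a simple clustering of the rows (for example k-means) recovers the two groups.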
Neighborhood mismatch kernel. To project the selection of closely related neighbor sequences through evolution from PSI-BLAST profiles in the mismatch kernel, [22] defined a neighborhood kernel over the feature representation $\phi_{orig}(x)$ as shown below:

$$\phi_{nbd}(x) = \frac{1}{|Nbd(x)|} \sum_{x' \in Nbd(x)} \phi_{orig}(x') \qquad (1)$$

where $Nbd(x)$ denotes the neighborhood of sequence $x$, i.e. the set of sequences $x'$ with an E-value less than a fixed threshold in a PSI-BLAST/BLASTP search. As they proved that the neighborhood averaged vector $\phi_{nbd}(x)$ stays within the convex hull of all vectors in the neighborhood [22], this kernel boosts the protein classification performance.
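A minimal sketch of the neighborhood averaging, written in kernel form: averaging feature vectors over two neighborhoods is equivalent to averaging the base kernel over all pairs of neighbors. The function name, the toy base kernel and the neighbor lists are hypothetical; in practice the neighborhoods would come from E-value thresholded PSI-BLAST/BLASTP hits.

```python
def neighborhood_kernel(K, neighbors):
    """Average a base kernel over pairs of neighbors.

    K is an n x n base kernel given as nested lists; neighbors[i] lists
    the indices of the sequences in Nbd(i) (assumed non-empty).
    """
    n = len(K)
    K_nbd = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = sum(K[a][b] for a in neighbors[i] for b in neighbors[j])
            K_nbd[i][j] = total / (len(neighbors[i]) * len(neighbors[j]))
    return K_nbd

# Hypothetical base kernel and neighborhoods for three sequences.
K_base = [[1.0, 0.5, 0.0],
          [0.5, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
nbd = [[0, 1], [0, 1], [2]]
K_nbd = neighborhood_kernel(K_base, nbd)
```

Sequences 0 and 1 share the same neighborhood here, so they receive identical rows in `K_nbd`, which is the smoothing effect the neighborhood kernel is meant to provide.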
Modified symmetry based distance measure. Among the different distance measures for clustering, such as Euclidean, Pearson correlation or Spearman distance, none can detect symmetrical overlapping clusters. Su and Chou [25] proposed a symmetry based distance measure $d_s$ between a pattern $x$ and a reference centroid $c$ as follows:

$$d_s(x, c) = \frac{d_1}{d_e(x, c) + d_e(x_1, c)} \qquad (2)$$

where $x_1$ is the symmetrical point of $x$ with respect to $c$, and $d_e(x, c)$ and $d_e(x_1, c)$ are the Euclidean distances between $x$ and $c$ and between $x_1$ and $c$, respectively. If $x' = (2 \cdot c - x)$ is the reflection of $x$ about $c$ and $x_1$ is its first nearest neighbor, then $d_1$ represents the Euclidean distance between $x_1$ and $x'$. To improve the effect of this symmetry-based distance norm even for inter-symmetrical clusters, Chou et al. [26] proposed a modified measure $d_c$ as defined below:

$$d_c(x, c) = d_s(x, c) \times d_e(x, c) \qquad (3)$$

Therefore, to detect compact symmetrical overlapping clusters, we incorporate the modified-symmetry based distance measure [26]. This improves the biological significance of homology detection by reducing outliers, as we discuss later.
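The two distance measures can be illustrated with a short Python sketch. It assumes the commonly cited Su and Chou formulation, in which the point is reflected about the centroid and the nearest other data point to that reflection is used; the function name and the toy points are hypothetical.

```python
import math

def symmetry_distances(x, c, points):
    """Return (d_s, d_c) of point x relative to centroid c.

    points is the data set as a list of coordinate tuples; x itself is
    excluded when searching for the symmetrical point.
    """
    reflected = tuple(2 * ci - xi for xi, ci in zip(x, c))   # x' = 2c - x
    # x1: data point closest to the reflection of x about c
    x1 = min((p for p in points if p != x),
             key=lambda p: math.dist(p, reflected))
    d1 = math.dist(x1, reflected)
    d_s = d1 / (math.dist(x, c) + math.dist(x1, c))
    d_c = d_s * math.dist(x, c)                              # modified norm
    return d_s, d_c

# Hypothetical 2-D points around a centroid at the origin.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -1.0)]
centroid = (0.0, 0.0)
symmetric_case = symmetry_distances((1.0, 0.0), centroid, points)
asymmetric_case = symmetry_distances((0.0, 2.0), centroid, points)
```

The point (1, 0) has an exact mirror image in the data, so both distances are zero, whereas (0, 2) has only an approximate mirror image and receives positive values.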

Data
The Genolevures database explores nine complete genomes (Candida glabrata, Eremothecium gossypii, Kluyveromyces lactis, Yarrowia lipolytica, Zygosaccharomyces rouxii, Saccharomyces kluyveri, Kluyveromyces thermotolerans, Debaryomyces hansenii, Saccharomyces cerevisiae) [31], [27] from the class of Hemiascomycete yeasts. The non-redundant protein-family database was generated by progressively taking protein-coding gene sequences following the family structures of Genolevures Release 3 candidate 3 data (2008-09-24) [32][33]. We use 323 sequences as unlabelled data from the 23 multiple-choice families GL3M.*, which are complicated families such as polyproteins and repeat domains. Proper homology detection among them is therefore a suitable test for our remote homology experiments. We use the Genolevures Release 3 candidate 3 [32] family structure as the true clusters for ROC analysis. Finally, we utilize 1000 sequences of the 54 target families from SCOP version 1.59, which were used earlier in [22], for testing the performance of our proposed kernels on a large dataset. This dataset contains the kernel matrices generated from BLAST, PSI-BLAST and spectrum mismatch kernels following the method of [22].

Methods
To explore remotely detected homologs even for multi-domain and promiscuous domain proteins, we define twelve simple and combined alignment cluster kernels in this section and evaluate them with spectral clustering.

Local alignment-based kernels
The local alignment kernel developed in [10] is based on SW (Smith-Waterman) scores. They measured the pairwise sequence similarity by summing up local alignment scores with sequence gaps. They use a convolution of kernels with a pointwise limit to the Mercer kernels. The probabilistic profiles of logarithmic E-values generated by local alignment methods like BLASTP or PSI-BLAST have recently been used for kernel generation, instead of the sequence encoding itself, for protein classification [21]. However, collecting these E-values for a pair of sequences into a matrix does not yield symmetric alignment scores. Averaging the log10 of the E-values between two sequences produces a symmetric kernel, which solves this problem in the MCL algorithm [20]. This symmetric matrix is represented as a connection graph with weighted edges between proteins, which is searched iteratively for probabilities of protein transitions, with matrix inflation performed by scaling the Hadamard power of the matrix. However, utilizing the HSP (high-scoring segment pair) score of BLASTP results directly resembles the functionality of the mismatch string kernel [6] to some extent. Therefore, instead of using the E-values as in earlier works for kernel formation, we utilize the BLASTP HSP score within the threshold cut-off to compute the kernel matrix, which also satisfies the biological relevance of searching out homologous sequences. We define this kernel as kernel (I).
Position specific scoring kernel. To explore the statistically significant alignments produced by BLASTP with the position-specific score matrix (PSSM), PSI-BLAST generates a score for the iterated gapped multiple alignment over a set of sequences [8]. We use the PSI-BLAST score directly for computing the kernel matrix, as it represents the similarity of homologous sequences in descending order more accurately than BLASTP [1,34]. Unfortunately, the matrix formed directly from PSI-BLAST scores between pairs of sequences is not positive semidefinite, as all-vs-all PSI-BLAST scores are not symmetric for a pair of sequences. However, if $P$ is taken to be a symmetric PSI-BLAST similarity score matrix with singular value decomposition $P = U^T D V$, where $D$ is the diagonal matrix $\mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with singular value entries $\lambda_1 \geq \ldots \geq \lambda_n \geq 0$, then we define the PSI-BLAST kernel by

$$K = U^T \psi(D) V \qquad (4)$$

where $\psi(D) = \mathrm{diag}(\psi(\lambda_1), \ldots, \psi(\lambda_n))$ and $\psi(\lambda) = 1 + \lambda$ if $\lambda > 0$, and $0$ otherwise. We normalize the kernel with the unit sphere projection $K(x, y) / \sqrt{K(x, x) K(y, y)}$. We identify this kernel as kernel (II). A related protein structure kernel based on the MAMMOTH score [12] previously yielded good performance in classifying proteins.
Markov cluster similarity scoring kernel. The Markov Cluster algorithm (MCL, http://micans.org/mcl/) [18] is a fast and reliable approach for complicated domain structures [20], which simulates random walks on a graph to detect the transition probabilities among its edges using Markov matrices. Several existing methods, including TribeMCL [20] and OrthoMCL [19], apply the MCL algorithm to detect protein clusters that consist of multi-species orthologs or recent paralogs. The scoring matrix used for MCL clustering in the OrthoMCL algorithm is initially computed as the average $-\log_{10}(P\text{-value})$ from pairwise WU-BLASTP similarities. These weights are then normalized by dividing the averaged edge weights $W_{i,j}$ of all ortholog pairs of two species $i$ and $j$ by the average weight $W$ of all multi-species ortholog and "recent" paralog pairs [19]. This minimizes the impact of "recent" paralogs in cross-species ortholog clusters. Therefore this normalized score emphasizes the remote homologs better than the BLASTP scores and also reduces the impact of "recent" paralogs in classification. We generate another kernel matrix using this score, which addresses the diagonal dominance issue, in which $K(x,x)$ is orders of magnitude larger than $K(x,y)$, by assigning arbitrary values to $K(x,x)$. To satisfy the positive semidefinite property in this kernel, we utilize the neighbors and the profile information to transform this matrix.
Neighborhood similarity kernel. We incorporate the neighborhood probabilistic representation of each input sequence over the MCL similarity scores explained above, following the earlier neighborhood mismatch kernel [22]. Initially we compute the neighborhood feature vector over the MCL scores and then generate the neighborhood similarity matrix as in equation 1. However, to satisfy the positive semidefinite property of our kernel, we compute the singular value decomposition of this matrix. We normalize the generated kernel to the [0,1] interval. We identify our OrthoMCL Neighborhood Mismatch (OMCL NM) kernel as kernel (III).
Mismatch profile kernel. To construct the kernel based on profile information, we generate a variant kernel with MCL similarity and PSI-BLAST profile-based scores. Following the profile mismatch kernel based on the spectrum kernel [23], we develop our kernel using the probabilistic profiles of sequences over the neighborhood of the Markov cluster similarity kernel. The singular value decomposition over our feature vector with the [0,1] interval normalization generates our new kernel with the positive semidefinite property. We identify our OrthoMCL Mismatch Profile (OMCL MP) kernel as kernel (IV).
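The decomposition-and-normalization step that the kernels in this subsection share can be sketched as follows. This is an assumed variant rather than the exact procedure of the paper: the raw scores are first symmetrized by averaging, an eigen-decomposition of the symmetric matrix stands in for the singular value decomposition, the spectrum is mapped with $1 + \lambda$ for positive values and $0$ otherwise, and the result is projected onto the unit sphere. The function name and the toy score matrix are hypothetical.

```python
import numpy as np

def psd_kernel_from_scores(S):
    """Build a symmetric, positive semidefinite, unit-diagonal kernel.

    S is a square matrix of raw pairwise alignment scores, which need
    not be symmetric. Assumes the resulting diagonal is strictly positive.
    """
    S = np.asarray(S, dtype=float)
    S_sym = 0.5 * (S + S.T)                      # average both directions
    lam, U = np.linalg.eigh(S_sym)               # symmetric decomposition
    psi = np.where(lam > 0, 1.0 + lam, 0.0)      # spectrum mapping from the text
    K = (U * psi) @ U.T                          # U diag(psi) U^T
    scale = np.sqrt(np.diag(K))
    return K / np.outer(scale, scale)            # unit-sphere projection

# Hypothetical asymmetric all-vs-all scores for three sequences.
scores = np.array([[50.0, 20.0, 1.0],
                   [24.0, 60.0, 2.0],
                   [3.0, 2.0, 40.0]])
K_psd = psd_kernel_from_scores(scores)
```

After the projection every sequence has self-similarity 1, which also removes the diagonal dominance of the raw scores in this toy case.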

Combined spectral kernel clustering
The position specific scoring kernels are based on singular value decompositions and are therefore Mercer kernels. Likewise, the neighborhood similarity kernel and the mismatch profile kernel are also proved to be Mercer kernels. We define the kernels combining PSI-BLAST with the OMCL NM and OMCL MP kernels as kernels (VI, VIII), and similarly the combined BLASTP kernels with the OMCL NM and OMCL MP kernels as kernels (V, VII). Therefore our combined local alignment kernels (V, VI, VII, VIII), which are the tensor products $K(i,j) = K_1(i,j) \cdot K_2(i,j)$ of those simple alignment and modified Markov cluster similarity kernels, are also valid Mercer kernels [35][36][37].
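The combination rule can be illustrated with a few lines of Python. The sketch multiplies two kernel matrices entry by entry and checks numerically that the result is still positive semidefinite, as the Schur product theorem guarantees for positive semidefinite inputs. The function name and both toy matrices are hypothetical.

```python
import numpy as np

def combine_kernels(K1, K2):
    """Entry-wise product K(i,j) = K1(i,j) * K2(i,j) of two kernel matrices."""
    return np.asarray(K1, dtype=float) * np.asarray(K2, dtype=float)

# Two hypothetical 2 x 2 positive semidefinite kernels.
K_align = np.array([[1.0, 0.6],
                    [0.6, 1.0]])
K_markov = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
K_combined = combine_kernels(K_align, K_markov)
smallest_eigenvalue = np.linalg.eigvalsh(K_combined).min()
```

The off-diagonal entry of the product is smaller than in either input, so a pair of sequences is scored as similar only when both kernels agree.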
For unsupervised classification, we apply the spectral clustering method directly to the combined local alignment cluster kernel matrices without using a transductive setting as in [22]. [12] established the well-clustered approach of spectral clustering over protein sequences. This random walk based graph partitioning method identifies the tightly coupled clusters and cuts the inter-cluster edges, thus explicitly removing the promiscuous domain problem.
This algorithm also constructs the Markov transition matrix as used in the Markov Clustering algorithm (MCL) [20], but differs in that it analyses the perturbation to the stationary distribution following a Markovian relaxation process [12], utilizing the eigenvectors corresponding to the leading eigenvalues of the matrix. As this method does not need to modify the random walks with the 'inflation' relaxation parameter of OrthoMCL [19] and TribeMCL [20], it outperforms those methods in the accuracy of the resulting clusters with respect to the true classifications.
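For contrast, the inflation operation that MCL-based methods rely on, and that the spectral approach avoids, can be sketched as one expansion step followed by one inflation step on a column-stochastic matrix. The function name `mcl_iteration`, the inflation value of 2.0 and the toy matrix are illustrative assumptions, not the settings of OrthoMCL or TribeMCL.

```python
import numpy as np

def mcl_iteration(M, inflation=2.0):
    """One MCL round: expansion (matrix square) then inflation.

    M is a column-stochastic transition matrix. Inflation raises every
    entry to the given power (a Hadamard power) and rescales each
    column so that it sums to 1 again.
    """
    expanded = M @ M                              # random walk of two steps
    inflated = np.power(expanded, inflation)      # entry-wise power
    return inflated / inflated.sum(axis=0, keepdims=True)

# Hypothetical two-state transition matrix (columns sum to 1).
M_toy = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
M_next = mcl_iteration(M_toy)
```

Inflation sharpens each column towards its largest entry, which is how MCL biases the walk; the strength of that bias depends on the chosen inflation value, whereas the spectral approach reads the cluster structure from the leading eigenvectors instead.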

Modified symmetry in kernel space
The modified-symmetry based distance measure $d_c$ [26], as defined in equation 3, considers the nearest neighbor of symmetrical points among clusters to compute distances. The distance between a point and its nearest neighbor in the Hilbert space produces significantly higher values in the case of outliers. Therefore scaling it with the Euclidean distance between the point and the centroid distinguishes outliers with much higher values. Correcting clusters with a lower modified symmetry norm ($d_c$) value imposes compact clusters, reducing outliers over kernel space. We can define the modified symmetry based reassignment of a point $x$ to cluster $c$ as:

$$c = \arg\min_{k} \, d_c(x, C_k) \qquad (5)$$

where $k$ ranges over the clusters, $C_k$ is the centroid of the $k$th cluster and $d_c$ is as defined in Eq 3. Furthermore, to prove the non-negative definiteness of the spectral kernel with modified symmetry, for arbitrary $\{x_1, \ldots, x_n\}$ we can show that the corresponding quadratic form is non-negative, where $d_s(x_i) = d_s(x_i, C_k)$ and $d_e(x_i) = d_e(x_i, C_k)$ are related to $c$ in Eq 5 using Equation 3 and are always $\geq 0 \; \forall i, j$.
Therefore the spectral kernel matrix with modified symmetry norms is itself positive semidefinite. Alternatively, let $K$ be a positive semidefinite spectral kernel. Then, for arbitrary $\{x_1, \ldots, x_n\}$, if $e$ represents $d_e(x_i)$ with $e \in \mathbb{R}^n$, we obtain a non-negative quadratic form, where $c \in \mathbb{R}^n$ and any $c_i = d_s(x_i) d_e(x_i)$ following Eq 5. Therefore the corrected kernel is a valid kernel function. Accordingly, we correct the combined spectral kernel results with modified symmetry by reallocating a protein to the cluster with its optimal modified symmetry distance norm, provided this norm is less than the pre-defined threshold $\theta = 0.18$ [25]. With respect to the original "true" clusters, this creates good overlapping symmetrical clusters, which are more relevant to homology detection, as discussed in the Discussion section. We define the spectral clustering solutions after modified symmetry based redistribution for the combined BLASTP kernel with the OMCL NM and OMCL MP kernels as kernels (IX, XI), respectively, and for the combined PSI-BLAST kernel with the OMCL NM and OMCL MP kernels as kernels (X, XII), respectively.
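The reassignment rule can be sketched in Python as follows. The sketch works on explicit coordinates rather than on a kernel-induced feature space, assumes the Su and Chou style symmetry distance, and uses hypothetical names and toy data; only the threshold value 0.18 is taken from the text.

```python
import math

THETA = 0.18  # threshold value quoted in the text

def modified_symmetry_norm(x, centroid, points):
    """d_c(x, c) = d_s(x, c) * d_e(x, c) for one point and one centroid."""
    reflected = tuple(2 * c - v for v, c in zip(x, centroid))
    # nearest data point (other than x itself) to the reflected point
    x1 = min((p for p in points if p != x),
             key=lambda p: math.dist(p, reflected))
    d_e = math.dist(x, centroid)
    d_s = math.dist(x1, reflected) / (d_e + math.dist(x1, centroid))
    return d_s * d_e

def reassign(x, current, centroids, points, theta=THETA):
    """Move x to the cluster with the smallest d_c if that value is below theta."""
    norms = [modified_symmetry_norm(x, c, points) for c in centroids]
    best = min(range(len(centroids)), key=norms.__getitem__)
    return best if norms[best] < theta else current

# Hypothetical data: two clusters centred at (0, 0) and (6, 0).
points = [(1.0, 0.0), (-1.0, 0.0), (5.0, 0.0), (7.0, 0.0), (6.0, 1.0), (6.0, -1.0)]
centroids = [(0.0, 0.0), (6.0, 0.0)]
moved = reassign((6.0, 1.0), 0, centroids, points)   # wrongly placed in cluster 0
kept = reassign((3.0, 3.0), 1, centroids, points)    # no symmetric partner anywhere
```

A point with a mirror image about some centroid is pulled into that cluster, while a point whose best norm stays above the threshold keeps its current assignment.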

Results
In this section, the framework for the experiments and the comparative results of all local alignment kernels and combined spectral kernels after modified symmetry based correction are described. A comparative study of the clustering solutions of the existing string [28] and mismatch [22] kernels is also included. Similarly, we perform the experiments over one large dataset to evaluate the performance of all the kernel algorithms.

Evaluation framework
Several frameworks have been implemented for demonstrating the performance of the twelve different kernels proposed in this article. The PSI-BLAST [8,34] iterations with composition based statistics [38] are performed on a cluster with 62 Opteron nodes [2.60 GHz, 322.4 GFLOPs] using MPIBlast and the command-line program blastpgp. We implement OrthoMCL version 2.0 [19] for our experiments. All the kernels are generated in Matlab v7.10 (R2010a) 64-bit. The normalized spectrum kernel with subsequence/string length = 4 settings in the Kernel-based Machine Learning Lab (kernlab) package [39] in R [40] from CRAN is used. This package is utilized for spectral clustering [29] over all our local alignment and combined kernel matrices. The spectral clustering results of all methods are evaluated using the receiver operating characteristic (ROC) score, commonly called Area Under ROC Curve (AUC), and the ROC-50, which is the ROC score or AUC computed only up to the first 50 false positives. For the ROC [41] analysis of the kernel matrices, the ROCR package [42] has been used. Finally, the CRAN statistical package R [40] with the RCommander library [43] has been used for the Wilcoxon signed rank test. The modified symmetry based clustering approach has been implemented using MPICH. We utilize the existing string kernel of the LIBSVM [28] software for comparing its results with our kernel clustering results. We also experiment on our chosen dataset with the pre-existing spectrum mismatch [22] kernels on SVM. To verify the performance on a large dataset, we execute all our proposed kernels as well as the existing kernels on the chosen 54 target families from SCOP version 1.59 [22] from the literature, as mentioned in the Data section. We also utilize the linear kernel with SVM of the SPIDER [44] framework in MATLAB to obtain the comparative results. Table 1 summarizes the performance achieved by the local alignment based kernels for family-level classification implemented with spectral clustering.
We measure the performance of the BLASTP kernel (I), PSI-BLAST kernel (II), OrthoMCL Neighborhood Mismatch (OMCL NM) kernel (III) and OrthoMCL Mismatch Profile (OMCL MP) kernel (IV) in classifying the multi-domain protein families of our dataset with mean ROC and mean ROC50 scores. These results show that the OMCL MP kernel (IV) performs best of all methods, indicating the influence of profiles in homolog detection. All the modified local alignment kernels outperform the simple score based kernels in this experiment. As an illustration, the distribution of ROC50 scores for all local alignment-based kernels is shown in Figure 1, which plots the number of families whose ROC50 scores are greater than a given threshold in the range [0,1]. The modified kernels from OMCL scores, namely the OMCL NM (III) and OMCL MP (IV) kernels, retrieve approximately two times higher ROC50 scores than the two simple score based BLASTP (I) and PSI-BLAST (II) kernels for a similar number of families.
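The ROC50 score reported in these results can be computed along the following lines. This Python sketch is not the ROCR code used in the experiments: it accumulates, for each of the first n false positives in the ranked list, the number of true positives ranked above it, and normalizes by n times the number of positives. Capping n at the number of available negatives is an assumption made here so that a perfect ranking always scores 1.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives.

    scores: similarity scores (higher means more likely a family member).
    labels: 1 for true family members, 0 for non-members.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    limit = min(n, len(labels) - total_pos)   # cannot exceed available negatives
    if total_pos == 0 or limit == 0:
        return 0.0
    true_pos = false_pos = area = 0
    for _, is_member in ranked:
        if is_member:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos                  # positives ranked above this negative
            if false_pos == limit:
                break
    return area / (limit * total_pos)

# Hypothetical ranked scores for one family.
toy_scores = [0.9, 0.8, 0.3, 0.1]
perfect = roc_n(toy_scores, [1, 1, 0, 0])
mixed = roc_n(toy_scores, [1, 0, 1, 0])
```

With the default n = 50 and many more than 50 negatives, only the top of the ranking contributes, which is why ROC50 is stricter than the full ROC score.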

Performance of combined spectral kernels
In order to investigate the performance of our spectral kernels over simple alignment kernels, we combine all modified local alignment kernels using the normal product. Combining PSI-BLAST with the OMCL NM (VI) and OMCL MP (VIII) kernels provides ROC values of 0.757 and 0.773, respectively, in Table 2, which are superior to the values 0.738 and 0.752 obtained by combining the BLASTP kernel with the OMCL NM (V) and OMCL MP (VII) kernels, respectively. PSI-BLAST with the OMCL MP kernel (VIII) outperforms all other methods with the highest ROC50 score of 0.773. Figure 2 illustrates the ROC50 distribution of the combined kernels for the unlabelled protein family classification. The basic BLASTP (I) and PSI-BLAST (II) kernels cannot perform successfully in the absence of sufficient positive training data for a huge unlabelled protein database [7]. Therefore combining local alignment kernels may provide improvement for unsupervised protein family classification. As shown in Figure 2, both the OMCL NM (VI) and OMCL MP (VIII) kernels combined with the proposed PSI-BLAST kernel (II) consistently show superior performance and significantly outperform the other combined kernels.

Modified symmetry in protein classification
In the unsupervised setting of homolog detection, the simple score based kernels do not show very strong performance in comparison with the combined modified spectral alignment kernels. Incorporation of the modified symmetry based cluster correction improves the performance further (see Table 3) for unlabelled data. In comparison with the ROC and ROC50 scores shown in Table 2, all combined spectral kernels show better performance after modified symmetry-based enhancement in detecting homologs. The most striking observation from this result is the major impact of the modified proximity norm $d_c$, with ROC50 scores of 0.798 and 0.789 for the two combined PSI-BLAST spectral kernels (X, XII). Figure 3 shows the ROC50 distributions for all combined BLASTP and PSI-BLAST kernels after modified symmetry based corrections (IX, X, XI, XII). These results show that the PSI-BLAST kernel combined with the OMCL NM and OMCL MP kernels after modified symmetry based redistribution (X, XII) consistently outperforms the other combined kernels with higher ROC50 values. Figure 4 shows a family-by-family comparison of the ROC scores of the PSI-BLAST kernel combined with the OMCL NM and OMCL MP kernels (VI, VIII). The points fall nearly evenly above and below the diagonal, indicating similar performance of both methods. However, there are more points in the upper triangle of Figure 4, which suggests a slight superiority of the PSI-BLAST kernel combined with the OMCL MP kernel (VIII). Figure 5 shows the family distribution for ROC50 scores of the BLASTP kernel (I) and its improvement after combination with the OMCL MP kernel including modified symmetry based enhancements (XI). For most of the families, the BLASTP + OMCL MP kernel after modified symmetry based reassignment (XI) provides higher ROC50 scores than the simple BLASTP kernel (I). All the experiments demonstrate the utility of combined spectral kernel approaches with modified symmetry corrections in remote homolog detection.

Discussion
We have presented and experimentally evaluated twelve spectral kernels for remote homology detection that classify protein sequences, together with an explicit evaluation of the modified symmetry based proximity norm. These kernels measure sequence similarity on unlabelled data. For this unsupervised protein family classification approach, we focus on our spectral clustering approaches with combined local alignment score-based valid kernels. This approach performs competitively with state-of-the-art neighborhood [22] and profile [23] mismatch kernel methods. When we introduce modified symmetry in kernel space for homolog detection, our methods outperform earlier known cluster kernel methods in this setting.
Weston et al. [22,23] introduced the neighborhood and mismatch profile concepts on the BLASTP and PSI-BLAST scores earlier. However, they did not experiment with positive semidefinite kernels after singular value decomposition of BLASTP (I), PSI-BLAST (II) and the newly tested OrthoMCL scores for kernel formation (III, IV). After being combined with neighborhood similarity and mismatch profile features (V, VI, VII, VIII), our proposed Mercer kernels provide significant solutions after introducing modified symmetry based updating (IX, X, XI, XII) in the spectral clustering results.
Four major observations can be made by analysing the different experiments presented in this article. First, the direct use of local-alignment based BLASTP and PSI-BLAST scores to create a kernel matrix with singular value decomposition (I, II) proves to give a valid kernel for homology detection. Second, as discussed earlier, incorporation of previously detected OrthoMCL scores to reduce the "recent" paralog effects in BLASTP/PSI-BLAST results gains significance. The neighborhood similarity and the mismatch profile kernels over OrthoMCL scores (III, IV) also prove to be significant in comparison with earlier cluster kernels, reducing the diagonal dominance issue with an arbitrary lower-magnitude distribution of diagonal values. Third, we do not need to diagonalize the matrix of all labelled and unlabelled data as in [22]. Our spectral clustering implementation uses the leading eigenvectors of the kernel matrix, which improves the sensitivity over the all-vs-all local alignment scores for the global distance computation to all proteins without using any hard cut-off threshold. Implicit reduction of inter-cluster edges in spectral clustering also mitigates the promiscuous domain problem. It solves this problem without using any relaxation of the random walks, by restricting to a one-to-one allocation for all proteins among all families, whereas TribeMCL [20] used the inflation parameter as a relaxation over the random walks. Fourth, the modified symmetry based reallocation in kernel space proved to be biologically significant for excluding outliers, as discussed earlier. The intra-symmetrical clusters represent a more compact set of homologs based on their similarity scores in the kernel matrix. The nearest neighbors within the same cluster represent homologs with similar domains. A smaller distance to the nearest neighbor therefore signifies more compact clusters in kernel space, and the nearest neighbors in different clusters represent homologs in different domains.
Therefore detecting modified symmetry among multidomain homolog proteins classifies the protein to a cluster of proteins. The clusters show more accurate domain selection with closer nearest neighbor homologs expressing more biological significance.
Both the widely used cluster kernels [22] and OrthoMCL [19] produce efficient clusters even in the context of remote homolog detection in multi-domain protein families. This fact is reassuring to the validity of our approaches to capture more statistically significant protein clusters with biological relevance of modified symmetry correction.

Statistical performance evaluation
To evaluate the statistical significance of the differences in the performance observed among all spectral kernels, we perform Wilcoxon signed-rank tests on the area under the ROC50 curve of all simple score-based local alignment kernels, combined spectral kernels and the results after corrections with modified symmetry. Table 4 shows the outputs of this test. Method A outperforms method B according to the Wilcoxon test with $p < 0.05$. The signed-rank results show the expected trends of superiority of position specific scoring, modified symmetry based corrections and the OMCL MP kernel over the OMCL NM kernel. The median difference values between two methods in Table 4 show the successive improvement in the cluster results of local alignment kernels after combination and after modified symmetry based updates.
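As an illustration of what the signed-rank test measures, the following Python sketch computes the two rank sums for paired per-family scores of two methods. It is a simplified stand-in for the R routine used in the experiments: zero differences are dropped, tied absolute differences share their average rank, and no p-value is derived. The function name and the toy scores are hypothetical.

```python
def signed_rank_sums(a, b):
    """Return (W_plus, W_minus) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]        # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1                     # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical ROC50 scores of methods A and B on four families.
method_a = [0.80, 0.75, 0.90, 0.60]
method_b = [0.70, 0.78, 0.85, 0.60]
w_plus, w_minus = signed_rank_sums(method_a, method_b)
```

A rank sum that is heavily skewed towards positive differences indicates that method A tends to score higher than method B across families; the significance of that skew is what the reported p-values quantify.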

Quantitative performance evaluation
We evaluate the clustering solutions for all kernels objectively by measuring five validity measures, the Dunn, Davies-Bouldin, Kruskal, Rand and Jaccard indices, as defined in [45], [46], [47], [48] and [49] respectively, in Table 5. The Dunn validity index [45] shows increasing values for better performance. As a further quantitative evaluation, the PSI-BLAST kernel after modified symmetry based corrections, combined with the OMCL NM (X) and OMCL MP (XII) kernels, provides Dunn's index values of 0.041 and 0.068, respectively, in Table 5. Similarly, the Davies-Bouldin (DB) index [46], for which lower values are better, shows better clustering solutions with the combined PSI-BLAST kernel than with the combined BLASTP kernel, with values decreasing from 1.741 to 1.574 for the BLASTP + OMCL MP (VII) and PSI-BLAST + OMCL MP (VIII) kernels in Table 5, respectively.
The increasing values of 1:145e{2 and 1:327e{2 for Kruskal index [47] in Table 5 for OMCL NM (III) and OMCL MP (IV) kernels over those values 5:566e{3 and 5:826e{3 respectively for BLASTP (I) and PSI{BLAST (II) kernels, shows the significance of the Markov cluster similarity scoring kernels considering neighborhood similarity and mismatch profile respectively. The Rand index [48] shows the increasing superiority of clustering solutions for OMCL NM (III), BLASTP z OMCL NM (V) and PSI{BLAST z OMCL NM (VI) kernels respectively with increasing values of 8:020e{1, 8:263e{1 and 8:489e{1 in Table 5 for the quantitative evaluation. The better increasing values of Jaccard index [49] with 3:824e{2, 4:017e{2, 4:262e{2 and 4:289e{2 values in Table 5 for PSI{ BLAST OMCL NM (VI), PSI{BLAST OMCL NM z Modsym (X), PSI{BLAST OMCL MP (VIII) and PSI{BLAST OMCL MP z Modsym (XII) kernels respectively further show the significance of modified symmetry based corrections over the clustering solutions provided by the combined local alignment spectral kernels. This shows superiority of the combined kernels even over local alignment kernels proving OMCL MP kernel more significant than OMCL NM kernel.
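As an illustration of how two of these external validity measures behave, the following minimal sketch computes the Rand and Jaccard indices in their common pair-counting form. It assumes that a reference partition (for example, known family labels) is available; the exact definitions in [48] and [49] may differ in detail, and the function names and the tiny example are hypothetical.

```python
from itertools import combinations

def pair_counts(pred, truth):
    """Count item pairs by agreement between two partitions.

    ss: same cluster in both; sd: same in pred only;
    ds: same in truth only;  dd: different in both.
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            ss += 1
        elif same_pred:
            sd += 1
        elif same_true:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(pred, truth):
    """Fraction of pairs on which the two partitions agree."""
    ss, sd, ds, dd = pair_counts(pred, truth)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(pred, truth):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    ss, sd, ds, _ = pair_counts(pred, truth)
    denom = ss + sd + ds
    # Convention of this sketch: two all-singleton partitions agree fully.
    return ss / denom if denom else 1.0

if __name__ == "__main__":
    # Hypothetical cluster assignments and reference families for four proteins.
    predicted = [0, 0, 1, 1]
    reference = [0, 0, 1, 2]
    print(rand_index(predicted, reference), jaccard_index(predicted, reference))
```

Because the Jaccard form discards pairs that are separated in both partitions, it is typically much smaller than the Rand index when there are many small families, which is consistent with the different magnitudes of the two indices reported in Table 5.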

Comparative performance evaluation
We compare the clustering solutions of our proposed kernels with those of the existing linear [44], mismatch [22] and string [28] kernels. We apply the mismatch [22] and string [28] kernels to the BLASTP, PSI-BLAST and OMCL matrices to obtain their ROC50 scores, which are shown in Table 6. The ROC50 scores of our proposed kernels in Tables 1, 2, 3, 4 are higher, indicating superior performance. Similarly, to experiment with a large dataset, we run all our proposed kernels as well as the state-of-the-art linear [44], string [28] and mismatch [22] kernels with an SVM on the existing dataset of 54 families from SCOP version 1.59 [23]. We run the existing linear [44] and string [28] kernels on this dataset and compare them with the existing results of the spectrum mismatch kernel [22]. We also run our proposed BLASTP, PSI-BLAST, OMCL NM and OMCL MP kernels on this dataset, and compare the kernel outputs again after the modified symmetry based enhancements. The ROC50 scores of the clustering solutions provided by all algorithms are included in Table 7. On this dataset too, our proposed kernels achieve higher ROC50 scores than the existing kernels.
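For readers unfamiliar with the metric, the following minimal sketch shows one common way to compute a ROC50 score: the area under the ROC curve up to the first 50 false positives, normalized to lie between 0 and 1. The function name `roc_n`, the handling of lists with fewer than `n` negatives, and the example data are assumptions of this sketch, not details taken from the experiments above.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: higher means more likely homologous.
    labels: truthy for true homologs, falsy otherwise.
    Assumptions of this sketch: tied scores keep their input order, and when
    fewer than n negatives exist the area is normalized by the number present.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, y in ranked if y)
    total_neg = len(ranked) - total_pos
    limit = min(n, total_neg)
    if total_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative")

    tp = fp = 0
    area = 0
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked above this false positive
            if fp == limit:
                break
    return area / (limit * total_pos)

if __name__ == "__main__":
    # Hypothetical similarity scores and homology labels (not from the paper).
    scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    labels = [1, 0, 1, 0, 0]
    print(roc_n(scores, labels, n=2))
```

A score of 1 means every true homolog is ranked above the first false positive, and a score of 0 means no true homolog appears before the cutoff.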

Conclusions
Tools for detecting homologous protein families within complete Hemiascomycete yeast genomes are valued in genomics for detecting the conservation of function. We therefore propose a computational approach for computing local alignment based Mercer kernels that utilize Markov similarity to reduce "recent" paralog effects. Introducing profile mismatching and neighborhood feature vectors into combined Mercer kernels for spectral clustering effectively improves remote homology detection from databases of unlabeled protein sequences. We test corrections based on the modified symmetry based proximity norm, which produce improved clusters with fewer outliers/singletons and select more biologically significant domains for multi-domain proteins. Our position specific scoring kernel, combined with the modified symmetry based corrections, achieves state-of-the-art prediction performance in the context of unsupervised homology detection. When combined with Markov cluster similarity kernels in the well-known neighborhood feature space, and considering neighborhood mismatch based on profiles, this approach outperforms other cluster kernels. Therefore, for detecting homologs among multi-domain proteins, our spectral clustering approach with combined local alignment kernels yields clusters with greater biological significance. We suggest that this is due to the incorporation of the modified symmetry based corrections in kernel space.