Accelerating Information Retrieval from Profile Hidden Markov Model Databases

Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is an increasing interest to improve the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have been focusing on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for using batch query searching approach, are strong motivations that call for further enhancement of information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than focusing on the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities. A representative for each cluster was assigned. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41%, and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.


Introduction
With the exponential growth of biological sequence data, there has been an increasing interest in classifying records that share similarities in sequences, functions and structures together into families [1]. The representation of the protein, DNA, or RNA sequences into families has proven very useful to gain insight into the biological functions, molecular mechanisms, and evolutionary relationships of these biological molecules. Furthermore, detection of remote homology is more sensitive when a sequence query is searched against a family database than a combined with different searching algorithms to increase speed and save several minutes to hours for large batch queries, with an acceptable cost on recall value.

Materials and Methods
The profile-HMMs of the TIGRFAMs database [6] were used to demonstrate the proof-ofconcept of our proposed approach. We used TIGRFAMs release 13.0 with 4284 families. The 4284 profiles represent a total of 55503 protein sequences. For each protein family, TIGRFAMs has three components. The first component is the set of protein sequences that belongs to the family. These proteins are stored as multiple sequence alignments (MSA) and are usually referred to as seed sequences. The second component is the profile-HMM for the MSA of the first component. The third component is the associated information designed to support the automated functional identification of proteins by sequence homology.
HHsearch tool [2] was used for building profile-HMMs from the input MSA using hhmake script. To create profile-HMM databases for the representatives of each cluster data set, we used hhblitsdb.pl script. The script hhsearch was used to search and retrieve matches from the created profile-HMMs databases for a given input query.
Protein sequence data from the UniRef50 data set [24] were used for parameter tuning and validation. UniRef50 is one of the protein cluster data sets that are available through Uniprot database [25]. The seed sequences of each UniRef50 cluster have at least 50% sequence identity and 80% overlap with the longest sequence in the cluster. For parameter tuning purposes we randomly selected 100 sequences from the UniRef50 data set. However, for final validation, we randomly selected 6000 sequences.
Implementation was carried out using MATLAB [26] in addition to shell scripting on Cen-tOS Linux operating system. Shell scripting was used in order to effectively utilize the number of cores in the server in a multi-threaded manner. The server has 32 cores (2.2 GHz) with 128G as a total RAM size.
The testing phase was carried out using a single core computer in order to study the performance of the proposed approach when running it on low computational power servers.

Algorithms
In this section, we present two approaches for building the retrieval system based on the both 'crisp' and 'overlap' clustering.

Crisp clustering approach
This approach consists of four steps. A detailed explanation of each step is discussed below: Retrieval of protein family information. In this step we retrieve all the sequence seeds that are used to build the profiles in TIGRFAMs database. We use these seeds to create a Profile-HMMs using hhmake script in HH-SUITE [2]. The reason to create profiles using HH-SUITE instead of using the TIGRFAMs is to increase the homology detection sensitivity as recommended by Söding [2] in the HH-SUITE manual.
Build a similarity matrix. Assume Λ = [λ 1 , λ 2 , . . ., λ N ] is a database of N profile-HMMs. An iterative process is used to compare each profile λ i against all profiles in Λ. The local alignment option of HHsearch was used as it can produce more similarity results and hence increases the chance of generating clusters of profiles. This comparison process produces an output with N scores. The scores Each represents the profile-profile comparison score between λ i and each profile in Λ. These scores are identified based on a measure that was developed by Söding, which was called probability. This measure is calculated not only based on the resulted E value of the search, but it also takes into account the secondary structure similarity. According to Söding, using the probability measure is more sensitive to identify homology. The probability can range from 0 to 100% and when it is larger than 95%, the homology is nearly certain [2]. We adopted this score as the similarity score throughout this paper.
These scores are used to produce an all-against-all similarity matrix S. The matrix S has N × N elements, where the element s ij = score(λ i , λ j ) represents the similarity score between query λ i and template λ j . It is important to notice that score(λ i , λ j ) 6 ¼ score(λ j , λ i ), i 6 ¼ j, because the score depends on the length of the query. As S is an a symmetrical matrix, we decided to use the approach in [27] by only storing the maximum score of s ij and s ji .
Clustering. At this point, the matrix S is used as input for the clustering process. S is considered as a data set of N samples, each sample has N dimensions.
Three different clustering algorithms, k-means, hierarchical, and connected component were used.
Assignment of representative profile. A representative γ i for the cluster c i is selected using two alternatives. The first method (which we denote as MSA-representative) builds the representative by retrieving all profile seeds in c i and creating their multiple sequence alignment (MSA) before creating a profile-HMM based on this MSA. The second method (which we denote as heuristic-representative) is to select a profile λ i from c j as a representative for cluster j using a heuristic function f. This function finds the profile in the cluster that has the highest homology to all other profiles as follows: Given a cluster of size T. The similarities among its profiles are in S 0 . Where S 0 is a subset of S. The function f(λ i ) returns a value that represents the likeliness of having λ i , as heuristic-representative.
The heuristic-representative is calculated by selecting the profile with the maximum similarity summation among all profiles in the cluster (i.e. argmax(f(λ i ))). Fig 1 shows the diagram of the proposed technique. The system first aligns the input query λ q to Γ, which is a data set of the Representative profiles to obtain the best match. Then, the system performs a search inside the cluster from which the best matching profile originated in order to find similar profiles. The final output is a list of similar profiles in the database that matches the input query using local alignment with a similarity score greater than a threshold ϕ.

Overlapping clustering approach
In order to enhance the sensitivity of searching a further step was introduced to the four steps mentioned above. The enhancement was achieved by proposing an extended step that allows overlapping among clusters. The overlapping was made by assigning a representative for each cluster as in Section 2.1, then re-assigning the profiles in the database to clusters as follows: Suppose that C = [c 1 , c 2 , . . ., c k ] are the k clusters that are produced after clustering Λ with any of the three clustering algorithms, and Γ = [γ 1 , γ 2 , . . ., γ k ] are the k representatives for clusters in C.
Then, each profile λ j is assigned to cluster c i in C as Eq (2) shows.
where ϑ is a self-regulated threshold that is calculated as in Eq (3). This threshold is computed as the mean value of similarity vector between λ j and profiles in Γ. The self-regulated threshold guarantees that each profile is assigned to at least one cluster.
The above is one heuristic way to select the threshold for the overlapping. Nevertheless, this opens other ideas that can be researched in the future.
Although sequential techniques are suppose to find the most similar profiles, they are slow since they perform N comparison operations for a given search in a database of N profiles. On the other hand, indexing speeds up the search time but may cause the query profile to be mismatched with the correct cluster.
Three scoring stages exist in our system. The first one is when comparing the query with the representatives in Γ, which costs K comparison operations. In the second stage, for the best representative match γ i , the technique starts a sequential search inside c i cluster to find the matches for the input query. In the final stage, a process of sequential searching is performed within singleton clusters [28] (clusters having one profile). For an singletons, the technique will perform an additional M comparisons. The total number of comparisons needed for retrieving a query is denoted by NC as shown in Eq (4).
where c i is the cluster of the most similar profile to the query. In Eq (4), the ideal case will be achieved if all clusters have the same size and M = 0. On the other hand, the worst case is when the clustering algorithm produces only one cluster which eventually exceeds the sequential search due to the indexing overhead. In this work, we tune the parameters of the clustering approaches empirically to have a balance between the number of clusters K and the size of each cluster as shown in section 3.1. It is important to note that our proposed overlapping extension step guarantees that an M = 0 case is achieved.

Evaluation criteria
The evaluation of the proposed retrieval technique was performed based on two criteria. The first one is the reduction in time (RT), which is the ratio of the time to accomplish retrieving a query from S in comparison to the sequential search time, subtracted from 1 as shown in Eq (5).
where S = [s 1 , s 2 , . . ., s T ] is a query set of size T, α is the search time using the proposed technique and α 0 is the search time using sequential technique. The second criterion is Recall, which is the ratio of the number of relevant matches retrieved by the proposed technique to the total numbers of relevant matches, where β is the set of matches using the proposed technique and β 0 is the set of matches using the sequential technique.
In information retrieval contexts, Recall is usually accompanied with Precision, which detects the number of irrelevant matches.
In the case of information retrieval from Profile-HMM Databases, only highly similar matches are considered as homologous matches. For example, using local alignment in the sequential approach in [2], a similarity scores that are above or equal to 95% are considered as homologous matches. We follow this notion in the current research and we assume that the sequential retrieval results are all relevant. The set of retrieved matches in the our proposed method will always be a subset of the matches in the sequential technique. Consequently, Precision is not taken into consideration in this research.
To generalize the concept for calculating Recall for each experiment, we formulate Eq (7), which reflects the ratio between the total number of matches found by the retrieval technique to the total number of matches found using the sequential search.
The reduction in time and recall ratio are measured during the retrieval process of any input query. We have not taken into account the time for establishing the clusters nor the time for establishing the representatives in our experiments. This is because the establishment process is done only once. It is still worth mentioning that the establishment process is time consuming. For example, in the case of k-mean clustering, more than 300 hours were required to build the MSA-representative when k = 160.

Results and Discussion
Experiments were carried out to study the following: The performance of the three clustering algorithms; approaches for assigning representative, and the effect of introducing overlapping.

Parameter tuning
In order to select the best values for the clustering parameters number of clusters (K), cutoff the hierarchical tree (z), and edge weight (ψ), we used an independent data set of protein sequences. These sequences were randomly retrieved from UinRef50 in order to minimize any possible bias towards data from our profile-HMM TIGRFAM database. In the experimental details section, we provided detailed results of different experiments that aimed at tuning clustering parameters. The selected values of the three parameters are shown in Table 1. It is important to underscore that the obtained parameters are specific to TIGRFAM release 13, which was used for validating our approach. Therefore, a different database would require re-tuning these parameters.

Assignment of representatives
During the tuning of the parameters of each clustering approach, we assigned representatives for each cluster using heuristic-representative approach. In this part of the experiment, we studied the other alternative of selecting the representative which is MSA-representative.
Six different comparisons were performed to cover all combinations of having three clustering algorithms and the two methods for choosing the representatives. We used the same parameter values as in Table 1 Figs 2 and 3 show the evaluation for each clustering algorithm using the two methods of assigning representatives.
In the case of k-means and connected component algorithms, better recall values were reached when a heuristic-representative was used compared to MSA-representative. There was no significant difference in recall value between heuristic-representative and MSA-representative when using the hierarchical clustering algorithm. These results are consistent with the cluster distributions for each approach in Figs 4-6. The maximum cluster size in the hierarchical algorithm was 57, while it equals 1818 and 942 for k-means and connected component, respectively. As for reduction in time, both ways of selecting the representative lead to a similar reduction in time. So, the method of selecting the representative has a greater effect on accuracy than on time reduction.

Application of overlapping
In crisp clustering techniques, each item is assigned to only one of the existing clusters. It is possible to have items shared between more than one cluster by introducing overlapping. This is suitable in our case because we employ local alignment, where one or more local segments from the same profile can be similar to different profiles segments. However, the profile similarity is not transitive through the scoring process that was used. This means if score(λ a , λ b ) ! ϕ, and score(λ b , λ c ) ! ϕ under a given threshold ϕ, it does not necessarily mean that score(λ a , λ c ) ! ϕ.
If we use crisp clustering, λ a , λ c may be assigned to two different clusters and λ b is assigned to one of them. However, λ b should be assigned to both clusters at the same time because it is similar to both of them. This is handled by the overlapping technique. Figs 7-9 show how the clusters were distributed after applying overlapping. The figures show that the overlapping has led to a normal distribution where we have small frequencies when considering the very large and the very small clusters compared with the other clusters in the same experiment.
Figs 10 and 11 illustrate the score measures after applying overlapping to all selected clustering algorithms. Again, this technique is tested with the best experiments for each clustering algorithm, which are used in Section 3.2. Fig 10 shows that all clustering algorithms achieve better recall after applying the overlapping. It is worth noting that the hierarchical clustering algorithm was enhanced by 12.5% degrees and achieves 100% recall for the input test data, while k-means achieved 99.45% recall with an enhancement of approximately 6%. Lastly, connected component achieved an additional 2% enhancement compared to the crisp clustering technique.
In Fig 11, we can see that the hierarchical and the connected component algorithms with overlapping has little impact on the results in terms of reduction in time. In the case of the kmeans algorithm, the reduction in time is better without overlapping to high extent. This major change in reduction in time can be explained by comparing Figs 8 and 5. In Fig 5, kmean has clustered the profiles without creating any singletons. However, in Fig 8, where overlapping is introduced to k-means, the size of clusters has increased in general due to having the same profiles assigned to many clusters. This caused an overhead in the computation time.

Extended evaluation with larger data-set
The previous experiments were conducted for tuning the system parameters. In this section, we carried out a set of experiments with the best-obtained parameters. A larger dataset of 6000 UniRef50 sequence was involved. This dataset was selected randomly from all the UniRef50 data set. We study the recall and reduction in time under different query sizes ranging from 1000 to 6000 sequences.   Table 2 shows the recall for the three different clustering algorithms using different query sizes. We can see that hierarchical clustering has the best average recall value among the three clustering algorithms. Since a larger dataset is involved here, the results are more reliable than the results in the tuning phase. Still, each algorithm occupied its relatively recall order among the others. Table 3 shows the reduction in time for each algorithm. A higher variety of sequences with widely different sequence lengths were involved in the experiment. When comparing the reduction in time among the three clustering approach, we can see that the values are very   close to each others. This is because the three clustering approaches have highly similar cluster distribution after applying the overlapping to them as seen in Figs 7-9.

Time Reduction
The proposed approach aims to minimize the retrieval time in profile-profile approach. An experiment to handle a batch query of 1000 sequences took 15.3 hours using the sequential search while it required 9.13 hours with our proposed approach.

Conclusion
This work proposed a novel information retrieval approach to accelerate the search in profile-HMM databases for homology detection. Unlike the existing techniques, we shift our focus from improving the searching algorithm to reducing the searching scope by representing the database in the form of clusters. From the many existing clustering approaches, we selected three commonly used ones in order to prove the main concept behind this study. These approaches are: k-means [29], hierarchical [30], and connected component [31]. These approaches are known as crisp clustering approaches [32]. This means that each profile in the dataset is assigned to only one cluster. To enhance the retrieval process, we introduced an extension step to the clustering process that allows overlapping among the clusters. The overlapping step has enhanced the recall of the system. This was clear in the hierarchical clustering algorithm which produced the best recall over the two other approaches.
When applying connected component and hierarchical clustering, we noticed that many profiles are singletons. which increased the search time. This paper introduced an overlapping technique to solve the search overhead caused by the singletons and to enhance the recall in data retrieval. It is important to emphasize that overlapping among clusters should not be taken into consideration when studying the biological similarities among profiles.
It is important to point that the proposed technique is highly expandable when new profile-HMMs are included in the database. This can be performed by applying the overlapping step to each new profile, and update the related clusters without the need to re-cluster the whole database. Finally, the proposed system can be used to accelerate any other alignment tool other than HHsearch.

Setup Time
The establishment of the matrix S as well as creating the database of clusters were performed in a multithread manner (32 threads) to obtain it at high speed, while the other steps were carried out using a single core computer in order to study the performance of the proposed approach when running it on low computational power server. and building the clusters databases. In building the clusters database, the listed time is the overall turnaround time not the summation of thread's time. The resulting clusters which are taken from the leaf nodes of the hierarchy based on the cutoff value, are distributed as in Figs 12,4 and 13. The x-axis displays the size of clusters that are produced by this clustering algorithm, while the y-axis shows clusters frequency for each size. In Fig 12, this algorithm produced 1944 clusters that hold only one profile, while 18 clusters hold 5 profiles and so on. From these figures, we can see that the case of singleton clusters appears in this clustering algorithm. When increasing z, the number of singletons decreases and, consequently, the size of clusters increases.

Parameter Tuning for hierarchical clustering
If we ignored the singletons, we would have 895, 818, and 752 clusters as a final output for each value 0.5, 0.9, and 1.1, respectively. After preparing all the clusters, the methods for choosing the representatives using a heuristic function were used for testing the technique. The measurement scores for different z values appear in Figs 14 and 15. The measures show that a medium value of z gives the best recall value of 87.43%, while the reduction time is 39.26%. The best reduction time obtained by minimum z with a score value of 41.85% and recall value of 86.88%.

Parameter Tuning for k-means clustering experiments
A set of experiments was performed to tune the k values (the number of clusters). The squared Euclidean distance was used to calculate the centroid for each cluster. Each centroid cannot be k-means algorithm initializes the centroid seeds randomly and this can lead to poor clustering. One way to avoid this problem is to have the clustering process repeated many times and obtain the best clustering pattern. However, to avoid repeating the clustering many times, we relied on k-means++ algorithm [33] under MatLab. This algorithm uses a heuristic process for spreading out the initial centroid seeds. When the number of clusters increases, the number of small clusters also increases, bearing in mind that there is always one relatively large cluster. The size of clusters is inversely proportional to the number of clusters. Finally, we stopped the experiments when we reach k = 600 The best scores were achieved when k = 100, with recall value being 91.80%, and the reduction in time being 83.25%. As the number of clusters increased, both recall and reduction in time were negatively affected. It is logical for there to be a decline in the reduction of time measurement as the number of clusters increases, since this increase also causes an increase in the number of representatives, which would necessitate more comparisons for each query. In addition, having more clusters usually leads to higher possibility for the query to miss the target cluster i.e. (decline in the recall value) Since the best results were achieved when the value of k is at the boundary, another set of experiments was run to cover a closer interval for the best peak point. These results appear in Figs 24 and 25, and shows that the k = 100 region was considered a good local maximum interval, with the best recall value in that region being when k = 110 with a score of 92.34%. Out of  all k values, the best recall value was achieved when k = 160 with a score of 93.44%, and a reduction in time of 83.53% and 82.75% for k equals 110 and 160, respectively.
Another point worth noting is that as K decreases the reduction in time measurement is negatively affected. The number of representatives is small, which requires a fewer number of comparisons to be made between the query and representatives. However, the size of the clusters increases, which increases the number of comparisons required for each query. A cluster distribution when k = 160 is displayed in Fig 5.

Parameter Tuning for connected component clustering
In connected component clustering algorithm, the profiles are represented as vertices in an undirected graph, while the similarity between any two profiles in this graph is presented as a weighted edge from S. Here, clustering is done by dividing the graph to sub-graphs by removing all the edges that have a weight less than a given threshold ψ. In this section, we applied connected component clustering. Three values for ψ were used as a threshold; these values are  99%, 95% and 90%. The reason for using these three values is to observe how the technique will behave when a threshold goes from the maximum value possible to a value that is lower than the default similarity score which is 95% (see Section 2.3).
The cluster distribution for these experiments is shown in Figs 26, 6 and 27 respectively. From these experiments, The decrease in threshold will leads to decrease the number of clusters, a number of singletons and increase the clusters size. This is logic since relaxing threshold value means fewer edges need to be removed.  When ignoring singletons, the results become as follows: 548 clusters when ψ = 99%, 434 clusters when ψ = 95% and finally 379 clusters when ψ = 90% as a final output.
The score measurements for all ψ testes values are shown in Figs 28 and 29. The best recall value appears when ψ equal to 95% and 99% with value 89.61%, while the reduction time scores better when the threshold is 95% with value 44.38%.
Supporting Information S1 File. Support materials for executing the experiments. A compressed (.zip) file that contains all the needed scripts for executing any experiment of this work is provided. The similarity matrix is also included. The README file contains the needed details to run these scripts. (ZIP)