Large-Scale Evaluation of Molecular Descriptors by Means of Clustering

Molecular descriptors have been explored extensively. From these studies, it is known that a large number of descriptors are strongly correlated and capture similar characteristics of molecules. In this paper, we evaluate 919 Dragon descriptors from 6 different categories by means of clustering. We analyze these different categories of descriptors and also determine a subset of descriptors which are least correlated with each other and, hence, characterize molecular graphs distinctively.


Introduction
Molecular descriptors map molecular structures to the reals by taking physical, chemical or structural information into account [1]. A large number of descriptors have been developed to describe different properties of molecular graphs. These descriptors can therefore be classified into different categories depending on what kind of information (e.g., physical, chemical or structural) is used to define such a measure. The commercial software package Dragon [2] (version 6.0.26) contains 4885 molecular descriptors which are classified into 29 categories.
The problem of analyzing molecular descriptors by applying clustering techniques has already been explored [3][4][5][6]. These studies are usually based on principal component analysis (PCA) and correlation-based methods for identifying distinct descriptors. For example, Todeschini et al. [6] and Basak et al. [3] evaluated descriptors on a rather small collection of molecular graphs using PCA and ranked them based on their intercorrelation. In order to find similarities between molecular descriptors, Basak et al. [4,5] used a PCA-based clustering technique on both a hydrocarbon dataset and mixed chemical compounds. Taraviras et al. [7] performed a cluster analysis with 240 descriptors by using different clustering algorithms. The weak point of the approaches just sketched is that the corresponding studies have not been performed on a large scale (i.e., with large data sets) and with distinct descriptors belonging to several categories. Also, the optimal number of different descriptors (the dimension) has not been validated statistically. In this paper, we overcome these problems.
A thorough evaluation of the vast amount of developed descriptors [1] is required to identify categories of descriptors which capture structural information differently. In our analysis, we evaluate 6 categories (see next section) of structural descriptors by means of clustering. The main contribution of this paper is to explore the dimension of the descriptor space, i.e., how many different descriptors exist among all those which have been introduced so far. Here, we put the emphasis on 919 structural descriptors from Dragon. In particular, we find that only very few descriptors are distinct. In this context, this means that they are least correlated and, therefore, capture structural information differently.

Molecular Descriptors
To perform our study, we used six categories of descriptors implemented in Dragon (version 6.0.26), which are defined as follows:
1. Connectivity indices [1]: These indices are calculated from the vertex degrees of a molecular graph. The Randić index [8] is a prominent example thereof.
2. Edge adjacency indices [1]: These indices are based on the edge adjacency matrix of a graph. The resulting descriptor value is the sum of all entries of the edge adjacency matrix of a graph. Balaban et al. [9] developed several indices by using graph-theoretical matrices.
3. Topological indices [1]: These are structural graph measures which take various structural features into account, e.g., distances and eigenvalues. The term topological index was first coined by Hosoya [10]. The first and second Zagreb indices [11] are prominent examples thereof.
4. Walk path counts [1]: These indices are defined by counting paths or walks of a graph. Here, the term walk refers to random walks, which are based on a probability measure. Such indices have been listed by Todeschini and Consonni [1].
5. Information indices [1]: These measures are based on Shannon's entropy. To assign a probability value to a graph, Dragon uses so-called partition-based methods [12], in which several graph invariants such as vertices, edges, vertex degrees and distances are used [12]. The so-called topological information content [13] and the Bonchev-Trinajstić index [14] are prominent examples of partition-based information indices. So-called partition-independent information-theoretic measures for graphs have been developed by Dehmer [12].
6. 2D matrix-based descriptors [1]: These descriptors are calculated from the elements of so-called graph-theoretical matrices [15] by using several algebraic operations. The Balaban-like indices inferred from the adjacency matrix [2,9] are important examples of this category.
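Several of these vertex-degree based measures are simple enough to compute directly. The following minimal sketch (pure Python; the example graph and helper names are our own illustration, not taken from Dragon) computes the Randić index and the first and second Zagreb indices from an edge list:

```python
import math

# Illustrative molecular graph: the carbon skeleton of 2-methylbutane,
# given as an undirected edge list over vertices 0..4 (our own example).
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]

def degrees(edges):
    """Vertex degrees of an undirected graph given as an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def randic_index(edges):
    """Randic connectivity index: sum over edges of 1/sqrt(deg(u)*deg(v))."""
    deg = degrees(edges)
    return sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges)

def zagreb_indices(edges):
    """First Zagreb index (sum of squared vertex degrees) and
    second Zagreb index (sum over edges of deg(u)*deg(v))."""
    deg = degrees(edges)
    m1 = sum(d * d for d in deg.values())
    m2 = sum(deg[u] * deg[v] for u, v in edges)
    return m1, m2

print(randic_index(edges))
print(zagreb_indices(edges))
```

For this example graph the degrees are (1, 3, 2, 1, 1), giving the Zagreb indices M1 = 16 and M2 = 14.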
We want to emphasize that the term 'topological indices' is misleading and ambiguous here. For example, typical information indices are based on structural features of a graph by using Shannon's entropy, so they represent topological indices too. The same holds for all other groups which have been defined by using structural features of molecular structures; therefore, they are topological indices as well, see [1,9,16-19].
To perform our analysis, we calculated the descriptor values for the three data sets used in this study. We removed those descriptors which give constant or erroneous values on the three data sets. Erroneous values are produced by those descriptors for which a descriptor value of a network cannot be calculated without additional physical or chemical information. After this step, the above mentioned six categories contain 24, 301, 57, 28, 40 and 469 descriptors, respectively.

Clustering Techniques
Clustering is an unsupervised learning technique which aims to find different groups or clusters of objects in data [22]. The groups are described as collections of objects which are closer to each other than to the rest of the objects [22]. An example thereof is hierarchical clustering, where the groups of objects are arranged in a hierarchical order by a so-called dendrogram. Objects which are clustered in one group have a higher degree of similarity than objects which are clustered in different groups. Thus, a resulting clustering solution allows one to determine clusters where each cluster shows a distinct property of the data. The similarity or dissimilarity between two objects is usually determined by using a similarity/distance function which measures the similarity/distance between the data points of different objects. Examples are the Euclidean distance, the Manhattan distance or the correlation-based distance. Several algorithms have been developed for cluster analysis [22]. These algorithms can be divided into several categories, namely partition-based clustering, hierarchical clustering, density-based clustering, grid-based clustering and fuzzy clustering [22,23]. k-means, soft k-means and k-medoids clustering [22] are examples of non-hierarchical (partition-based) clustering methods. Hierarchical clustering itself can be divided into two categories called agglomerative and divisive clustering [22]. As known, several concrete methods thereof have been developed, such as single linkage, complete linkage and average linkage, see [22].
In order to evaluate the descriptors, we perform hierarchical clustering (average linkage) by using the mentioned Dragon descriptors and a distance measure based on the Spearman rank correlation. Here, we denote the correlation matrix between descriptors by S. Then, the distance between a pair of descriptors i and j is defined by d(i, j) = 1 - S(i, j).
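The distance just defined can be sketched in a few lines of pure Python. The helper names below are our own; the convention d = 1 - S(i, j), with S the Spearman rank correlation, follows the definition above:

```python
def ranks(x):
    """Average ranks (1-based, ties receive their mean rank)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        # group positions holding equal values
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def descriptor_distance(x, y):
    """Distance between two descriptor value vectors: d = 1 - S(i, j)."""
    return 1.0 - spearman(x, y)
```

Two descriptors whose values are related by any monotone transformation thus obtain distance 0, which is exactly the behavior wanted here: such descriptors rank molecules identically and carry the same structural information.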
In order to choose a clustering method, we use the cophenetic correlation measure [24]. A high correlation coefficient shows that the distances between the data points are well preserved by the dendrogram of the hierarchical clustering solution. We calculated the cophenetic correlation for seven hierarchical clustering algorithms, namely the Ward, single, complete, average, McQuitty, median and centroid method. In our analysis, the cophenetic correlation coefficient is highest for the average linkage solution for all three data sets compared to the other clustering algorithms. The cophenetic correlation coefficients of the average linkage solutions for the three data sets are 0.84, 0.89 and 0.93.
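This model-selection step can be reproduced with standard tooling. The sketch below (using SciPy; the random data merely stands in for a descriptor distance matrix and is not one of the paper's data sets) compares the cophenetic correlation coefficient across the seven linkage methods; note that SciPy names the McQuitty method "weighted":

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))   # 30 objects, 5 features (placeholder data)
Y = pdist(X)                   # condensed pairwise Euclidean distances

# "weighted" is SciPy's name for the McQuitty method
for method in ["ward", "single", "complete", "average",
               "weighted", "median", "centroid"]:
    Z = linkage(Y, method=method)   # hierarchical clustering
    c, _ = cophenet(Z, Y)           # cophenetic correlation coefficient
    print(f"{method:>9}: {c:.3f}")
```

The method with the highest coefficient preserves the original pairwise distances best; in the paper's analysis this was average linkage for all three data sets.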

Cluster Validity
Cluster validity [23,25] is used to evaluate the quality of a clustering solution (obtained by using a certain clustering algorithm), e.g., to determine the optimal number of clusters in the data, or whether the resulting clustering solution fits the data. Known cluster validation techniques are divided into three categories, namely internal, external and relative validity criteria. External validation criteria evaluate clustering solutions with respect to a predefined clustering structure. Internal validation criteria relate to finding the optimal number of clusters based on the intrinsic structure of the data. Relative validation criteria are used to compare two different clustering solutions [23].
In order to perform our analyses, we use external and internal clustering validation criteria. For the external validation, we compared the clustering solution with a predefined group of clusters which serve as reference clusters. The external clustering validity of a clustering solution with respect to the given reference cluster is estimated by using the information-theoretic quantity NMI_max (normalized mutual information) [26,27], defined by NMI_max(U, V) = I(U, V) / max(H(U), H(V)), where I(U, V) denotes the mutual information between the clustering solutions U and V, and H(U) and H(V) denote their entropies. Hereby, we assume that we have two clustering solutions U and V which have R and C clusters, respectively. The overlap between these two clusterings is shown in the contingency Table 1. We calculated NMI_max for all three data-sets with different numbers of clusters.
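A minimal sketch of NMI_max as defined above (pure Python, helper names our own; the natural logarithm is used, which cancels in the ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information I(U, V) between two labelings of the same objects."""
    n = len(u)
    joint = Counter(zip(u, v))          # contingency-table counts
    pu, pv = Counter(u), Counter(v)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) )
        mi += (c / n) * math.log(c * n / (pu[a] * pv[b]))
    return mi

def nmi_max(u, v):
    """NMI_max = I(U, V) / max(H(U), H(V))."""
    denom = max(entropy(u), entropy(v))
    return mutual_information(u, v) / denom if denom > 0 else 1.0
```

NMI_max equals 1 when the two clusterings coincide (up to relabeling) and 0 when they are independent, which is what makes it suitable for comparing the predicted clusters with the six reference categories.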

The Optimal Number of Clusters
The optimal number of clusters (internal cluster validity) is determined by consensus clustering [27,28], which has been performed here as follows. Assume we evaluate N descriptors on a data set containing n molecular graphs. Thus, we get n descriptor values for each descriptor. First, we resample the data with sample size p < n, B = 100 times, to generate B clustering solutions U_k = {U_k^1, U_k^2, ..., U_k^B} for k clusters, where k = 2, 3, ..., 200. After that, we calculate the consensus index for each number of clusters k, defined as the average agreement over all pairs of the B clustering solutions, CI(k) = 2/(B(B-1)) sum_{i<j} AM(U_k^i, U_k^j). As the agreement measure AM, we use the adjusted Rand index ARI [29].
The number of clusters k for which CI attains its maximum is chosen as the optimal number of clusters, namely k_opt = argmax_k CI(k).
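The two ingredients of this internal validation, the ARI and the pairwise consensus index, can be sketched as follows (pure Python, helper names our own; the averaging over all pairs follows the CI definition above):

```python
from collections import Counter
from itertools import combinations

def adjusted_rand_index(u, v):
    """Adjusted Rand index (ARI) between two labelings of the same objects."""
    def comb2(m):
        return m * (m - 1) // 2
    n = len(u)
    joint = Counter(zip(u, v))                       # contingency counts
    sum_ij = sum(comb2(c) for c in joint.values())   # agreeing pairs
    sum_a = sum(comb2(c) for c in Counter(u).values())
    sum_b = sum(comb2(c) for c in Counter(v).values())
    expected = sum_a * sum_b / comb2(n)              # chance agreement
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

def consensus_index(solutions):
    """Consensus index CI(k): average pairwise agreement (here: ARI)
    over the B clustering solutions obtained for a fixed k."""
    pairs = list(combinations(solutions, 2))
    return sum(adjusted_rand_index(u, v) for u, v in pairs) / len(pairs)
```

Running `consensus_index` for each candidate k on the resampled clustering solutions and taking the (first local) maximum yields the optimal number of clusters, as described above.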

Determining a Highly Correlated Subset of Descriptors
Let D be a set of descriptors and |D| its cardinality, and let S be a subset of D. The selected |D| = 919 descriptors can be reduced to a set of descriptors S, a proper subset of D. The remaining |D| - |S| descriptors have a significant correlation with at least one of the descriptors in the set S, whereas the descriptors in S are not significantly correlated with each other. If two descriptors show a significant correlation with each other, then we conclude that they capture structural information similarly. In order to assess the significance of the correlation between two descriptors, we perform the following approach: Let M be a data set of N descriptors and n samples. First, we generate bootstrap data sets M_k, k = 1, ..., B = 500, possessing sample size p = 200, where p < n. Then, for each data set M_k, we perform a correlation test [30,31] between each pair of descriptors and obtain a p-value p_ij for each pair. Thus, we test N(N-1)/2 hypotheses for all pairs. In order to control the false positives in this multiple hypothesis testing problem, we use the Bonferroni method for multiple testing correction (MTC) [32] and obtain adjusted p-values, denoted by q_ij for each pair. In order to decide whether the correlation between a pair is significant, we choose alpha = 0.00001. After applying the correlation test and MTC, we obtain a binary matrix I_{M_k}, defined by I_{M_k}(i, j) = 1 if q_ij <= alpha, and I_{M_k}(i, j) = 0 otherwise. Finally, we calculate a summary statistic T(i, j) for each pair of descriptors by averaging these values, i.e., T(i, j) = (1/B) sum_{k=1}^{B} I_{M_k}(i, j). In order to decide whether the correlation between two descriptors is strong, we choose a cut-off threshold alpha_sum = 0.99. If the summary statistic between two descriptors satisfies T(i, j) >= alpha_sum, then we define the two descriptors to be strongly correlated with each other.

Table 2. The optimal number of clusters for the three data-sets obtained by using consensus indices (CI).

Data-set | CI | # of clusters (|P|) | # Descriptors in each cluster
MS_2265 | 0.942 | 5 | |c_1| = 863, |c_2| = 22, |c_3| = 18, |c_4| = 1, |c_5| = 15
C_15 | 0.9878 | 16 | |c_1| = 764, |c_2| = 32, |c_3| = 12, |c_4| = 26, |c_5| = 2, |c_6| = 10, |c_7| = 9, |c_8| = 6, |c_9| = 6, |c_10| = 1, |c_11| = 1, |c_12| = 1, |c_13| = 2, |c_14| = 6, |c_15| = 24, |c_16| = 17
N_8 | 1.00 | 7 | |c_1| = 834, |c_2| = 3, |c_3| = 12, |c_4| = 26, |c_5| = 27, |c_6| = 14, |c_7| = 3

The optimal clustering solution (for each of the three data-sets) is represented by the set P = {c_1, c_2, ..., c_|P|}, where |P| is the optimal number of clusters in the data. doi:10.1371/journal.pone.0083956.t002

Table 3. The descriptors in predicted clusters (rows) overlapping with different categories of descriptors.

The descriptors in the set S are determined as follows. We choose the descriptor D_i having the maximum number of summary statistics >= alpha_sum with the remaining descriptors, add it to S, and then remove D_i together with all descriptors with which D_i has a summary statistic >= alpha_sum. We apply the same procedure to the remaining descriptors until no descriptor has a summary statistic >= alpha_sum with any of the remaining descriptors. Note that some of the descriptors do not have a summary statistic >= alpha_sum with any of the other descriptors. These descriptors are lowly correlated, and such descriptors are also included in the subset S. This procedure reduces the |D| descriptors to |S| descriptors. That means, starting with the set D of descriptors, we hypothesize that the set S identifies structural properties of a graph class distinctly. The remaining |D| - |S| descriptors show a strong similarity (correlation) with at least one of the descriptors of the set S.
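The summary statistic and the greedy reduction just described can be sketched as follows. This is our reading of the procedure (in particular, that each selected hub descriptor D_i is retained in S while its strongly correlated partners are discarded); the function and variable names are our own:

```python
def summary_statistic(indicator_matrices):
    """T(i, j): fraction of the B bootstrap data sets in which the
    pair (i, j) was significantly correlated (q_ij <= alpha)."""
    B = len(indicator_matrices)
    N = len(indicator_matrices[0])
    return [[sum(I[i][j] for I in indicator_matrices) / B
             for j in range(N)] for i in range(N)]

def select_subset(T, names, alpha_sum=0.99):
    """Greedy sketch of the subset selection: repeatedly move the
    descriptor with the most strong correlations (T >= alpha_sum)
    into S and drop its strongly correlated partners; descriptors
    left with no strong partner also join S."""
    remaining = set(range(len(names)))
    S = []
    while remaining:
        # strong partners of each descriptor within the remaining set
        strong = {i: [j for j in remaining
                      if j != i and T[i][j] >= alpha_sum]
                  for i in remaining}
        # hub: descriptor with the maximum number of strong partners
        hub = max(remaining, key=lambda i: len(strong[i]))
        S.append(names[hub])
        remaining.discard(hub)
        if not strong[hub]:
            # no strong partners left anywhere: the rest are lowly
            # correlated and all belong to S
            S.extend(names[i] for i in sorted(remaining))
            break
        for j in strong[hub]:
            remaining.discard(j)
    return S
```

For instance, if descriptor d0 is strongly correlated with d1 and d2 while d3 is correlated with none, the procedure returns S = {d0, d3}, and the discarded d1, d2 each correlate strongly with a member of S, as required.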

Interpretation of the Results
The clustering of the descriptors for the three data-sets is shown in Figure 1. In this figure, the six categories of descriptors are shown in different colors. The figure indicates that the descriptors of each category have not been clustered according to their respective groups. For the external validity of the resulting clustering solution, we estimated NMI_max (normalized mutual information) [26] between the reference cluster RC = {c_1, c_2, c_3, c_4, c_5, c_6} (the descriptors of the six categories, |RC| = 6, with {|c_1| = 24, |c_2| = 301, |c_3| = 57, |c_4| = 28, |c_5| = 40, |c_6| = 469}, are considered as the groups of the reference cluster) and the clusters of the clustering solution obtained by cutting the dendrogram at different heights. The normalized mutual information is estimated by sampling the data B = 200 times. Results for the three data-sets (average NMI) are shown in Figure 2. The average normalized mutual information between the reference cluster and the clusters created by performing average hierarchical clustering shows that they are quite dissimilar; that is, the predicted clusters and the reference cluster are not similar at all. Also, the descriptors of different categories are strongly correlated with each other.
Next, we predict the optimal number of clusters, P = {c_1, c_2, ..., c_|P|}, by using the consensus index measure for different numbers of clusters generated by a clustering solution. The plots of the consensus indices for the three data sets are shown in Figure 3. The consensus indices are calculated for k = 2, ..., 200 clusters. The CI does not show an absolute maximum for any of the three data-sets. Therefore, we selected the first local maximum, which gives the optimal number of clusters. The optimal numbers of clusters are shown with a dotted red line in Figure 3. The consensus indices (CI) for the optimal number of clusters (|P|) and the total number of descriptors (|c_i|, where i = 1, ..., |P|) in each cluster for the three data-sets MS_2265, C_15 and N_8 are shown in Table 2. The optimal number of clusters is very small for all three data-sets. The first cluster is the largest one; it contains more than 80% of the 919 descriptors. The cardinalities of the remaining clusters are considerably smaller. The largest cluster for all three data-sets contains descriptors from all six categories, which means that most of the descriptors from different categories are strongly correlated with each other and, therefore, measure structural information similarly.
As a next step, we examine the so-called overlap between the optimal clusters shown in Table 2 and the six categories of descriptors. That means we have to determine how the descriptors of each category are distributed over the different groups (belonging to the optimal number of clusters). This distribution over different clusters indicates which category might capture structural information of the graphs more uniquely than others. The results are shown in Table 3, and we interpret them as follows. The intersection of the descriptors between the optimal clusters and the categories of descriptors shows that the edge adjacency indices are spread over more different clusters for all three data-sets in comparison to the remaining categories. The 2D matrix-based descriptors are grouped into different clusters by using C_15 and N_8. The information indices are grouped into two different clusters by using all three data-sets. The measures from the categories walk path counts and topological indices are grouped into different clusters by using C_15 only. This shows that these descriptors behave differently on trees. The overlap indicates that the group of edge adjacency indices contains more descriptors which capture structural information of the graphs differently compared to the other categories.
Next, we find a subset of descriptors S, a proper subset of D, with |D| = 919. The main idea is to find a smaller set of descriptors which are little correlated with each other and, hence, capture structural information uniquely. If they were strongly correlated, they would capture similar structural information of the graphs. Importantly, the remaining descriptors have a much stronger correlation with them. The procedure to obtain the subset S is described in the section 'Methods and Results'. We obtained |S| = 19, 22 and 18 for the MS_2265, C_15 and N_8 data-sets, respectively, as shown in Table 4. The level plots of the correlations for the subsets of descriptors of the three data-sets are shown in Figure 4. For all three data-sets, we can clearly see that the descriptors of these subsets are not strongly correlated. These subsets of descriptors might therefore detect structural features of the molecular graphs uniquely. Moreover, we now examine for all data-sets which descriptors from S (shown in Table 4) belong to which group out of the six categories of descriptors. The results are summarized in Table 5. For each data-set, we start with a different number of descriptors for the different categories. The subset S does not contain any descriptor from the connectivity indices for any of the three data-sets; only two descriptors from walk path counts are contained in S, by using C_15. Two, four and three descriptors from the category topological indices are contained in S for the three data-sets. Three, two and three descriptors from the category information indices are in S. Seven, three and three descriptors from the category 2D matrix-based are in S. Seven, eleven and seven descriptors from the category edge adjacency indices are in S for MS_2265, C_15 and N_8, respectively. These are the maximal numbers of descriptors compared to the other categories of descriptors.
The frequent occurrence of descriptors from the category edge adjacency indices shows again that these descriptors quantify structural information more uniquely than others.

Table 4. Given the subset S; the remaining |D| - |S| descriptors have at least one pair for which the summary statistic T(i, j) is greater than alpha_sum = 0.99 with the |S| descriptors.

Also, we examine the overlap between the descriptors from S and the descriptors in the found clusters; the intersections between them are shown in Table 6. Interestingly, at least one descriptor (for all data-sets) overlaps with the descriptors of each cluster, except for the ninth cluster by using C_15. The overlap with the found clusters shows that the measures contained in S (for the three data-sets) have the potential to quantify unique structural features of molecular graphs.

Summary and Conclusions
In this paper, we have evaluated 919 Dragon descriptors to investigate to what extent these measures quantify structural information of molecular graphs uniquely. From our analysis, it is clear that measures which are strongly correlated are not useful as they capture structural information similarly. From this, the question of determining the usefulness or quality of topological indices arises.
By calculating the information-theoretic quantity NMI, we found that the six categories of descriptors used here are strongly correlated with each other. This indicates that, despite being categorized into different groups, these descriptors provide similar information. From this, one can conclude that many of them have been introduced without sufficient consideration. Again, the question of how useful such indices are seems quite important and deserves further attention.
For all three data sets, the most suitable descriptor subset S contains those measures which have the largest number of significant correlations with the remaining descriptors but are not significantly correlated with each other. S forms a reduced set of descriptors (the original set contains 919 descriptors), and its size is a feasible approximation of the effective dimension of the descriptor space. For each individual data set, we found the size of S to be 19 (MS_2265 data set), 22 (C_15 data set) and 18 (N_8 data set). Because most of the descriptors we have used are redundant, i.e., they are highly correlated, the estimation of the effective dimension is an intriguing problem.

Table 6. The overlap between S and the predicted clusters (rows).

Number of cluster | Descriptors of S

In our context, the dimension is the number of different descriptors among all. By performing our analysis, we obtained a lower bound on the dimension of the descriptor space regarding the different classes. Note that these descriptors (the ones in S) depend on the used data set. By inspecting these subsets, we see that the majority thereof are from the category of the edge adjacency indices. This implies that edge-adjacency-based descriptors capture more structural diversity when quantifying structural properties of molecular graphs. As another result of this paper, we see that it would not be appropriate to select descriptors more or less randomly for QSAR problems; neither a random selection nor using all available descriptors would be appropriate, as demonstrated in our paper. To tackle this problem, we suggested a statistical analysis evidenced by clustering. Again, we note that our method, applied to six categories of descriptors, reduces the descriptor space for the three data sets. In this paper, we have presented a statistical approach based on correlation tests to select a smaller subset of descriptors which capture structural information distinctly. By employing bootstrapping and a probabilistic measure for the selection process, we have identified the most informative set of descriptors. Determining which set of descriptors covers a given data set best is an important issue, and studying it in depth might be future work.

Author Contributions
Analyzed the data: ST FES MD. Wrote the paper: MD FES ST.