The Discrimination Power of Structural SuperIndices

In this paper, we evaluate the discrimination power of structural superindices. Superindices for graphs represent measures composed of other structural indices. In particular, we compare the discrimination power of the superindices with that of individual graph descriptors. In addition, we perform a statistical analysis to generalize our findings to large graphs.


Introduction
The absence of a polynomial time algorithm for determining if two arbitrary graphs are isomorphic has stimulated efforts to develop efficient heuristics that work in almost all cases. In particular, research on structural network measures has been undertaken in recent decades, see, e.g., [1][2][3][4][5][6]. Several different types of network measures have been developed. Some of them have been used to characterize the structure of graphs locally or globally [2][3][4][5][6][7]. Others have been used to characterize graphs quantitatively, and these have been applied to problems in areas such as structural chemistry, structural drug design, ecology, and computational physics [2][8][9][10]. Bonchev [11] and Balaban et al. [12] developed structural indices to detect branching in molecular graphs. In addition to research directed at measuring structural features of a given network, work has been carried out on comparative network measures [13][14][15][16]. Examples include graph similarity and graph distance measures, which have been applied to graph clustering and other problems, see [17][18][19].
Properties of structural measures have also been examined in some detail. Research in this area encompasses investigations of the mathematical interrelations between network measures [20,21], correlations between measures [22,23], and their respective discrimination powers (also called uniqueness) [24][25][26][27][28][29]. Discrimination power (or the uniqueness property) is the central concern of this paper. In addition to earlier work on the uniqueness of structural graph measures [24][25][26][27][29][30], Dehmer et al. [28,31] recently performed large scale analyses of the uniqueness of information-theoretic, degree-based and eigenvalue-based network measures. Here we focus on single indices defined relative to graph decompositions such as those induced by symmetry structure, distances, vertices, chromatic features, etc. Such an index is a mapping $I: \mathcal{G} \to \mathbb{R}_+$, sometimes referred to as a graph complexity measure [2,5,9]. Single indices interpreted as graph invariants [6] have been studied in areas such as structural chemistry [3,32] and computer science [33]. Also, we emphasize that approaches employing single indices for finding complete graph invariants have failed so far [32,34,35]. A complete graph invariant is an index that distinguishes between any two non-isomorphic graphs in a given collection. The reason for their failure is that every known single index has a certain degree of degeneracy [25,35], that is, the measure cannot distinguish all non-isomorphic graphs by its values. Hence, single structural indices are not suitable for determining graph isomorphism, see [35].
In this paper, we explore the uniqueness of so-called superindices [25,36,37] for graphs (see section 'SuperIndices'). Such superindices have been studied in structural chemistry and other disciplines [25,36,37]. A superindex is a composition of several structural index components, and is designed to obtain a measure which captures structural information more meaningfully than the individual components by themselves. To the best of our knowledge, the uniqueness of superindices [25] has not yet been explored to any great extent. To this end we use exhaustively generated general graphs [28] rather than any special graph classes such as chemical graphs [26,29,30]. The reason for using exhaustively generated general graphs (i.e., graphs without any structural constraints [28]) is to study the uniqueness of the superindices applied to arbitrary graphs. In short, the problem we address is the use of structural superindices that appear useful in determining graph isomorphism. Superindices are not restricted to any particular class of graphs; they can be applied to arbitrary graphs. Furthermore, a graph index is a measure that maps a single graph to the reals. In contrast, a graph metric [14,38,39] is a comparative measure designed to determine the structural similarity between graphs. Such metrics will not be used in this paper. Other graph measures such as the clustering coefficient or degree-based measures do not quantify structural features of graphs meaningfully, as they exhibit a high degree of degeneracy [40].

SuperIndices
Superindices [25,36,37] are combinations of existing indices, where "combination" means algebraic or transcendental operations on the component indices. The term superindex was coined by Bonchev et al. [25] who devised superindices to achieve better discrimination between isomers than was possible using individual graph measures. Dehmer et al. [36] applied information-theoretic superindices to the Ames benchmark dataset of Hansen et al. [41] using supervised machine learning. In addition, Pogliani [37] derived certain superindices and demonstrated their power to predict melting points.
Let $\mathcal{G}$ be a graph class and $I: \mathcal{G} \to \mathbb{R}_+$ a topological index (or descriptor). Given $I_1$ and $I_2$ we define the following superindices, chosen because they are among the simplest and most obvious combinations of two indices and turn out to have high discrimination power, which is, after all, the acid test of the utility of the indices. It is of course possible that other combination methods, based for example on rank reduction techniques such as Singular Value Decomposition, would produce indices with even greater discrimination power. However, that is something to be explored in future papers. We define the superindices as mappings of the pair $(I_1, I_2)$ (Equations 1-9), among them $(I_1, I_2) \mapsto I_1 + I_2$, $(I_1, I_2) \mapsto I_1 \cdot I_2$, $(I_1, I_2) \mapsto I_1 / I_2$, $(I_1, I_2) \mapsto \sqrt{I_1} + \sqrt{I_2}$, and $(I_1, I_2) \mapsto e^{\sqrt{I_1 \cdot I_2}}$.
Balaban et al. [12] proposed similar superindices in QSAR/QSPR [12,42]. That selection proved quite useful and has influenced our choice of superindices for the current study of uniqueness. In the following sections, we analyze the discrimination power of these superindices numerically and statistically. In particular, we demonstrate that some superindices far outperform the underlying single descriptors.
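As a minimal sketch of how such combinations can be evaluated, the following Python fragment applies the superindices that are named explicitly in this paper (the sum, product, quotient, sum of square roots, and exponential of the square root of the product) to a pair of descriptor values. The dictionary keys and the function name `superindex_values` are illustrative assumptions and not part of QuACN; the descriptor values are assumed to be positive reals.

```python
import math

# Superindices named in the text; the dictionary keys are illustrative labels only.
SUPERINDICES = {
    "sum": lambda i1, i2: i1 + i2,
    "product": lambda i1, i2: i1 * i2,
    "quotient": lambda i1, i2: i1 / i2,  # assumes i2 != 0
    "sqrt_sum": lambda i1, i2: math.sqrt(i1) + math.sqrt(i2),
    "exp_sqrt_product": lambda i1, i2: math.exp(math.sqrt(i1 * i2)),
}


def superindex_values(i1, i2):
    """Evaluate every superindex for one graph, given its two descriptor values."""
    return {name: combine(i1, i2) for name, combine in SUPERINDICES.items()}


# Hypothetical descriptor values of a single graph.
values = superindex_values(4.0, 9.0)
```

In practice `i1` and `i2` would be the values of two descriptors computed for the same graph, so each graph in a collection receives one value per superindex.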

Data and Computation
The uniqueness of the superindices listed above has been analyzed on a collection of exhaustively generated graphs [28]. This collection, denoted $N_9$ (with $|N_9| = 261080$) [28], consists of all non-isomorphic connected graphs on 9 vertices. As in [28], the graphs in this collection were generated by the program geng from the Nauty package [43]. The individual indices as well as the superindices were calculated with the aid of the R package QuACN [44,45].
The random graph construction model was selected because it yields the most general class of graphs, and seems appropriate for an initial study of the discrimination power of superindices. Other construction methods, e.g., [46], are also of interest, especially because they model many real world graphs known to exhibit a power law distribution. However, application of the superindices to graphs produced by other construction methods is beyond the scope of the current paper. Table 1 presents the QuACN-descriptors [44] with their input options (parameters) and their abbreviations. Superindices with components drawn from the descriptors in Table 1 have been calculated. The results of these computations (discussed below) are shown in Tables 5, 6, 7, 8, 9, 10, 11, 12. Table 4 shows the uniqueness of QuACN-descriptors for given ndv-values, i.e., the number of the non-distinguishable values (graphs) for a particular index and sensitivity
Figure 3. The means of the sensitivity values $S$ vs. the total number of randomly generated graphs ($|V| = 150$) using the superindex $I = I_1 + I_2$. To calculate the superindex, we used all combinations of eigenvalue-based descriptors (Left) and eigenvalue-based and information-theoretic descriptors (Right), see Table 3. doi:10.1371/journal.pone.0070551.g003
Most of the so-called molecular ID numbers (such as minBalabanID) appear to be highly discriminating but have never been evaluated on general graph classes such as exhaustively generated general graphs. It has also been observed that the uniqueness of structural graph indices depends on the graph class under consideration, see [28,31,50]. Tables 5, 6, 7, 8 present the uniqueness results for certain combinations of descriptors involving the superindices. Each pair of tables shows the results for two subsets of such indices. The first subset consists of Equations 1-5 (e.g., Table 5) and the second subset consists of Equations 6-9 (e.g., Table 6), respectively. For instance, if we look at Table 5, we see that most of the superindices now discriminate the graphs perfectly (ndv = 0) even when indices with very low uniqueness (such as augmentedZagreb, bertz, wiener etc.) are involved. When applying the descriptors radialCentric and eigenvalaugement to the Equations representing the superindices, some of them are much less discriminating (ndv = 79676 corresponds to $S = 0.694833$). This is due to the fact that radialCentric has little discrimination power (it discriminates only two graphs out of 261080). A similar effect can be seen in Tables 9, 10, 11, 12. For instance, Table 9 shows that the composition (based on the superindices) of a descriptor with little discrimination power (e.g., narumiKatayama; ndv = 260925, $S = 0.00059$, see Table 4) with another descriptor having high discrimination power (e.g., eigenvalvertconnect; ndv = 1089, $S = 0.99583$, see Table 4) leads again to a highly unique measure.
In this particular case and by using the superindex $I_1 + I_2$, we find its discrimination power to be ndv = 535 and $S = 0.99801$. Uniqueness (measured by ndv and $S$) of the new measure is better than the uniqueness of the component measures, see Table 9. More extreme cases can be found in Table 12, defined as the composition of the two descriptors topologicalinfocontent and eigenvalvertconnect using the superindex $e^{\sqrt{I_1 \cdot I_2}}$. In short, Tables 5, 6, 7, 8, 9, 10, 11, 12 demonstrate that most of the superindices possess high uniqueness even when one of the constituent graph measures has little discrimination power.
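The two uniqueness measures used above can be sketched in a few lines of Python. The sketch rests on two assumptions that are not spelled out in this section: ndv is taken to be the number of graphs whose index value coincides with that of at least one other graph, and the sensitivity is taken to be $S = (|\mathcal{G}| - \mathrm{ndv}) / |\mathcal{G}|$. Values are rounded to a fixed number of decimals before comparison, which is also an assumption; the function names and the toy descriptor values are hypothetical.

```python
from collections import Counter


def ndv(values, decimals=9):
    """Count graphs whose (rounded) index value is shared with another graph.

    Assumed definition: a graph is non-distinguishable if at least one
    other graph in the collection has the same index value.
    """
    counts = Counter(round(v, decimals) for v in values)
    return sum(c for c in counts.values() if c > 1)


def sensitivity(values, decimals=9):
    """Assumed definition: S = (N - ndv) / N for a collection of N graphs."""
    n = len(values)
    return (n - ndv(values, decimals)) / n


# Toy data: two fully degenerate descriptors on four hypothetical graphs.
i1 = [1.0, 1.0, 2.0, 2.0]
i2 = [0.1, 0.2, 0.1, 0.2]
combined = [a + b for a, b in zip(i1, i2)]  # superindex I1 + I2

ndv_i1, ndv_i2, ndv_sum = ndv(i1), ndv(i2), ndv(combined)
```

In this toy case each component descriptor leaves all four graphs non-distinguishable, while their sum separates every graph, which mirrors the qualitative effect reported for the tables.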
To better understand the behavior of these indices it would be desirable to explore the structural interpretation of these measures. Many of the constituent measures have a structural interpretation associated with a branching index [11,22] (e.g., the Wiener index (wiener)) or as a cyclicity index [12] (e.g., the Balaban index (balabanJ)). A correlation analysis might be used to determine classes of superindices having a distinctive interpretation, e.g., branching, cyclicity, irregularity etc. Such an analysis would involve finding the correlations between $I_1 + I_2$ and $I_1$, $I_1 \cdot I_2$ and $I_1$, $I_1 / I_2$ and $I_1$, etc. However, this is beyond the scope of the present paper.
Figure 4. The means of the sensitivity values $S$ vs. the total number of randomly generated graphs ($|V| = 150$) using the superindex $I = I_1 + I_2$. To calculate the superindex, we used all combinations of eigenvalue-based and distance-based descriptors (Left) and eigenvalue-based and degree-based descriptors (Right), see Table 3. doi:10.1371/journal.pone.0070551.g004
Figure 5. The means of the sensitivity values $S$ vs. the total number of randomly generated graphs ($|V| = 150$) using the superindex $I = I_1 + I_2$. To calculate the superindex, we used all combinations of distance-based descriptors (Left) and distance-based and degree-based descriptors (Right), see Table 3. doi:10.1371/journal.pone.0070551.g005

Statistical Analysis
To determine the scalability of our findings on discrimination power of superindices applied to the graphs in N 9 , we have performed a statistical analysis. The aim of this analysis is to determine whether or not the results for determining uniqueness are statistically stable for graphs with larger numbers of vertices. Central to this analysis is a method for generating random graphs. We used Bootstrapping [51,52] to estimate the underlying sampling distribution.
Let $G = (V, E)$ be a graph with $|V|$ vertices and $|E|$ edges. Now, the size $|E|$ of the edge set of a connected random graph with $|V|$ vertices satisfies $|V| - 1 \le |E| \le |V|(|V| - 1)/2$.
For the statistical analysis see Figures 1 and 2. Samples of random (Erdős-Rényi) graphs have been generated using the R library igraph [53] for $|V| = 50, 75, 100, 150$. More precisely, we have generated 50 random graphs for each of the edge sizes
3. Check each generated random graph for isomorphism with previously generated graphs. If the newly generated graph is not isomorphic to any of the previously generated graphs, we add this graph to the list, and return to step 1.
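The sampling loop can be illustrated with a small, self-contained Python sketch. It is not the igraph-based procedure used for this study: the generation step (drawing a fixed number of edges uniformly at random) and the connectivity test are assumptions about the earlier steps of the algorithm, and the isomorphism check is replaced by a brute-force canonical form that is only practical for very small graphs. All function names are hypothetical.

```python
import itertools
import random


def random_graph(n, m, rng):
    """Assumed generation step: draw m distinct edges uniformly on n labelled vertices."""
    all_edges = list(itertools.combinations(range(n), 2))
    return frozenset(rng.sample(all_edges, m))


def is_connected(n, edges):
    """Depth-first search from vertex 0; connected if every vertex is reached."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def canonical_form(n, edges):
    """Smallest relabelled edge list over all n! vertex permutations (tiny n only)."""
    best = None
    for perm in itertools.permutations(range(n)):
        relabelled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabelled < best:
            best = relabelled
    return best


def sample_nonisomorphic(n, m, how_many, seed=0, max_tries=10_000):
    """Collect connected, pairwise non-isomorphic random graphs with n vertices, m edges."""
    rng = random.Random(seed)
    seen, graphs = set(), []
    for _ in range(max_tries):
        if len(graphs) == how_many:
            break
        g = random_graph(n, m, rng)
        if not is_connected(n, g):
            continue
        key = canonical_form(n, g)  # equal keys <=> isomorphic graphs
        if key not in seen:         # new isomorphism class: keep the graph
            seen.add(key)
            graphs.append(g)
    return graphs


graphs = sample_nonisomorphic(n=5, m=5, how_many=3)
```

For graphs with 50 to 150 vertices the permutation search is infeasible, so a dedicated isomorphism routine from a graph library would be needed instead of `canonical_form`.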
Performing the computation in Algorithm 1, we obtain complete random samples for $|V| = 50, 75, 100, 150$. For the sake of completeness, we also give the sizes of the random samples generated: 225. By choosing $j = 7$, we generated random graphs with $49 \le |E| \le 1218$. Hence, with 50 graphs for each of the $1218 - 49 + 1 = 1170$ edge sizes, we obtain 58500 random graphs in total.
In order to calculate the superindices, we computed all possible (pairwise) combinations of the descriptors given in Table 2. To calculate the mean sensitivity $S$ for each descriptor combination, we bootstrapped the samples 200 times without replacement. Finally, the mean values of all sensitivity values for superindices $I = I_1 + I_2$ and $I = \sqrt{I_1} + \sqrt{I_2}$ together with their variances are shown in Figures 1 and 2. The mean values are quite stable. Thus, there is little dependency between the mean sensitivity and the number of vertices of the generated random graphs. In particular, we see that the mean value deteriorates slightly for $|V| = 150$. In short, Figure 1 strongly supports the hypothesis that the computed superindices have high discrimination power for graphs of increasing size and the values are quite stable. Indeed, stability could be defined here by the degree of the dependency between the mean sensitivity values and the number of vertices. Note that the analysis whose results are shown in Figure 1 was computationally demanding due to the combinatorial explosion of cases. Hence, repeating the analysis for much larger graphs (i.e., $|V| \gg 150$) may not be feasible.
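The resampling step can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation: each of the 200 repetitions draws a subsample of graphs without replacement, evaluates the superindex on it, and records the sensitivity, assumed here to be $S = (N - \mathrm{ndv})/N$ with ndv counting graphs that share their value with another graph. The subsample size, the toy descriptor values and all function names are hypothetical.

```python
import random
import statistics
from collections import Counter


def sensitivity(values, decimals=9):
    """Assumed definition: share of graphs whose rounded value is unique."""
    counts = Counter(round(v, decimals) for v in values)
    non_distinguishable = sum(c for c in counts.values() if c > 1)
    return (len(values) - non_distinguishable) / len(values)


def resampled_sensitivity(i1, i2, combine, subsample_size, repeats=200, seed=0):
    """Mean and variance of S over repeated subsamples drawn without replacement."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        chosen = rng.sample(range(len(i1)), subsample_size)  # no graph drawn twice
        scores.append(sensitivity([combine(i1[k], i2[k]) for k in chosen]))
    return statistics.mean(scores), statistics.variance(scores)


# Toy descriptor values for 100 hypothetical graphs; each descriptor alone is degenerate.
i1 = [float(k % 10) for k in range(100)]
i2 = [float(10 * (k // 10)) for k in range(100)]

mean_sum, var_sum = resampled_sensitivity(i1, i2, lambda a, b: a + b, subsample_size=50)
mean_single, var_single = resampled_sensitivity(i1, i2, lambda a, b: a, subsample_size=50)
```

Comparing `mean_sum` with `mean_single` gives the kind of contrast between a superindex and an individual descriptor that Figures 1 and 2 report, here on artificial data only.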
In contrast to the superindices, the results in Figure 2 show that the discrimination power of the individual descriptors listed in Table 2 is worse for larger graphs. This is indicated by the mean sensitivity values, which are much lower than the ones shown in Figure 1. This demonstrates that the superindices $I = I_1 + I_2$ and $I = \sqrt{I_1} + \sqrt{I_2}$ have a much better discrimination power on the generated random graphs. A reason for this is that the superindices seem to capture structural information more meaningfully than the individual ones. This seems to be clear (for the graph class used) as multiple descriptors capture several different aspects of structural information which may complement each other and, thus, provide a (super) index with improved discrimination power.
The results in Figures 1 and 2 summarize the uniqueness of some superindices as a function of the size of randomly generated graphs. We next consider the relationship between uniqueness (measured by $S$) and sample size. The results are shown in Figures 3, 4, 5. Earlier work by Dehmer et al. [28] on superindices restricted the component individual indices to information-theoretic measures. In the present study, we aim to examine the dependency between the uniqueness of the superindex $I_1 + I_2$ and the sample size, using certain descriptor categories applied to generated random graphs of fixed size ($|V| = 150$). The categories included eigenvalue-based, information-theoretic, distance-based and degree-based descriptors. The descriptors in the categories are listed in Table 3. In order to calculate the mean sensitivity using the descriptors of the above mentioned categories, we bootstrapped the descriptor values 200 times without replacement for each combination to determine $I_1 + I_2$ of randomly generated graphs ($|V| = 150$). The sample sizes are 100, 1000, 10000, 100000, 900000. Figures 3, 4, 5 show the impact of the underlying category on the above mentioned dependency. From Figure 3 we see that there is nearly no dependency between $S$ and the sample size. A plausible reason for this is the high uniqueness of the underlying individual descriptors of the categories employed, namely, (left) eigenvalue-based descriptors and (right) eigenvalue-based and information-theoretic descriptors (see Table 4). Figure 4 shows a similar result but there is a slight deterioration of uniqueness for the degree-based descriptors used to calculate the superindex. This seems plausible as many degree-based measures possess little discrimination power, e.g., see [31]. The left hand side of Figure 5 shows the dependency plot obtained by using the (pure) category of distance-based measures (see Table 3). In particular, the variances are very high and the mean sensitivity values deteriorate substantially as the sample size increases.
Again, this can be understood by the low uniqueness of various distance-based graph measures (see Table 4). The right hand side of Figure 5 shows that this effect is eased for a (mixed) category of descriptors, namely distance-based and degree-based descriptors in the present case. In summary, we see that the uniqueness of the superindex does not depend much on the sample size when the component descriptors are relatively unique. In our study, this applies to the eigenvalue-based and information-theoretic descriptors. It is not surprising that we obtained very similar results by using the superindex $\sqrt{I_1} + \sqrt{I_2}$.
Table 6. ndv-values of graphs in $N_9$ for different combinations of QuACN-descriptors from the second subset of superindices.
Table 7. ndv-values of graphs in $N_9$ for different combinations of QuACN-descriptors from the first set of superindices (continued).

Summary and Conclusion
In the foregoing we examined the discrimination power of structural superindices composed of two or more individual measures (or descriptors) defined on graphs. Our results show that superindices generally have greater discrimination power than individual descriptors. The initial analysis of the superindices was performed on the collection of graphs on nine vertices. In addition, we examined the relative performance of superindices on randomly generated connected graphs on 50, 75, 100, and 150 vertices, respectively. The findings show that the superindices perform consistently over these different sized graphs, whereas individual descriptors exhibit declining performance. We conjecture that this superior performance of superindices is attributable to their taking account of multiple structural features of a graph, rather than the single feature captured by individual descriptors. Further research is needed to account for the differences in performance between different superindices, and between superindices and individual descriptors.