Information Indices with High Discriminative Power for Graphs

In this paper, we evaluate the uniqueness of several information-theoretic measures for graphs based on so-called information functionals and compare the results with other information indices and non-information-theoretic measures such as the well-known Balaban index. We show that, by employing an information functional based on degree-degree associations, the resulting information index outperforms the Balaban index tremendously. These results have been obtained by using nearly 12 million exhaustively generated, non-isomorphic and unweighted graphs. Also, we obtain deeper insights on these and other topological descriptors when exploring their uniqueness by using exhaustively generated sets of alkane trees representing connected and acyclic graphs in which the degree of a vertex is at most four.


Introduction
To quantify the topology of networks, numerous topological descriptors, which are also often referred to as graph measures or indices, have been developed [1][2][3][4][5][6][7]. A property thereof called the uniqueness, discriminative power or degeneracy has been investigated extensively in mathematical chemistry and structure-oriented drug design in the context of characterizing the structure of molecules quantitatively. In general, a descriptor is called degenerate if it possesses the same value for more than one graph. In this paper our main task is to examine the extent to which topological indices are degenerate.
We briefly review the most important contributions to this problem, starting with a classical contribution due to Bonchev et al. [8,9]. They proposed the so-called magnitude-based information indices to improve on the discriminative power of other classical descriptors for alkane trees [8] and isomers [9]. Alkane trees are connected and acyclic graphs in which the degree of a vertex is at most four [10]. Following this, Raychaudhury et al. [11] analyzed the discriminative power of information-theoretic measures based on distances for chemical graphs containing one ring. Konstantinova et al. [12] explored the uniqueness of various information-theoretic and non-information-theoretic measures by using polycyclic structures representing cata-condensed benzenoid hydrocarbons. As a result, the Balaban J index (see equation 20), the sum of local vertex entropies due to Konstantinova [12,13] and the magnitude-based information indices turned out to be unique for this class of graphs; see [12]. However, note that the sizes of the corresponding sets $C_i$, denoted by $|C_i|$, were rather small, $2 \le |C_i| \le 1681$. Diudea et al. [14] recently explored a novel super-index based on shell matrices and polynomials. By applying this index to the heterogeneous graph database MS2265 [15] containing 2265 non-isomorphic skeleton graphs, inferred from chemical compounds, and to chemical isomers, it turned out that this index does not have any degeneracy [14]. Other results obtained when applying further topological descriptors to chemical graph databases can also be found in [14]. Hu and Xu [16] applied an index using layer matrices and powers of extended adjacency matrices to over two million weighted alkane isomers. The index was unique for all graph classes used [16], but we point out that the developed index is based on using bond types and 3D information.
In order to underpin the practical importance of exploring uniqueness, it seems reasonable that an appropriate graph measure to characterize the structure of networks quantitatively should be able to discriminate graphs properly (e.g., when slightly changing the structure of a network). Note that this problem has already been discussed in the context of complex networks; see [17]. As to applications thereof, Dehmer et al. [15] have already outlined that unique measures can serve as candidates for calculating the identification codes of networks (e.g., chemical structures), which could be used to perform fast structure searches in large databases. Also, such highly discriminating measures representing graph invariants (the measured value is invariant under graph isomorphisms [10]) can be useful to tackle the graph isomorphism problem, because, if the values of two graphs with the same number of vertices are different, they must be non-isomorphic. Hence, such indices could be employed to tackle the graph isomorphism problem in large databases, as the computational complexity of the measures is polynomial. That means instead of performing a thorough isomorphism test which may be computationally costly, highly unique graph measures could be used to filter out non-isomorphic graphs. Note that the time complexity of some of these measures has already been discussed in [15].
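The filtering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graphs are plain adjacency dictionaries, the invariant is the sorted degree sequence (chosen here because it is exact and cheap, not because it is one of the indices studied in this paper), and the helper names `degree_sequence` and `candidate_groups` are hypothetical.

```python
from collections import defaultdict


def degree_sequence(adj):
    """Sorted degree sequence: a simple graph invariant (illustrative choice)."""
    return tuple(sorted(len(nbrs) for nbrs in adj.values()))


def candidate_groups(graphs, invariant):
    """Group graphs by invariant value.

    Graphs with different invariant values must be non-isomorphic, so a
    full isomorphism test is only needed inside the returned groups.
    """
    buckets = defaultdict(list)
    for name, adj in graphs.items():
        buckets[invariant(adj)].append(name)
    return [names for names in buckets.values() if len(names) > 1]


# Toy data: a path, a star, and a relabelled copy of the path (4 vertices each).
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
star4 = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
path4_relabelled = {0: {2}, 2: {0, 3}, 3: {2, 1}, 1: {3}}

graphs = {"path": path4, "star": star4, "path_relabelled": path4_relabelled}
groups = candidate_groups(graphs, degree_sequence)
print(groups)  # only the two path graphs remain as isomorphism candidates
```

With a real-valued index in place of the degree sequence, the values would have to be compared up to a chosen numerical tolerance; the more discriminating the index, the smaller the groups that still require an exact test.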
The main contribution of this paper is to evaluate the discriminative power of selected topological indices in the context of complex networks, i.e., graphs that are neither regular nor random [18]. We applied several information-theoretic and non-information-theoretic measures, such as the Balaban J index [19], to nearly 12 million exhaustively generated, non-isomorphic and unweighted graphs with the same number of vertices (see 'Numerical results and interpretation'). Importantly, we only use unweighted graphs in this study, as discriminating such graphs on a large scale poses an extra challenge to the underlying descriptors. We emphasize that the Balaban J index has often been referred to as one of the most discriminative indices (see e.g. [20]), as it is powerful when applied to several classes of isomers and alkane trees. Our study highlights the limitations of the Balaban J index and other topological descriptors in terms of their ability to discriminate non-isomorphic graphs uniquely.
We show that one of the information indices due to Dehmer et al. [15,21], which uses the information functional $f^{\Delta}$ based on degree-degree associations, outperforms the Balaban J index tremendously when these measures are applied to exhaustively generated graphs. We also employ other information measures for graphs using so-called information functionals that have been developed by Dehmer et al. [15,21]. The discriminative power of some of these information measures and classical ones has already been evaluated in [22] specifically for chemical graphs possessing structural constraints. By contrast, we perform a large-scale study to compare the discriminative power of these information measures by employing three information functionals (see equations 7, 8, and 18) and non-information-theoretic indices such as the Balaban J index using exhaustively generated graphs without structural constraints. The discriminative power obtained with these particular information functionals and with the Balaban J index has not yet been investigated on a large scale.
The results can be interpreted as an attempt to evaluate the uniqueness of quantitative graph measures in the context of complex networks. To the best of our knowledge, very little work has so far been done to tackle this problem. One exception is the work of Kim et al. [17], who evaluated the discriminative power of graph complexity measures that were developed in the context of network physics. As a result, most of the complexity measures proposed in [17] turned out to show little discriminative power.
This paper is organized as follows. In the section 'Topological descriptors' we briefly recall the definitions of the information-theoretic measures due to Dehmer et al. and the other graph measures that we are going to use. The 'Data and software' section describes the datasets and sketches the steps to calculate the topological descriptors. In 'Numerical results and interpretation', we present and interpret the numerical results when evaluating the discriminative power of the measures. This includes a statistical analysis to investigate the dependence of the uniqueness of the Balaban J index and $I^{\lambda}_{f^{\Delta}}$ on the sample size by using exhaustively generated graphs with 10 vertices. The paper finishes with a 'Summary and conclusion'.

Topological Descriptors
In this section, we briefly recall the definition of the information measures [4,15,21] that we are going to use in this study. Further, we outline the concept of distance-based descriptors, including the well-known Balaban J index. In summary, Table 1 gives an overview of the descriptors that we use.
Information Indices. To start, we point out that, besides empirical properties of information measures for graphs [1,4,15,21] (such as determining correlations between the measures [1]), mathematical problems (such as proving various upper and lower bounds of the measures) have also been explored; see [23,24]. Note that the correlation ability between two graph measures generally relates to the problem of whether they capture structural information similarly [1,9]. The so-called implicit information inequalities have been investigated extensively in [21,25,26]. Also, the class of graph entropy measures obtained by using certain information functionals based on the metric properties of graphs (such as the neighborhoods of atoms) has been used to solve problems in quantitative structure-activity relationships (QSARs) and quantitative structure-property relationships (QSPRs) [27]. In particular, Dehmer et al. [28] classified the mutagenicity of molecules by using these measures and employing supervised learning techniques.
Let $G = (V, E)$ be an arbitrary, finite, and unweighted graph; $|V|$ denotes the number of vertices and $|E|$ the number of edges, respectively. Throughout this paper, we use the symbol $|A|$ to express the cardinality (also called the size) of a set $A$. We denote by $\rho(G)$ the diameter of $G$; see [29]. The abstract information functionals [21] $f : V \to \mathbb{R}_{+}$ play a critical role when defining information measures on graphs. Based on these functionals, vertex probabilities
\[ p^{f}(v_i) := \frac{f(v_i)}{\sum_{j=1}^{|V|} f(v_j)} \tag{1} \]
have been assigned to each particular vertex of $G$. This makes the resulting measure independent of determining partitions of graph invariants [1,8,30,31], which might be computationally difficult to obtain. By definition,
\[ \sum_{i=1}^{|V|} p^{f}(v_i) = 1, \tag{2} \]
and $(p^{f}(v_1), \ldots, p^{f}(v_{|V|}))$ therefore forms a probability distribution. Using this approach and recalling Shannon's entropy [32] defined by
\[ H(p) := -\sum_{i=1}^{n} p_i \log(p_i), \tag{3} \]
the families of information measures
\[ I_{f}(G) := -\sum_{i=1}^{|V|} p^{f}(v_i) \log\left(p^{f}(v_i)\right) \tag{4} \]
and
\[ I^{\lambda}_{f}(G) := \lambda \left( \log(|V|) + \sum_{i=1}^{|V|} p^{f}(v_i) \log\left(p^{f}(v_i)\right) \right) \tag{5} \]
have been developed [4,15,21]. These measures are families of entropic measures representing the structural information content of $G$. Here $\lambda > 0$ is a scaling constant, $I_{f}$ is the mean entropy of $G$, and $I^{\lambda}_{f}$ is the information distance between the maximum entropy and $I_{f}$.
In our analysis, we define three distinct functionals $f^{V}$, $f^{P}$, and $f^{\Delta}$, and the relative information measures $I^{\lambda}_{f^{V}}$, $I^{\lambda}_{f^{P}}$, and $I^{\lambda}_{f^{\Delta}}$ [4,5,21]. To define $f^{V}$, we first define the $j$-sphere of a vertex $v_i \in V$ by [21,33]
\[ S_j(v_i, G) := \{ v \in V \mid d(v_i, v) = j,\ j \ge 1 \}, \tag{6} \]
where $d(v_i, v)$ denotes the shortest-path distance between $v_i$ and $v$. Then $f^{V}(v_i)$ is obtained from the cardinalities $|S_j(v_i, G)|$, $j = 1, 2, \ldots, \rho(G)$, weighted by coefficients $c_j$ (equation 7); see [21]. To define $f^{P}$, the path lengths for $j = 1, 2, \ldots, \rho(G)$ of the local information graph $L_G(v_i, j)$ starting from a particular vertex have been used; see [21] for its detailed definition. For example, $P(L_G(v_i, j))$ is the sum of all path lengths starting from $v_i \in V$ by inducing shortest paths for $j = 1, 2, \ldots, \rho(G)$. Weighting these sums by the coefficients $c_j$ yields $f^{P}$ (equation 8); see [21].
Finally, we define $f^{\Delta}$; see [34]. Let $G$ be an undirected and unweighted graph. The sets of shortest paths and the corresponding degree sequences defined in [34] have been used to define the information functional $f^{\Delta}$; see equation 18. As we employ the differences $|d(v) - d(u)|$, the resulting graph entropies $I_{f^{\Delta}}$ and $I^{\lambda}_{f^{\Delta}}$ have been called degree-degree association indices; see [34]. The functional $f^{\Delta}$ has been defined in [34] and is well defined for any $\alpha > 0$. Since $f^{V}$, $f^{P}$ and $f^{\Delta}$ as well as the resulting entropies are parametric, we need to choose the coefficients $c_i$ for weighting the structural differences or characteristics of a graph. Note that the $c_k$ must be chosen such that at least two coefficients $c_i, c_j$ are distinct. This includes, e.g., the linear (lin), quadratic (quad), and exponential (exp) parameter settings, which have already been used in [15]. Other configurations of the $c_i$ have also been investigated to determine the structural complexity of chemical structures meaningfully [15].
Distance-Based Topological Descriptors. Numerous topological descriptors have been explored by employing distances in a graph [7,19,29]. Seminal work was done by Skorobogatov and Dobrynin [29], who developed a theory on the metric properties of graphs. Also, several distance-based graph measures have been developed and analyzed; these indices have shown that distances in graphs capture significant information when applied in QSAR/QSPR; see [1,7,11,19,27].
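To make the information functional approach of the 'Information Indices' paragraph more concrete, the following Python sketch computes the vertex probabilities, the mean entropy $I_f$ and the information distance $I^{\lambda}_f$ for a small connected graph. It is a sketch under explicit assumptions, not the implementation used in this study: the functional is taken to be an exponential of the weighted $j$-sphere cardinalities, the coefficients decrease linearly from $\rho(G)$ to 1, logarithms are taken to base 2, and the function names and the default values of `alpha` and `lam` are hypothetical.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def information_indices(adj, alpha=2.0, lam=1.0):
    """Return (I_f, I_f^lambda) for a connected graph with >= 2 vertices."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    diameter = max(max(d.values()) for d in dist.values())
    # ASSUMPTION: linearly decreasing coefficients c_1 = rho, ..., c_rho = 1.
    coeffs = [diameter - j for j in range(diameter)]
    f = {}
    for v in adj:
        # sizes[j - 1] = |S_j(v, G)|, the number of vertices at distance j.
        sizes = [0] * diameter
        for d in dist[v].values():
            if d >= 1:
                sizes[d - 1] += 1
        # ASSUMPTION: exponential functional of the weighted sphere sizes.
        f[v] = alpha ** sum(c * s for c, s in zip(coeffs, sizes))
    total = sum(f.values())
    probs = [f[v] / total for v in adj]  # vertex probabilities, sum to 1
    mean_entropy = -sum(p * math.log2(p) for p in probs)
    info_distance = lam * (math.log2(len(adj)) - mean_entropy)
    return mean_entropy, info_distance


path3 = {0: {1}, 1: {0, 2}, 2: {1}}
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(information_indices(path3))
print(information_indices(cycle4))
```

In the 4-cycle all vertices are equivalent, so the vertex probabilities are uniform, the mean entropy is maximal and the information distance vanishes; the path on three vertices already gives a non-uniform distribution.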
We recall the definition of the Balaban J index [7,19] in detail as we place emphasis on comparing its discriminative power with $I^{\lambda}_{f^{V}}$, $I^{\lambda}_{f^{P}}$, and $I^{\lambda}_{f^{\Delta}}$ on a large scale by using exhaustively generated graphs. The names and symbols of the remaining descriptors used in this study can be found in Table 1. For their formal definitions, see [1,2,7,27]. Now, we define the distance matrix [35] of a graph $G$ as
\[ D := \left( d(v_i, v_j) \right)_{v_i, v_j \in V}, \]
where $d(v_i, v_j)$ is the shortest-path distance between $v_i$ and $v_j$. $DS_i$ denotes the distance sum (row or column sum) obtained by adding the entries in the corresponding row or column of the distance matrix $D$. In addition, $\mu := |E| + 1 - |V|$ is the cyclomatic number [36]. Then, the Balaban J index is defined by [19]
\[ J(G) := \frac{|E|}{\mu + 1} \sum_{(v_i, v_j) \in E} \left[ DS_i \, DS_j \right]^{-\frac{1}{2}}. \tag{20} \]
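As a small worked illustration of the Balaban J index, the sketch below evaluates it for an unweighted, connected graph given as an adjacency dictionary, using the standard form of the index (number of edges divided by the cyclomatic number plus one, times the sum over all edges of the inverse square root of the product of the end-point distance sums). The function names are hypothetical, and the code is not the QuACN implementation used in this study.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def balaban_j(adj):
    """Balaban J index of a connected, unweighted graph."""
    n_vertices = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    mu = n_edges + 1 - n_vertices  # cyclomatic number
    # Distance sum DS_i = row sum of the distance matrix for vertex i.
    ds = {v: sum(bfs_distances(adj, v).values()) for v in adj}
    # Every edge appears twice in the adjacency lists, hence the factor 1/2.
    edge_sum = 0.5 * sum(
        (ds[u] * ds[w]) ** -0.5 for u in adj for w in adj[u]
    )
    return n_edges / (mu + 1) * edge_sum


path3 = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(balaban_j(path3))     # 4 / sqrt(6)
print(balaban_j(triangle))  # 2.25
```

For the path on three vertices the distance sums are 3, 2, 3 and the cyclomatic number is 0, giving $J = 2 \cdot 2/\sqrt{6}$; for the triangle all distance sums equal 2 and the cyclomatic number is 1, giving $J = (3/2) \cdot (3/2) = 2.25$.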

Data and Software
Let us now state the definitions and generation procedure of the graphs for performing our analysis.
Definition 1 $N_i$ is the set of all exhaustively generated non-isomorphic and connected graphs with $i$ vertices.
Definition 2 $C_i$ is the set of all exhaustively generated non-isomorphic alkane trees with $i$ vertices.
Then for both classes (see Definitions 1 and 2), the structure information has been converted into the graphNEL format to calculate the descriptors in R [40] by employing the QuACN package [41]. This package contains R functions of over a hundred topological descriptors.

Numerical Results and Interpretation
In this section, we present the numerical results when evaluating the discriminative power of the information indices, the Balaban J index and other topological descriptors. Results on exhaustively generated graphs are summarized in Tables 2 and 3, while those on alkane trees are given in Table 5. In total, we evaluated the discriminative power of 27 graph measures.
Evaluation of the Discriminative Power Using Exhaustively Generated Graphs. To interpret the numerical results, we start by considering Table 3 and observe that the sensitivity values due to Konstantinova [12], $S = (|G| - ndv)/|G|$, for Balaban J decrease with an increasing number of vertices; see also the 'Statistical analysis' section. Throughout this paper, $ndv$ (non-distinguishable values) stands for the number of non-isomorphic graphs whose values cannot be distinguished by a particular index [12]. For example, by considering the class $N_8$, 61.6623% of the graphs could be distinguished (i.e., have unique values) by the Balaban J index. For $N_{10}$, only 20.5633% out of almost 12 million exhaustively generated non-isomorphic graphs could be distinguished by J. But we can see in Table 3 that the information indices using the information functional approach [4,15,21] sketched in the 'Information indices' section can discriminate our graphs comparatively well. In particular, $I^{\lambda}_{f^{\Delta}}$ with an exponential weighting scheme (exp) successfully discriminates 94.8005% of almost 12 million exhaustively generated graphs. In view of the large number and complexity of the graphs (see $|N_8|$, $|N_9|$ and $|N_{10}|$), the uniqueness of $I^{\lambda}_{f^{\Delta}}$ is striking. Observe that, for all weighting schemes [15], i.e., lin, quad, and exp, $I^{\lambda}_{f^{V}}$ is much less discriminative. We realize that the underlying information functional $f$ is crucial for reaching uniqueness of the information index. Also, we can clearly see that the uniqueness of other indices shown in Table 3 is quite low. We see that the Balaban U and X indices are among the best out of the set of known measures that we have chosen to perform this study.
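The sensitivity measure used above is easy to compute once the index values of all graphs in a class are available. The following sketch assumes that $ndv$ counts every graph whose index value is shared with at least one other graph, and that floating-point values are compared after rounding to a fixed number of decimals; the rounding precision `decimals` and the toy input values are illustrative assumptions, not data from this study.

```python
from collections import Counter


def sensitivity(values, decimals=9):
    """Return (S, ndv) with S = (|G| - ndv) / |G|.

    ndv is the number of graphs whose (rounded) index value coincides
    with the value of at least one other graph in the set.
    """
    rounded = [round(v, decimals) for v in values]
    counts = Counter(rounded)
    ndv = sum(c for c in counts.values() if c > 1)
    return (len(values) - ndv) / len(values), ndv


# Toy index values for eight graphs (made up for illustration only).
toy_values = [1.0, 2.0, 2.0, 3.5, 3.5, 3.5, 7.25, 8.0]
s, ndv = sensitivity(toy_values)
print(s, ndv)  # 0.375 5
```

In the toy example, five of the eight values occur more than once, so only three graphs are uniquely identified and $S = 3/8$.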
Interestingly, the situation is somewhat the opposite when considering Table 2. Namely, for $N_5$ and $N_6$, the discriminative power of the Balaban J index is higher than that of some of the information measures based on the information functional approach (e.g., $I^{\lambda}_{f^{V}_{\mathrm{lin}}}$ and $I^{\lambda}_{f^{P}_{\mathrm{lin}}}$). Also, we see that the underlying weighting scheme for the coefficients matters a lot, because $I^{\lambda}_{f^{P}_{\mathrm{exp}}}$ has a higher discriminative power than the Balaban J index for $N_6$ and $N_7$. In summary, we hypothesize that the Balaban J index performs well if the cardinality of the underlying graph set and the order of the involved graphs are rather small. By using a statistical approach, we will verify this hypothesis in the 'Statistical analysis' section. Let us give another example to shed light on the degeneracy of the measures when applying them to graphs in $N_{10}$; see Figure 1 and Table 4. Figure 1 shows four sample graphs in $N_{10}$, where $G_3$ and $G_4$ are structurally quite similar in the following sense. If we remove the edge $\{2,10\}$ in $G_3$ and the edge $\{6,10\}$ in $G_4$, the resulting graphs are isomorphic. From Table 4, we see that these graphs can only be fully distinguished by the degree-degree association index. Evaluating the Balaban J index on these graphs gives two degenerate graphs, namely $G_1$ and $G_2$. In contrast to this, $I_{loc}$ due to Konstantinova cannot discriminate $G_3$ and $G_4$. Finally, we observe that $I_a$ cannot discriminate any of the four example graphs. This implies that every measure captures structural information differently and, hence, its discriminative power can differ dramatically because of (i) the underlying paradigm to define a graph measure, e.g., information-theoretic vs. non-information-theoretic indices or partition-based vs. non-partition-based, and (ii) the underlying graph invariant to define a measure, e.g., degrees or distances or several graph invariants etc.
Table 3. Exhaustive sets of non-isomorphic graphs. $|N_8| = 11117$, $|N_9| = 261080$, $|N_{10}| = 11716571$.
A comparison of the measures with others (e.g., see Table 3) is critical, as the measures rely on different concepts (e.g., information-theoretic vs. non-information-theoretic indices). In the following, we give plausible reasons why the measures using the information functional approach often capture structural information of exhaustively generated graphs more uniquely and significantly than other information measures for graphs that are based on determining partitions of graph invariants. This can also be underpinned by the numerical results; see Tables 2 and 3. Examples of the latter measures are the magnitude-based information indices $I_D$ and $I^{W}_{D}$ due to Bonchev et al. [8], the degree information index $I_d$ [1] and the topological information content of a graph $I_a$ [31,42].
To construct classical partition-based measures of a graph $G$, we start with a graph invariant $X$ and induce a partitioning according to an equivalence criterion. This results in the equivalence classes $X_1, \ldots, X_k$. The mean entropy is then given by
\[ I(G, X) := -\sum_{i=1}^{k} \frac{|X_i|}{|X|} \log\left( \frac{|X_i|}{|X|} \right). \]
The process of inducing the partitionings might be the reason for obtaining non-unique indices, as many structurally different graphs could possess the same or similar partitionings when using a certain equivalence criterion, e.g., vertex degree equality [1] or topologically equivalent vertices [31,42].
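The degeneracy mechanism described in this paragraph can be reproduced with a minimal Python sketch. It assumes vertex degree equality as the equivalence criterion and base-2 logarithms, and it computes the entropy of the relative class sizes; the function name and the two toy graphs are illustrative choices and do not come from the data sets of this study.

```python
import math
from collections import Counter


def degree_partition_entropy(adj):
    """Mean entropy of the vertex partition induced by equal vertex degrees."""
    n = len(adj)
    class_sizes = Counter(len(nbrs) for nbrs in adj.values()).values()
    return -sum((s / n) * math.log2(s / n) for s in class_sizes)


# Path on four vertices: degrees 1, 2, 2, 1.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
# Complete graph on four vertices minus the edge {2, 3}: degrees 3, 3, 2, 2.
diamond = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}

print(degree_partition_entropy(path4))    # 1.0
print(degree_partition_entropy(diamond))  # 1.0
```

The two graphs are clearly non-isomorphic (three versus five edges), yet both partitions consist of two classes of size two, so the partition-based entropy cannot tell them apart.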
In order to derive information measures using the information functional approach, we assign a probability value (see equation 1) to each individual vertex in a graph by using a certain information functional $f$ capturing its structural information. Examples thereof are equations 7 and 18. That means the information measures given by equations 4 and 5 can be understood as a cumulation of local quantities representing the vertex probabilities. Clearly, each such quantity captures a certain share of the structure of $G$. As the numerical results show, these measures conserve structural information more properly than the partition-based ones and result in highly discriminating measures for several graph classes. Note that other classical descriptors (see Tables 2 and 3), such as the Harary index, the Randić index [43,44] and the complexity index B etc., rely on the simple derivation of structural quantities (e.g., distances or degrees) to obtain a single numerical value characterizing the complexity of the graph. Consequently, their discriminative power is very low; see Tables 2 and 3.
When comparing the uniqueness of $I^{\lambda}_{f^{V}}$ and $I^{\lambda}_{f^{\Delta}}$ (see Table 3), we observe that the difference between the resulting values is tremendous. Note that the graphs of $N_8$, $N_9$, and $N_{10}$ contain cycles. A plausible reason for this is given in Figure 2.
We see on the left-hand side that the $j$-sphere cardinalities are rather small as $j$ approaches $\rho(G)$ and, hence, their contribution to the value of the particular functional for $v_i$ is small too. Also, there is not much variation between the $j$-sphere cardinalities. This could be a reason that the resulting probability values are quite similar to each other, which has a direct influence on the resulting value of the information index and on its uniqueness. In contrast, the right-hand side of Figure 2 shows that the values of $\Delta^{G}(v_i, j)$ are more diverse and, in particular, that those values for $j$ close to $\rho(G)$ are larger than the $j$-sphere cardinalities. This might be a plausible reason why the corresponding vertex probability values differ more and, hence, the resulting entropies as well. As Tables 2 and 3 show, we again emphasize that the discriminative power of an index clearly depends on the underlying graph class.

Evaluation of the Discriminative Power by Using Chemical Graphs. Here we evaluate the uniqueness of the Balaban J index, the information measures using the information functional approach, and the remaining topological descriptors shown in Table 1 by also using chemical graphs. Table 5 depicts the numerical results when applying the measures to chemical alkane trees representing the skeletal graphs. The number of vertices ranges from 19 to 22. We see again that the discriminative power of the Balaban J index decreases when the number of graphs and vertices increases. The Balaban-like indices possess high discriminative power for all four graph classes. Also, we observe that the sum of the local vertex entropies ($I_{loc}$) due to Konstantinova [13,45] has high uniqueness. Interestingly, it is as good as $I^{\lambda}$ [...]. Finally, the numerical results show again that the discriminative power of a structural index strongly depends on the underlying graph class. See, for instance, the results when comparing the uniqueness of $I^{\lambda}_{f^{\Delta}_{\mathrm{exp}}}$ for the alkane trees and exhaustively generated graphs (see Table 3).
Table 4. Index values for the four example graphs depicted in Figure 1.
Descriptive Statistical Analysis. In order to provide further evidence for the stability of the uniqueness of $I^{\lambda}_{f^{\Delta}}$ on exhaustively generated graphs, we perform a statistical analysis by using boxplots. The graph class used for this study is $N_{10}$. For computational reasons, the statistical analysis cannot be performed on the entire set $N_{10}$. Hence, we choose subsets of $N_{10}$ whose sizes are called sample sizes. We perform the same boxplot analysis for Balaban J, and present the resulting plots to investigate the dependence between uniqueness and sample size; see Figure 3. Concretely, 100 samples of 1100, 3300, 11 000, 33 000, 100 000, and 333 000 randomly chosen graphs out of $N_{10}$ have been analyzed by standard R boxplot routines. That means the medians have been calculated and plotted, with the first and third quartiles as hinges. The whiskers represent the calculated borders of the 95% confidence interval.
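The sampling procedure can be sketched as follows. The study itself relies on R boxplot routines; the Python sketch below only mimics the idea on synthetic data. The population of index values, the sample sizes, and the function names are assumptions made for illustration, and the printed numbers are not results of this paper.

```python
import random
import statistics
from collections import Counter


def uniqueness(values):
    """Fraction of sampled graphs whose index value is unique in the sample."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def sample_uniqueness(population, sample_size, n_samples, rng):
    """Uniqueness of `n_samples` random subsets of the given size."""
    return [
        uniqueness(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)
# SYNTHETIC stand-in for index values: 5000 "graphs", 2000 possible values.
population = [rng.randrange(2000) for _ in range(5000)]

summary = {}
for size in (50, 500, 2000):
    scores = sample_uniqueness(population, size, 100, rng)
    q1, median, q3 = statistics.quantiles(scores, n=4)
    summary[size] = (q1, median, q3)
    print(size, round(q1, 3), round(median, 3), round(q3, 3))
```

For a degenerate index such as the synthetic one above, the median uniqueness drops as the sample size grows, because larger samples contain more graphs that share a value; a stable index in the sense defined below would show almost no such drop.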
As we can see in Figure 3, the uniqueness values are not dispersed for a given sample size, but they depend on the sample size. Further, we observe that the uniqueness of the Balaban J index is not stable when the sample size is varied. In general, we call a measure $I$ unstable if there is a strong dependency between the uniqueness of $I$ and the sample size used in the statistical analysis. In contrast, $I$ is stable if there is only a weak dependency between the uniqueness of $I$ and the sample size.
We see from the boxplot that the uniqueness decreases if the sample size increases. Based on our intuition, it seems reasonable that, the smaller the sample size, the better the discriminative power of the measure under consideration. Thus $I^{\lambda}_{f^{\Delta}}$ possesses a non-trivial property, namely a very high discriminative power for exhaustively generated graphs that is almost independent of sample size. By using the above stated definition, we see that $I^{\lambda}_{f^{\Delta}}$ is stable on $N_{10}$ as the uniqueness is constantly high and does not depend much on the sample size. We see from Table 3 that $I^{\lambda}_{f^{\Delta}_{\mathrm{exp}}}$ is the only topological descriptor possessing this property. Other topological measures, and particularly the Balaban J index, have the trivializing property that, for exhaustively generated graphs, the uniqueness is only reasonable for small sets of graphs.
Hence some of the entropy measures using the information functional approach could be applied successfully for discriminating sets of large complex networks as well. Keep in mind that such classes of exhaustively generated complex networks possess huge cardinalities. Note that the cardinality of the exhaustively generated non-isomorphic graphs with 10 vertices is already greater than 11 million. As we conclude from this statistical analysis, $I^{\lambda}_{f^{\Delta}_{\mathrm{exp}}}$ possesses the stability property that is necessary to achieve feasible results when applied to sets of large complex networks.

Summary and Conclusion
In this paper, we have dealt with the problem of evaluating the discriminative power of topological graph measures by using exhaustively generated, non-isomorphic graphs without vertex and edge weights. We have made an attempt to translate topological indices into the field of complex networks when evaluating their uniqueness. We found that one of the information measures for graphs using the information functional based on degree-degree associations outperformed the Balaban J index tremendously. Also, by using the graph class $N_{10}$, we found that the uniqueness of the Balaban J index is quite sensitive to varying sample size when performing the statistical analysis; see the 'Statistical analysis' section. In particular, the uniqueness of the Balaban J index deteriorated when increasing the sample size. This makes Balaban J in particular non-feasible for discriminating complex networks structurally as they are multicyclic, do not have structural constraints, and the cardinality of an underlying set of such networks is huge. This property was also observed by using other topological indices shown in Table 1. The numerical results when using exhaustively generated graphs and alkane trees can be found in Tables 2, 3, and 5. Altogether, this study clearly shows the limitations of topological indices and restrictions when applying them on a large scale. A topological index can be unique for a particular graph class but fail when applied to another class. In this sense, it is far from trivial that we obtained an index (see the definition of $I^{\lambda}_{f^{\Delta}}$) that turned out to be highly discriminating for exhaustively generated graph classes. Note that the underlying graphs do not possess structural constraints.
As to future work, we will evaluate further topological indices on a large scale to obtain deeper theoretical insights. From such an analysis, one can also learn how the measures capture structural information. This relates to a better understanding of their structural interpretation. We are convinced that these results could also have a positive influence on the future development and investigation of topological graph measures in the context of complex networks.

Author Contributions
Analyzed the data: MD MG KV. Wrote the paper: MD MG KV.