A novel bibliometric index with a simple geometric interpretation

We propose the χ-index as a bibliometric indicator that generalises the h-index. While the h-index is determined by the maximum square that fits under the citation curve of an author when plotting the number of citations in decreasing order, the χ-index is determined by the maximum area rectangle that fits under the curve. The height of the maximum rectangle is the number of citations $c_k$ to the $k$th most-cited publication, where $k$ is the width of the rectangle. The χ-index is then defined as $\sqrt{k c_k}$, for convenience of comparison with the h-index and other similar indices. We present a comprehensive empirical comparison between the χ-index and other bibliometric indices, focusing on a comparison with the h-index, by analysing two data sets: a large set of Google Scholar profiles and a small set of Nobel prize winners. Our results show that, although the χ and h indices are strongly correlated, they do exhibit significant differences. In particular, we show that, for these data sets, there is a substantial number of profiles for which χ is significantly larger than h. Furthermore, restricting these profiles to the cases when $c_k > k$ or $c_k < k$ corresponds to, respectively, classifying researchers as either tending to be influential, i.e. having many more than h citations, or tending to be prolific, i.e. having many more than h publications.


Introduction
The debate in bibliometrics on quality versus quantity in evaluating academic research performance is still an ongoing concern [1]. One perspective is to view the number of publications of a researcher (P) as a measure of quantity and the total number of citations to these publications (C) as a perceived measure of quality; several variants of these, such as the average number of citations per publication, the number of citations to the top or the 10th most cited publication, and the number of publications with at least 10 citations, have also been suggested [2]. Although these simple metrics tend to take into account only one facet of a researcher's impact, several other bibliometric indices, such as the h-index [3], the g-index [4] and generalisations of these [5], combine both citation and publication counts.
An extensive review of the h-index and some of its variants was provided by Egghe in [6], and a comparison of 37 variants of the h-index was given by Bornmann et al. in [7]. In addition, Waltman and van Eck [8] pointed out inconsistencies in the h-index and its variants, and proposed a family of bibliometric indicators that do not suffer from these inconsistency problems. Of particular interest are extensions of the h-index that take into account the full publication list of a researcher, such as the tapered h-index [9]. Proposals for new variants of the h-index continue to appear, for example [10-13], as do comparisons and evaluations, for example [14,15].
Nevertheless, the h-index and its variants do not normally take into account the full citation list of a researcher. This could be perceived as a drawback; however, the total citation count has the disadvantage of biasing the index in favour of researchers with very highly-cited top publications or very many publications with a relatively small number of citations. We now review the h-index and some of its variants, and then introduce the χ-index, a new index that addresses some of the drawbacks mentioned.
The h-index of a researcher is the maximum number $h$ of the researcher's publications such that each has at least $h$ citations [3]. Equivalently, consider the citation vector $\langle c_1, c_2, \ldots, c_n \rangle$ of a researcher, where the $c_i$, the number of citations to publication $i$, are sorted in descending order, i.e. $c_i \ge c_j$ if $i < j$. Here we assume that for all $i$, $c_i > 0$, and that $h$ will be zero in the absence of any citations; this is consistent with defining the value of a bibliometric index of a researcher to be zero if none of the researcher's publications have been cited [16]. The h-index is thus the largest rank $h$ for which $c_h \ge h$. The h-index is completely insensitive to the fact that a researcher's top few publications may be very highly cited, and conversely also to a researcher having a fair number of publications whose number of citations is less than but close to $h$ [17].
A suggested improvement over the h-index, which gives extra weight to highly cited publications, is the g-index. The g-index of a researcher is the largest rank $g$ for which $\sum_{i=1}^{g} c_i \ge g^2$ [4]; it is easily shown that $g \ge h$. A problem with the g-index is that it may still be biased since, if a researcher has a few publications that are very highly cited and the rest have very few citations, the g-index will still be high. This is because the g-index is equal to the largest rank $g$ such that the average number of citations up until that rank is at least $g$.
Suppose the h-index of a researcher is $h$; then the h-core is the set of the $h$ most highly-cited publications for this researcher. The A-index, which is the average number of citations to the publications in the h-core, i.e. $A = \sum_{i=1}^{h} c_i / h$, was defined as an attempt to address the fact that the h-index does not take into account the total number of citations to publications in the h-core [18].
However, the A-index suffers from the fact that taking an average will, all other things being equal, often favour authors with fewer publications when they are highly cited. To remedy this issue, the R-index has been proposed, where $R = \sqrt{\sum_{i=1}^{h} c_i}$ [18]. It is easy to see that $h \le R \le A$.
Nevertheless, the A and R indices, and to a lesser extent the g-index, ignore the effect of publications outside the h-core, which are also part of a researcher's output. A recent proposal is the Euclidean-index [19] (which we call the E-index), designed to take account of the full list of an author's cited publications; it is defined as the Euclidean norm of the citation vector, i.e. $E = \sqrt{\sum_{i=1}^{n} c_i^2}$.
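To make the definitions reviewed above concrete, the following minimal Python sketch computes the h, g, A, R and E indices from a list of citation counts. The function names and the example vector are illustrative assumptions and are not taken from the study; the g-index is restricted to ranks that do not exceed the number of cited publications, following the definition in terms of a rank given above.

```python
import math

def sorted_citations(citations):
    """Citation vector: positive counts only, in descending order."""
    return sorted((c for c in citations if c > 0), reverse=True)

def h_index(citations):
    """Largest rank h with c_h >= h (0 if nothing is cited)."""
    c = sorted_citations(citations)
    # c is descending, so c_i >= i holds exactly for ranks 1..h.
    return sum(1 for rank, cites in enumerate(c, start=1) if cites >= rank)

def g_index(citations):
    """Largest rank g whose top-g citations sum to at least g^2."""
    total, g = 0, 0
    for rank, cites in enumerate(sorted_citations(citations), start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g

def a_index(citations):
    """Average number of citations within the h-core."""
    h = h_index(citations)
    return sum(sorted_citations(citations)[:h]) / h if h else 0.0

def r_index(citations):
    """Square root of the total number of citations within the h-core."""
    h = h_index(citations)
    return math.sqrt(sum(sorted_citations(citations)[:h]))

def e_index(citations):
    """Euclidean norm of the citation vector."""
    return math.sqrt(sum(c * c for c in sorted_citations(citations)))

example = [10, 8, 5, 4, 3]  # hypothetical citation vector
print(h_index(example), g_index(example), a_index(example),
      round(r_index(example), 3), round(e_index(example), 3))
```

For this hypothetical vector the h-core consists of the four most-cited publications, and the values respect the inequalities $g \ge h$ and $h \le R \le A$ stated above.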
In order to motivate the χ-index, we first observe that, given a citation vector for a researcher, for any $k$, $k \le n$, the researcher has at least $k$ publications with $c_k$ or more citations. It follows that the h-index is the largest $h$ such that $c_h \ge h$, so that $c_{h+1} \le h$, i.e. for all $h' > h$, $c_{h'} \le h$. So, for example, if one author has a single publication with 100 citations and another has 10 publications each with 10 citations, then the h-index of the former is 1 while the h-index of the latter is 10. At the other extreme, an author with 100 publications, each with a single citation, has an h-index of 1. The argument for favouring publications with a higher number of citations is normally that of quality versus quantity. However, such an approach, on the one hand, disadvantages a researcher with a few very highly cited publications, who may have carried out some very influential seminal research, whilst, on the other hand, it also disadvantages a prolific researcher who may have many collaborators but fewer citations per publication.
Avoiding the debate of number of citations versus number of publications, we propose an index for which all three aforementioned scenarios, (i) 1 publication with 100 citations, (ii) 10 publications with 10 citations each, and (iii) 100 publications with 1 citation each, are considered as equally desirable. So the χ-index is essentially the largest product $i c_i$ where $1 \le i \le n$; however, for comparison purposes with the h-index, we will actually define the χ-index to be the square root of this, i.e. $\sqrt{i c_i}$. Thus, in all three scenarios the χ-index of the researcher is 10; see Fig 1, which illustrates the three scenarios in a geometrical context. If we let $k$ denote the value of $i$ that maximises $i c_i$, we see that in all three cases, the researcher has exactly $k$ publications with $c_k$ or more citations. It is clear that the h-index cannot be larger than the χ-index, since $\sqrt{h c_h} \ge h$.
A possible future line of research would be to investigate pairwise combinations of the χ-index with other indices, along the lines of the two-variable metrics examined in [2].
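The three scenarios can be checked with a short Python sketch. It is only an illustration of the definition just given; the function names are assumptions, and when several ranks attain the maximum product the sketch simply reports the smallest such rank.

```python
import math

def h_index(citations):
    """Largest rank h with c_h >= h."""
    c = sorted((x for x in citations if x > 0), reverse=True)
    return sum(1 for rank, cites in enumerate(c, start=1) if cites >= rank)

def chi_index(citations):
    """Return (chi, k, c_k), where k maximises the product i * c_i."""
    c = sorted((x for x in citations if x > 0), reverse=True)
    if not c:
        return 0.0, 0, 0
    # max() keeps the first maximiser, i.e. the smallest rank on ties.
    k = max(range(1, len(c) + 1), key=lambda i: i * c[i - 1])
    return math.sqrt(k * c[k - 1]), k, c[k - 1]

scenarios = {
    "(i) 1 publication with 100 citations": [100],
    "(ii) 10 publications with 10 citations each": [10] * 10,
    "(iii) 100 publications with 1 citation each": [1] * 100,
}
for label, vector in scenarios.items():
    chi, k, c_k = chi_index(vector)
    print(f"{label}: h = {h_index(vector)}, chi = {chi:g}, k = {k}, c_k = {c_k}")
```

The h-index takes the values 1, 10 and 1 in the three scenarios, whereas the χ-index is 10 in each of them.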
The χ-index is formally introduced in Section 2, generalising the h-index by allowing the interplay between $k$ (the number of publications, representing quantity) and $c_k$ (the number of citations, representing quality). We also list some properties of the χ-index, which could form the basis of its axiomatisation (cf. [16,20]), and explain the computational methods we use for the empirical analysis in the following sections. In Section 3 we introduce the two data sets analysed, a large Google Scholar data set, described in Subsection 3.1, and a small data set of Nobel prize winners, described in Subsection 3.2. In Section 4 we present the main analysis of the data sets and results obtained. In Subsection 4.1 we analyse the Google Scholar data set, and in Subsection 4.2 we turn our attention to the Nobel prize winners data set. Our main tool here is to partition the researchers into three classes, (i) when $k$ is approximately equal to $h$, (ii) when $k$ is significantly greater than $h$ and (iii) when $k$ is significantly less than $h$. We further partition the data according to whether χ is approximately equal to $h$ or significantly larger than $h$ to get a sense of when these two indices differ. Membership of the classes is determined by a basic bootstrap percentile method [21, Section 5.3.1] described in Section 2. In Section 5 we give our concluding remarks. (We note that we use the terms author and researcher interchangeably.)

Methods
The citation curve is the curve resulting from plotting the number of citations against the ranking of the publications, as specified by the citation vector. The χ-index is the square root of the area of the maximum area rectangle that can fit under the citation curve (see Fig 1). Formally,

\[ \chi = \sqrt{\max_{1 \le i \le n} i\, c_i}, \tag{1} \]

where $c_i$ is the number of citations to publication $i$ in the citation vector $\langle c_1, c_2, \ldots, c_n \rangle$, which represents all cited publications in decreasing order of the number of citations. In the following we let $k$ denote the value of $i$ that maximises $i c_i$. We note that, since the square root is monotonic, it does not affect the ranking of researchers implied by (1). It is, however, convenient for comparison with the h-index and its derivatives. This can be viewed as the requirement from physics, known as dimensional homogeneity, that we only compare quantities that have the same units [22]. The square root accords with the geometrical interpretations of the h and χ indices: the h-index is the square root of the area of the maximal square that fits under the citation curve [23], and the χ-index is the square root of the area of the maximal rectangle. It could also be interesting to consider aggregate functions other than the maximum in (1), for example, the minimum, the average, or the average of the minimum and maximum, although these seem to be rather less intuitive in the context of bibliometrics.
Several researchers have studied various properties of citation indices [16,20] in an attempt to provide objective justification for comparison between indices, and where possible to obtain an axiomatisation of the indices. We list some properties of the χ-index, namely several desirable properties that it possesses and one that it does not; we leave a complete axiomatisation of the χ-index to future work.
where $n$ is the number of cited publications and $c_1$ is the number of citations to the most highly cited publication.

for all
4. The χ-index is monotonic [16,19], in the sense that adding citations to an existing publication or adding a new publication to the list does not lower the index. (Note that the h-index is also monotonic.)
5. The χ-index is scale-invariant [19], in the sense that multiplying the number of citations to each publication by a constant does not change the relative ranking of two citation vectors. (Note that the h-index is not scale-invariant.)
6. The χ-index is not independent [19], since adding a new paper with the same number of citations to two citation vectors may change their relative ranking. For example, the χ indices of both $\langle 2, 2 \rangle$ and $\langle 1, 1, 1, 1 \rangle$ are 2; however, the χ-index of $\langle 2, 2, 1 \rangle$ is still 2 but the χ-index of $\langle 1, 1, 1, 1, 1 \rangle$ is $\sqrt{5}$. (Note that the h-index is also not independent.)
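The scale-invariance and independence properties can be illustrated numerically with the following Python sketch. The independence example uses the vectors quoted in the text; the pair of vectors used for the scaling check is a hypothetical example chosen for illustration, as are the function names.

```python
import math

def h_index(citations):
    """Largest rank h with c_h >= h."""
    c = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(c, start=1) if cites >= rank)

def chi_index(citations):
    """Square root of the largest product rank * citations."""
    c = sorted(citations, reverse=True)
    return math.sqrt(max(rank * cites for rank, cites in enumerate(c, start=1)))

# Independence fails: two vectors tie, but appending the same
# one-citation paper to both separates them.
a, b = [2, 2], [1, 1, 1, 1]
tie_before = chi_index(a) == chi_index(b)             # both equal 2
split_after = chi_index(a + [1]) < chi_index(b + [1])  # 2 versus sqrt(5)

# Scale invariance: doubling every citation count multiplies chi by
# sqrt(2) for both vectors, so their relative order is unchanged.
u, v = [3, 3, 3], [2, 2, 2, 2]   # hypothetical vectors
u2, v2 = [2 * c for c in u], [2 * c for c in v]
chi_order_kept = (chi_index(u) > chi_index(v)) == (chi_index(u2) > chi_index(v2))

# The h-index order of the same pair flips under the same scaling.
h_order_flipped = h_index(u) > h_index(v) and h_index(u2) < h_index(v2)

print(tie_before, split_after, chi_order_kept, h_order_flipped)
```

In general, multiplying all citation counts by a constant $\lambda > 0$ multiplies every product $i c_i$ by $\lambda$, and hence χ by $\sqrt{\lambda}$, which is why the ordering is preserved.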
In the following sections we carry out an empirical analysis of the χ-index, comparing it to the citation indices mentioned in the introduction, but focusing our attention on the comparison of the χ-index and the h-index. We make use of a large data set compiled by Radicchi and Castellano from Google Scholar [24], and also analyse a small data set of 99 Nobel prize winners; both are described in Section 3.
Our initial comparison between the indices is carried out using the Spearman rank-correlation coefficient [25], which demonstrates that the indices we are comparing are all highly correlated, except for $P$, the number of cited publications. We carry out a more in-depth comparison of the χ and h indices in Section 4, by separating authors whose χ and h indices are approximately the same from those for which they are significantly different.
We make use of the bootstrap method [21], which is a technique for computing a statistic that relies on random resampling with replacement from a given sample data set. The bootstrap method is usually nonparametric, making no distributional assumptions about the data set employed. In its basic form, for example, it can be used to estimate the distribution of the population mean by computing sample means over a large number of bootstrap resamples taken from the original data set. The specific method we use to classify the authors is the basic bootstrap percentile method [21, Section 5.3.1]; see also [26], which also uses the bootstrap method in the context of bibliometrics. In particular, we resample author citation vectors 1000 times, with replacement, compute the h-index for each resample, and then compute a 99% one-sided confidence interval for the h-index values, starting from the lowest value obtained from the 1000 resamples. This allows us to determine for a given author whether $k$ is approximately equal to $h$ and, additionally, whether χ is approximately equal to $h$, by checking whether $k$ or χ is in the confidence interval or not.
We thus first partition the authors into three classes, according to whether (i) $k \approx h$, (ii) $k > h$, or (iii) $k < h$, where $\approx$ means approximately equals. The second and third classes capture a tendency of an author towards being prolific when $k > h$, or influential when $k < h$. (This does not imply that when $k \approx h$ the researcher is not prolific or influential; rather, the distinction is meant to highlight the two opposing cases.) We further partition each class according to whether $\chi \approx h$ or $\chi > h$ to see when the indices differ, and to get a sense of the proportion of researchers for which $\chi \approx h$. Finally, we also consider the subclasses of $\chi > h$, depending on whether $c_k > k$ or $c_k < k$.
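The classification procedure can be sketched in Python as follows. This is a hedged illustration and not the code used in the study: it assumes that each resample draws $n$ publications with replacement from a single author's citation vector of length $n$, and that the one-sided interval runs from the smallest bootstrap h-index to the 99th percentile of the bootstrap values. The function names, class labels and the fixed seed are likewise assumptions.

```python
import math
import random

def h_index(citations):
    """Largest rank h with c_h >= h."""
    c = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(c, start=1) if cites >= rank)

def chi_parts(citations):
    """Return (chi, k, c_k) for a non-empty vector of positive counts."""
    c = sorted(citations, reverse=True)
    k = max(range(1, len(c) + 1), key=lambda i: i * c[i - 1])
    return math.sqrt(k * c[k - 1]), k, c[k - 1]

def bootstrap_h_interval(citations, n_resamples=1000, level=0.99, seed=0):
    """Percentile interval [lowest bootstrap h, level-quantile of bootstrap h]."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    n = len(citations)
    hs = sorted(h_index(rng.choices(citations, k=n)) for _ in range(n_resamples))
    return hs[0], hs[math.ceil(level * n_resamples) - 1]

def classify(citations, **bootstrap_options):
    """Return the k-class, the chi-class and, when chi > h, the c_k subclass."""
    low, high = bootstrap_h_interval(citations, **bootstrap_options)
    chi, k, c_k = chi_parts(citations)
    if k > high:
        k_class = "k_gt_h"        # tending towards prolific
    elif k < low:
        k_class = "k_lt_h"        # tending towards influential
    else:
        k_class = "k_approx_h"
    chi_class = "chi_gt_h" if chi > high else "chi_approx_h"
    subclass = None
    if chi_class == "chi_gt_h":
        subclass = "ck_gt_k" if c_k > k else "ck_lt_k"
    return k_class, chi_class, subclass

print(classify([120, 80, 45, 30, 12, 9, 7, 5, 3, 2, 1, 1]))  # hypothetical author
```

Treating a value as approximately equal to $h$ whenever it falls inside the interval mirrors the membership test described above; other interval constructions would change the class boundaries.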

Data sets and preliminary analysis
We now introduce the two data sets, provide some basic statistics of these data sets, and compute the correlations between various indices for the researchers concerned. In Subsection 3.1 we consider the Google Scholar data set and in Subsection 3.2 we consider a data set of Nobel prize winners.

Google Scholar data set
For our main analysis, we made use of a large data set of Google Scholar profiles compiled and made available by Radicchi and Castellano [24]. The full data set contains approximately 90,000 citation vectors of authors across all disciplines, collected between June 29 and July 4, 2012. As in [24], we only included authors who had validated their Google Scholar account, and we removed authors with fewer than twenty publications, publications with no citations and publications dated before 1945. We then filtered the data further to include only authors having a career of five years or more, where the career is deemed to begin from the year of the first published paper within the window of years considered. After this preprocessing step, the final data set we used was reduced to 34,393 citation profiles.
We start by presenting, in Table 1, the basic statistics for the various indices introduced above, where $h$, $g$, $A$, $R$, $\sqrt{E}$, $\chi$, $\sqrt{C}$ and $P$ stand for the h-index, the g-index, the A-index, the R-index, the square root of the Euclidean-index, the χ-index, the square root of the total number of citations and the number of publications, respectively. (We note that we have chosen to use $\sqrt{E}$ and $\sqrt{C}$ for comparison purposes.) It can be seen that the number of cited publications $P$ stands out as a clear outlier, and also $A$, to a lesser extent. Moreover, apart from the minimum, the statistics for $h$ are the lowest, closely followed by $\sqrt{E}$. In Table 2, we present the Spearman rank-correlation coefficient $r$ [25] between the various indices, noting that when computing the Pearson correlation [25] the results were similar; due to symmetry we only present the upper triangle of the correlation matrix. (We note that while the Pearson correlation measures the strength of a linear association between two random variables, the Spearman rank-correlation measures the strength of a monotonic association between the two, which may be nonlinear [27].) We observe that $P$ has the lowest correlation with any of the other indices, and that all the other indices are highly correlated with each other. We further note that $\sqrt{C}$ is indeed highly correlated with all the other indices. From now on, we will concentrate on comparing the h and χ indices, h being the most commonly employed index, and leave detailed comparison to other indices for future work.
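The distinction between the two correlation coefficients can be seen in a small self-contained Python sketch. It computes the Spearman coefficient as the Pearson coefficient of the ranks, with tied values given their average rank; the toy data and function names are assumptions made for illustration and are unrelated to the data sets analysed here.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of a linear association."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: strength of a monotonic association."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [value ** 3 for value in x]   # monotonic but nonlinear in x
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because `y` increases strictly with `x`, the rank correlation is 1 up to rounding, while the Pearson coefficient is strictly smaller since the relationship is not linear.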
We start by showing, as was done in [24], that the probability density functions of the h and χ indices both follow log-normal distributions [28,29]. To this end we introduce the Jensen-Shannon divergence (JSD) [30], which is a nonparametric measure of the distance between two empirical distributions $p = (p_i)$ and $q = (q_i)$, where $i = 1, 2, \ldots, n$.
The formal definition of the JSD, which is a symmetric version of the Kullback-Leibler divergence and is based on Shannon's entropy [31], is given by

\[ \mathrm{JSD}(p, q) = \frac{1}{2 \ln 2} \sum_{i=1}^{n} \left( p_i \ln \frac{2 p_i}{p_i + q_i} + q_i \ln \frac{2 q_i}{p_i + q_i} \right), \]

where we use the convention that if $p_i = 0$ or $q_i = 0$, or both, $0 \ln 0$ and $0 \ln (0/0)$ are both defined to be 0. (The factor $2 \ln 2$ is included to normalise the JSD to be between 0 and 1.) We observe that the JSD is equal to 0 when $p = q$.
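A minimal Python sketch of the divergence is given below. It assumes the usual normalised form, in which each distribution is compared with the mixture $(p + q)/2$ and the sum is divided by $2 \ln 2$, together with the zero conventions stated above; the function name and the toy distributions are illustrative assumptions.

```python
import math

def jensen_shannon_divergence(p, q):
    """Normalised JSD of two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        mixture = p_i + q_i
        # Terms with p_i = 0 or q_i = 0 contribute nothing, which
        # implements the 0 ln 0 = 0 and 0 ln (0/0) = 0 conventions.
        if p_i > 0:
            total += p_i * math.log(2 * p_i / mixture)
        if q_i > 0:
            total += q_i * math.log(2 * q_i / mixture)
    return total / (2 * math.log(2))

same = jensen_shannon_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Identical distributions give 0, and distributions with disjoint supports give the maximum value of 1, matching the stated normalisation.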
In Table 3 we give the mean μ and standard deviation σ of the log-normal distributions fitted by the maximum likelihood method, and the JSD between the empirical distributions of the h and χ indices and the fitted log-normal distributions. The low JSD values indicate good fits for both indices. We also note that the means and standard deviations of the two fitted distributions are quite close.
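For a log-normal distribution the maximum likelihood estimates have a closed form: μ is the mean of the logarithms of the observations and σ is the square root of their mean squared deviation from μ. The Python sketch below illustrates this standard result; the function names and sample values are assumptions, and the sketch does not reproduce how the empirical distributions were formed for the JSD comparison.

```python
import math

def fit_lognormal_mle(values):
    """Maximum likelihood (mu, sigma) of a log-normal for positive data."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    # The MLE divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma

def lognormal_pdf(x, mu, sigma):
    """Density of the log-normal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

sample = [4, 7, 9, 12, 15, 21, 30, 44]   # hypothetical index values
mu, sigma = fit_lognormal_mle(sample)
print(round(mu, 3), round(sigma, 3), round(lognormal_pdf(10, mu, sigma), 4))
```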

Nobel prize winners data set
For our second data set, we collected the citation vectors of 99 Nobel prize winners across a variety of disciplines from the Web of Science platform [32]. We included only authors having twenty or more publications, and only those publications with citations. However, for this data set we considered their full careers without a cutoff date. In Table 4, we present the basic statistics for the Nobel laureates, while in Table 5 we present the Spearman rank-correlation coefficient. As one would expect, the statistics are, overall, much higher than for the Google Scholar data set, although for this data set $A$ is more of an outlier than $P$. On the other hand, the correlations are comparable to those for the Google Scholar data set, although lower on average. In Table 6 we show the parameters of the log-normal distribution fitted by the maximum likelihood method, and the JSD between the empirical distributions of the h and χ indices and the fitted log-normal distributions. As for the Google Scholar data set, the low JSD values indicate good fits for both indices. We again note that the means and standard deviations are quite close.

Analysis and results
We now analyse the data sets introduced in Section 3, with the aim of revealing how authors are separated into classes depending on whether $k \approx h$ or not, or whether $\chi \approx h$ or not. In Subsection 4.1 we analyse the Google Scholar data set, and in Subsection 4.2 we analyse the Nobel prize winners data set.

Results for Google Scholar data set
In Fig 2, we see three examples of authors according to whether (i) $k \approx h$, (ii) $k > h$, or (iii) $k < h$, exhibiting the geometry of the h and χ indices. When $k > h$ there are many publications, each with fewer than $h$ citations (tending towards prolific), and when $k < h$ there are fewer publications, each with more than $h$ citations (tending towards influential). In Table 7, we exhibit the breakdown of the three classes for the Google Scholar data set, noting that $k < h$ is the largest class, the other two comprising just over 53.50% of the data set. It is also apparent that, within the class $k < h$, there are, by some margin, more authors for which $\chi > h$. What this means is that, when χ is significantly larger than $h$, we expect that $k$ will be significantly smaller than $h$, i.e. we expect the author to have several publications with more than $h$ citations, contributing to χ being larger than $h$; this can be justified from the data in Table 7 with the use of Bayes' theorem. This confirms that the χ-index addresses a problem of the h-index, namely that it does not sufficiently take into account highly cited publications. The statistics in Table 8 for the three classes further confirm this property of the χ-index, showing higher average values for the χ-index when $k < h$.
Moreover, it can be seen in Table 9 that out of all authors, there are 28.60% for which χ is significantly larger than $h$, clearly demonstrating the potential of the χ-index to separate authors that may have similar h indices. In addition, the statistics shown in Table 10 indicate higher average values when $\chi > h$. The breakdown of the $\chi > h$ class, when $c_k > k$ and $c_k < k$, can be seen in Table 11, while the basic statistics pertaining to these classes are shown in Table 12. It can be seen that the average values for the larger subclass, $c_k > k$, are much higher than those for the smaller subclass, $c_k < k$.

Results for Nobel prize winners data set
The Nobel prize winners data set looks at the extreme case of researchers having, on average, very high h values and therefore also very high χ values. In Table 13, we see a significant difference from the Google Scholar data set, since for about 80% of the laureates we have $k < h$ and, of those, for over 75% of the authors $\chi > h$. As expected, this implies that, overall, Nobel prize winners are influential. Looking at the statistics in Table 14, we see that when $k < h$, on average, the χ values of researchers are much larger than the h values. This is due to publications with a large number of citations, significantly more than $h$. An interesting observation is that, unlike Table 8, where the values of the χ-index are the highest when $k < h$, in Table 14 χ is highest for the smaller class when $k > h$. This is most likely due to a long tail of highly cited publications for these few laureates.
In contrast to Table 9, it can be seen from Table 15 that $\chi > h$ for over 60% of laureates. However, as the statistics in Table 16 reveal, in contrast to Table 10, the h-index for those Nobel prize winners with $\chi \approx h$ is actually, on average, higher than both the h and χ indices of the laureates with $\chi > h$. This may indicate that for very influential researchers, such as Nobel laureates, when $\chi > h$ the h-index undervalues their contribution. The breakdown of the $\chi > h$ class, when $c_k > k$ and $c_k < k$, can be seen in Table 17, while the basic statistics pertaining to these classes are shown in Table 18. It is interesting to note that, as opposed to the Google Scholar statistics shown in Table 12, the average values for the Nobel laureates subclass $c_k > k$ are, in fact, much lower than those for the subclass $c_k < k$. This latter class is quite small as there are only three such Nobel prize winners; see Table 18. As noted above, this is most likely due to a long tail of relatively highly cited publications for these few laureates.

Concluding remarks
We have presented a new citation index, the χ-index, which addresses some shortcomings of the h-index in terms of the balance between number of citations and number of publications. The χ-index has a simple geometric characterisation in terms of the largest area rectangle that fits under the citation curve; this generalises the h-index for which the rectangle is constrained to be a square.
We have analysed two data sets, a large one from Google Scholar and a small one of Nobel prize winners. Studying these data sets clearly shows the utility of the χ-index. First, as with many of the citation indices that combine number of citations (proxy for quality) with number of publications (quantity), the χ-index correlates strongly with the square root of the total number of citations, yet it is selective in its choice of publications to include in the index. Second, as we have seen from our analysis, there are many researchers whose χ-index is significantly larger than their h-index due to their tendency to be influential, in the case k < h, or prolific in the case k > h. We believe that this property of the χ-index is beneficial and could lead to a more satisfactory ranking of researchers than that obtained using the h-index.