PR-Index: Using the h-Index and PageRank for Determining True Impact

Several technical indicators have been proposed to assess the impact of authors and institutions. Here, we combine the h-index and the PageRank algorithm to do away with some of the individual limitations of these two indices. Most importantly, we aim to take into account value differences between citations by evaluating the citation sources: we define the h-index using the PageRank score rather than citations. The resulting PR-index is then constructed by evaluating source popularity as well as source publication authority. Extensive tests on available data collections (i.e., Microsoft Academic Search and benchmarks on the SIGKDD innovation award) show that the PR-index provides a more balanced impact measure than many existing indices. Due to its simplicity and similarity to the popular h-index, the PR-index may thus become a welcome addition to the technical indices already in use. Moreover, growth dynamics prior to the SIGKDD innovation award indicate that the PR-index might have notable predictive power.


Introduction
The problem of objectively assessing the impact of an individual author has been the subject of intense research in bibliometrics as well as many other fields of research. While it may be relatively easy to distinguish a Nobel Prize winner from an average researcher, it is much more difficult to rank all authors. Yet, many have tried, using quantitative technical analysis of various indicators ranging from the number of publications and patents to the number of citations. Such ranking systems have found widespread use in funding agencies and tenure committees across the world to supplement objective and comprehensive assessment of each individual researcher's impact. These indicators can also provide a fast glimpse into a field of research and aid in identifying experts or, at minimum, the most productive and well-known authors. However, technical indicators are also relatively easy to manipulate, and so care must be exercised; a thorough determination of impact should always include a human evaluation as well.
Numerous indicators have already been proposed. These can be roughly classified into two groups: statistical-based indicators and graph-based indicators. Statistical-based indicators typically depend on the sheer number of publications, patents, citations, or co-authors. Among these, the h-index [1] is probably the most famous and widely used. Graph-based indicators, on the other hand, explore the relationships within an academic network, such as a publication citation network, an author citation network or a co-author network. Author impact can be assessed based on the structural properties of such academic networks in lieu of statistical-based indicators.
The h-index [1], on which the proposed PR-index is based, has several notable advantages and desirable properties. Because of its simplicity and intuitive value, the h-index is used widely in several academic ranking systems, including the Web of Knowledge [2] and Microsoft Academic Search [3]. However, h-index rankings may also be misleading and open to manipulation. For example, self-citations [4] could increase a ranking, although it was originally claimed that this is not an issue. In addition, the h-index treats all citations equally, so it does not take into account the quality of each citation. These disadvantages led to the development of a now-large number of variants of the h-index. Batista et al. [5] have taken field dependence into consideration, thus making it possible to quantify an author's scientific contribution across different research fields. Schreiber [6] has proposed the index h_s to eliminate the negative effects of self-citations. Liu et al. [7] have introduced a modification of the h-index for multi-authored papers with contribution-based author name ranking. Zhang has proposed the h'-index [8] and the e-index [9], which both consider the whole set of citation information available for each author. Bornmann et al. [10] reviewed and tested 37 different h-index variants for consistency and correlations between them.
The many proposed variants of the h-index are aimed specifically at mending some of its aforementioned deficiencies, but so far, few have explicitly taken the rationality of the citations into consideration. Sometimes, citations may not reflect an author's or publication's status accurately [11][12][13][14]. The original h-index may give a high score to an author who has published many highly cited reviews. This reflects the popularity of these publications, but does not always reflect their authority in moving the field ahead.
Compared to statistical-based indicators, graph-based indicators consider the relationship between authors and their publications based on co-authorship and citation networks. PageRank, formulated by Brin and Page [15,16] for assigning a rank to all Web pages, is one of the most famous graph-based indicators. In an academic network, an author will receive a high PageRank score if he or she is cited by (or is a co-author of) many other high-impact authors. For example, although two authors may have the same number of citations (or co-authors), they may receive different PageRank scores because the quality of the citations (or the co-authors) is considered as well. We briefly mention here two PageRank algorithms that are based on different networks:
1. Citation networks of authors: Ding [17] has proposed a weighted PageRank algorithm based on the citation network of authors. In her work, an author will receive a high rank score if that author is cited by many well-respected authors.
2. Collaboration networks of authors: Liu et al. [18] have proposed a PageRank algorithm that considers the frequency of co-authorships and the total number of co-authors on articles. Using this algorithm, highly co-authored and prolific authors will gain reputation. Yan et al. [19] have provided an alternative perspective for measuring author impact by applying a weighted PageRank algorithm that considers citation and co-authorship network topology.
Moreover, Fiala et al. [20] have proposed a modified PageRank algorithm that considers both co-authorship and citation relationships. They also integrated PageRank with a time factor in a subsequent work [21].
Although the PageRank algorithm shows great promise in academic rankings, it has some limitations:
1. PageRank based on author citation relationships may exaggerate an author's research impact to a certain extent. For example, if a less prominent author has co-authored papers with a famous scholar and published three or four highly cited papers, that author will receive a high PageRank score.
2. PageRank based on co-authorship may also not properly reflect an author's research impact. If an author's PageRank score is high, it just means that he or she is widely co-authored. This indicator may reward authors for adding extra names or more famous names to the author list.
To overcome some of the limitations of both statistical-based and graph-based indicators, we propose a new "PR-index," which is a combination of both. In brief, the PR-index is a variant of the h-index that, instead of simply counting citations to the papers, uses the PageRank score of each paper within the citation network. Apart from constructing the citation network for publications and determining their PageRank scores, this does not increase the computational complexity: the index is otherwise as straightforward to determine as the h-index. By replacing the citations with the PageRank score of each paper within the citation network, we obtain an index where both the popularity and the relevance of each particular author's works are properly taken into account.
In the remainder of this paper, we first present a detailed account of our method in the sections on the PR-index. Then, we introduce the main results obtained with the PR-index and compare them with the results obtained with other indices in the Experiments section, and discuss their implications, as well as the predictive power of the PR-index, in the PR-index Sequence section. Finally, we conclude in the Conclusions section.

Motivation of PR-index
A bibliographic information network consists of rich information such as papers, authors and journals. As shown in Fig 1, three types of networks co-exist: the citation network of authors (Fig 1(a)), the co-authorship network of authors (Fig 1(b)), and the citation network of publications (Fig 1(c)). Fig 1 represents these relationships as black dotted lines. Given these relationships, we can define the problem of author evaluation as: How can we assess an author's contribution according to these relationships?
The h-index, as a statistical-based indicator, was suggested by Hirsch [1] as a tool to determine authors' impact. In Hirsch's work, a scientist has index h if h of his or her N_p papers have been cited at least h times each, and the other (N_p − h) papers have been cited no more than h times each. The information taken into consideration by the h-index is the area marked by the red line in Fig 1(c). In Fig 1(c), the h-index of a_3 is 1 because that author has published one paper that has been cited one time. Moreover, the h-index treats all citations equally and does not take the citation quality into consideration.
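As a concrete illustration of Hirsch's definition, the following minimal sketch computes the h-index from a list of per-paper citation counts. The function name and the sample counts are illustrative and do not come from the dataset used in this paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position
        else:
            # Counts are sorted in descending order, so no later paper can qualify.
            break
    return h


# Hypothetical citation counts for one author's five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
# One paper cited once, as for author a_3 in Fig 1(c).
print(h_index([1]))  # 1
```

Because only the counts enter the computation, two citations contribute equally regardless of where they come from, which is the limitation the PR-index is meant to address.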
In addition to the h-index, which takes only statistical information into consideration, PageRank is also applied to author impact assessment. Some authors [17][18][19][20][21] have modified its basic formula and applied it to both author and publication impact assessment. These methods can be grouped as follows:
1. PageRank based on author citation relationship (denoted as PR_AC). As shown in Fig 1(a), when an author is cited by many high-impact authors, that author will achieve a high rank. In Fig 1(a), author a_2 gets the first rank.
2. PageRank based on co-authorship (denoted as PR_CO). As shown in Fig 1(b), the more frequently the author collaborates with high-impact authors, the higher the rank that author will have. Author a_2 gets the highest rank using this method as well.
3. PageRank based on publication citation relationship (denoted as PR_PC). As shown in Fig 1(c), when a paper is cited by many high-impact articles, that paper will receive a higher rank, such as the paper p_2.
To overcome the limitations of the h-index and PageRank discussed in the Introduction, the PR-index extends the h-index by combining it with PageRank. As shown in Fig 1, the PR-index considers not only publication and citation quantity but also a publication's citation network. This means that the PR-index applies PR_PC to distinguish high-quality citations from low-quality ones when ranking authors.

Formulation of PR-index
The main idea behind the PR-index is to calculate an h-index based on each publication's PageRank score rather than on its citations. In some cases, a highly cited publication may not be of high quality. The PageRank score of such a paper is a more reasonable measure because it takes both the popularity and the authority of the paper into consideration. We argue that an author who has a high h-index should have published many high-quality publications rather than many highly cited publications.
The process to create the PR-index consists mainly of calculating PageRank score, transforming PageRank score, and calculating the h-index.
Fig 1. Three kinds of relationship in the academic research field. a_1, a_2, a_3 denote three authors; p_1, p_2, p_3 denote published papers. A black solid line means that an author published a paper. (a) illustrates the author citation relationship. The dashed arrow from a_1 to a_2 denotes that a_1 cited a_2 in a paper. (b) represents co-authorship among authors. There are two directed dashed edges between a_1 and a_2, which means that a_1 has co-authored with a_2. (c) represents the publication citation relationship. A directed dashed edge from p_1 to p_2 means p_1 has cited p_2.
(1) PageRank score calculation
First, we need to determine the PageRank score (PR_PC) for each paper. In the citation network of publications, the score of each paper is computed with the PageRank formula, where N represents the total number of papers, p is one paper and p_i is a paper that cites p. PR_PC(p) and PR_PC(p_i) are the PageRank scores of papers p and p_i, respectively; Cite(p_i) is the number of publications that cite p_i.
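As an illustration of this step, the following minimal sketch computes PageRank scores on a toy publication citation network by power iteration. It is not the authors' implementation: the damping factor of 0.85, the even redistribution of rank from papers that cite nothing, the input format, and the three-paper network are all assumptions made for illustration. The sketch follows the standard PageRank normalization, in which each citing paper p_i splits its score evenly across the papers it cites.

```python
def pagerank(cites, damping=0.85, iterations=100):
    """Power-iteration PageRank on a citation network.

    cites maps each paper to the list of distinct papers it cites
    (a hypothetical input format chosen for this sketch).
    """
    papers = set(cites)
    for refs in cites.values():
        papers.update(refs)
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Rank held by papers with no references is spread evenly over all papers.
        dangling = sum(score[p] for p in papers if not cites.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in papers}
        for p, refs in cites.items():
            for q in refs:
                # Each citing paper passes an equal share of its score to each reference.
                new[q] += damping * score[p] / len(refs)
        score = new
    return score


# Hypothetical network: p1 and p3 both cite p2.
scores = pagerank({"p1": ["p2"], "p3": ["p2"], "p2": []})
print(max(scores, key=scores.get))  # p2
```

In this toy network p2 receives the highest score because it is the only paper that is cited, and the scores sum to 1 by construction.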
(2) PageRank score transformation
First, we obtain a rank queue {p_1, p_2, ..., p_n} by sorting the papers by PR_PC score in descending order.
Second, we calculate the PRCite score of each paper p_h from PR_PC(p_h), PR_PC(p_1) and Cite(p_1), where p_1 is the first-ranked paper, PR_PC(p_1) and PR_PC(p_h) are the PageRank scores of papers p_1 and p_h, respectively, and Cite(p_1) is the citation score of paper p_1. The PRCite score of paper p_1 is equal to Cite(p_1).
Third, we revise the PRCite score so that the PRCite score of the last-ranked paper is 0. The revised score PRCite'(p_h) is defined in terms of PRCite(p_n), the score of the last-ranked paper from the first step.
Finally, for some papers, PRCite'(p_h) is greater than their citation count, so we revise PRCite'(p_h) for all publications accordingly.
The PR-index algorithm is summarized in Table 1.
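The transformation and the final h-index step can be sketched as follows. The exact formulas are assumptions inferred from the description above: a proportional scaling that makes PRCite(p_1) equal Cite(p_1), a shift by PRCite(p_n) so that the last-ranked paper scores 0, and a cap at each paper's own citation count. The function names, PageRank scores and citation counts are hypothetical.

```python
def pr_cite_scores(pr_pc, cite):
    """Map PageRank scores onto a citation-like scale (assumed formulas)."""
    ranked = sorted(pr_pc, key=pr_pc.get, reverse=True)
    top, last = ranked[0], ranked[-1]
    # Assumed step 2: PRCite(p_h) = PR_PC(p_h) / PR_PC(p_1) * Cite(p_1)
    scale = cite[top] / pr_pc[top]
    pr_cite = {p: pr_pc[p] * scale for p in ranked}
    # Assumed step 3: PRCite'(p_h) = PRCite(p_h) - PRCite(p_n)
    shifted = {p: pr_cite[p] - pr_cite[last] for p in ranked}
    # Assumed step 4: cap each score at the paper's own citation count
    return {p: min(shifted[p], cite[p]) for p in ranked}


def pr_index(author_papers, final_scores):
    """h-index computed over transformed scores instead of raw citations."""
    ranked = sorted((final_scores[p] for p in author_papers), reverse=True)
    h = 0
    for position, score in enumerate(ranked, start=1):
        if score >= position:
            h = position
    return h


# Hypothetical PageRank scores and citation counts for four papers.
pr_pc = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
cite = {"a": 8, "b": 1, "c": 3, "d": 0}
final = pr_cite_scores(pr_pc, cite)
print(pr_index(["a", "b", "c"], final))  # 1
```

In this toy example an author of papers a, b and c would have an h-index of 2 from the raw citation counts (8, 3, 1) but a PR-index of 1, because paper c sits at the bottom of the PageRank ordering despite its three citations.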

Experiments
Dataset
Using the Microsoft Academic Search API [22], we extracted publicly available data from Microsoft Academic Search [3] based on the keyword "Data Mining" for the years 1992 to 2011. This dataset contains publication information, including title, authors, publication references, and so on. It contains a total of 32410 publications and 51938 authors.

The Distribution of Each Indicator
Based on the data collected from Microsoft Academic Search [3], the indicators introduced and described in the Method Section (i.e., the number of publications, citations and co-authors, as well as PR_AC, PR_CO, the h-index and PR-index), can be determined and evaluated for consistency and relevance. Fig 2 plots the distribution of different indicators with logarithmic charts. The number of publications, citations, co-authors, the h-index and the PR-index all exhibit a fat tail and may be approximated by a power law.

Correlation Analysis
The scatter diagrams in  quality papers. Meanwhile, an author with many citations may have published only a few articles. This is the main reason why publications or citations alone cannot reflect an author's achievement appropriately.
Co-authors and PR_CO are indicators based on cooperative relationships between authors. As shown in Fig 3, these two indicators have low correlation with other indicators, such as publications and citations. Actually, an author with high output and citations may receive a low ranking based on co-authors or PR_CO.
From the scatter diagrams, h-index and PR-index correlate well with other indicators. An author trying to achieve a higher rank must produce many high-quality papers to gain a better reputation.

Comparison of Different Indicators
This section estimates which indicators objectively reflect authors' impact in the field of scientific research. In bibliometrics, there are no standard indicators for reference. Sidiropoulos et al. [24] and Yan et al. [19] have evaluated award winners' ranking results, reasoning that authors who have won awards should have a higher rank. This work adopts their method and uses the SIGKDD innovation award to evaluate the results of each indicator.
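The award-based evaluation can be sketched as follows: rank all authors by one indicator, then look up where the award winners land. The indicator values, author labels and winner list below are hypothetical, and summarizing the winners' ranks by their mean is an assumption made for illustration rather than a measure used in this paper.

```python
def ranks_by_indicator(scores):
    """Return 1-based ranks, highest score first; tied authors share the best rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {author: ordered.index(value) + 1 for author, value in scores.items()}


def mean_winner_rank(scores, winners):
    """Average rank of the award winners under one indicator (lower is better)."""
    ranks = ranks_by_indicator(scores)
    return sum(ranks[w] for w in winners) / len(winners)


# Hypothetical indicator values for five authors; A and C are the award winners.
indicator = {"A": 25, "B": 11, "C": 11, "D": 7, "E": 3}
print(ranks_by_indicator(indicator))
print(mean_winner_rank(indicator, ["A", "C"]))  # 1.5
```

Repeating this for each indicator gives one summary number per indicator, with a lower value meaning the indicator places the award winners nearer the top.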
In Table 2, the top 20 authors given by different indicators are listed, with the SIGKDD innovation award winners shown in bold text. Clearly, JW Han is the most influential author, and receives the first rank in all indicators. Other authors, such as R Agrawal and UM Fayyad, also achieve high ranks. Meanwhile, authors such as JH Friedman do not place in the top 20.
The ranks of the SIGKDD innovation award winners are presented in Table 3, which shows that citations, PR_AC, h-index and PR-index result in a higher rank for these winners. In contrast, their publications, PR_CO and co-authors rankings are quite low. The following paragraphs discuss the ranking of each indicator in more depth.
(1) Publications
It is well known that ranking authors by the total number of publications has some shortcomings. Such a ranking places lopsided emphasis on authors' output while omitting consideration of the quality of their papers. In Table 3, the rank given by the total number of publications is quite low, which also reflects its unsuitability as an indicator.
(2) Co-authors and PR_CO
Co-authors and PR_CO tend to provide lower author rankings compared to other indicators. As Table 3 shows, some award-winning authors receive a low rank because they have not co-authored with many other authors. However, in reality, these authors have published many high-impact articles.
Currently, PR_CO is recognized as a useful indicator in the area of informetrics [19]. In this work, we argue that PR_CO simply reflects the centrality of an author in a co-author network. By co-authoring with numerous authors (such as adding many authors who have contributed nothing in the author list of a paper), an author can increase their ranking even if most of those co-authors have average rankings. Thus, sometimes, having large numbers of co-authors is not meaningful.
(3) Citations and PR_AC
As Table 3 shows, citations and PR_AC are the indicators which give the highest rank to the award winners. Ding's work has shown that PR_AC is a well-designed indicator and concludes that PR_AC is better than other indicators [17].
However, when we focused on the essence of these two indicators in the original dataset, we found that some authors have an inappropriate rank. Table 4 lists the top 20 authors as ranked by PR_AC, along with their scores on other indicators. The items in bold text are authors who have an inappropriate rank. For example, Y Yin, who has published 7 papers, with 1192 citations and 7 co-authors, gets a rank of 16. This is because Y Yin co-authored with JW Han the paper "Mining frequent patterns without candidate generation", which has acquired 840 citations. Yin is the third author of this paper, yet its high citation count leads directly to Yin's high ranking. So, if PR_AC or citations serve as standard indicators, they may produce misleading results.
(4) h-index and PR-index
When compared with citations and PR_AC, h-index and PR-index both assign a lower rank to the award winners. This also means that authors such as Y Yin will receive a lower rank (h-index rank is 150; PR-index rank is 79). In fact, an author will be ranked highly by h-index and PR-index only if that author has truly published many influential papers. Compared with the h-index, the PR-index assigns a higher rank to awarded authors, because the PR-index is based on publication quality rather than citations. Earlier in this paper, we discussed a theoretical shortcoming of the h-index which may exaggerate the ranking of authors who have published many highly cited reviews. The results here are evidence that high numbers of citations do not necessarily equal high-quality work. The PR-index is based on the PageRank score of publications rather than on citations, which optimizes the ranking results to some degree.
Table 3. Comparison of ranks of authors who are SIGKDD innovation award winners based on different indicators. Boldface marks the minimum value in each row. PR_AC gives the highest ranking to awarded authors. Leo Breiman, who received the SIGKDD innovation award in 2005, is not listed in this table: our dataset was crawled using the keyword "data mining", and Prof. Breiman mainly works on statistics and machine learning, so there are only a few records of Leo Breiman in our dataset.

PR-index Sequence
In Liang's view [25], the h-index does not capture the different evolution mechanisms of individual authors. His paper proposes the h-sequence instead, and discusses the evolution mechanism. This section aims to explore the evolution mechanism of the PR-index for those authors ranked the highest by the PR-index.
Based on the ranking results of PR-index in Table 2, the top 7 authors are selected as an example to illustrate the evolution mechanism of the PR-index sequence. As shown in Fig 4, the PR-index of each author increases over time. In Hirsch's work, it is expected that the h-index will increase linearly. In contrast, Liang believes that there are different types of evolution mechanisms that can affect the growth of the h-index. Therefore, an increase in h-index can follow a linear curve, an "s" curve, or a Lorenz curve. The PR-index also presents these dynamic types. Intuitively, JW Han is the most influential author, and has a PR-index of 25. Han's PR-index rose rapidly from 1995 to 1999, but made slow progress from 2005 to 2011. Moreover, the evolution mechanism of PR-index for SIGKDD innovation award winners (i.e., UM Fayyad and R Agrawal) is different from that of other authors who have not won this award. The PR-indices of Fayyad and Agrawal increased rapidly from 1995 to 1996 due to their important fundamental contributions early in their research careers. Table 5 presents the PR-index sequences, which show that Han's research career is the longest. Most of these authors started their influential research in Data Mining between 1995 and 1997. To compare authors who have the same PR-index, we use their most productive 5 years (MP5), a measure that was introduced in Liang's paper [25]; in its definition, y refers to the year and PR-index_y refers to the PR-index of an author in year y. Table 5 lists the MP5 for each author. Appropriately, JW Han gets the maximum score. Moreover, SJ Stolfo receives the maximum score among the 5 authors with the same PR-index (i.e., PR-index = 11). It can be concluded that Stolfo's efficiency is the greatest among these authors because he has the highest MP5.

Conclusions
In conclusion, this paper proposes a new variant of the h-index, the PR-index, and discusses its features. The PR-index was developed based on both the h-index and PageRank to evaluate an author's impact from an objective point of view. The core idea of the PR-index is that it replaces the h-index's consideration of citations with the PageRank score. Using this modification, both the popularity and the authority of each publication are considered. As shown in our experimental results, the PR-index is more reasonable than other author-impact assessment indicators when taking the SIGKDD innovation award as an evaluation criterion. Moreover, we have used a sequence analysis of the PR-index to explore the evolution mechanism for the top authors by adopting Liang's method [25] and illustrating the PR-index sequence for each author. According to the statistical results, the evolution mechanism of the PR-index sequence varies between authors. In future work, we will try to combine PageRank with other variants of the h-index to eliminate other shortcomings of the h-index. We will also explore the impact evolution of each author or institution and predict their future impact.