Predicting Scholars' Scientific Impact

We tested the underlying assumption that citation counts are reliable predictors of future success, analyzing complete citation data on the careers of scientists. Our results show that i) among all citation indicators, the annual citations at the time of prediction is the best predictor of future citations, ii) future citations of a scientist's published papers can be predicted accurately (R² = 0.80 for a 1-year prediction, P < 0.001), but iii) future citations of future work are hardly predictable.


Introduction
Many decisions with regard to the allocation of research funds and the assignment of positions are based on citation counts [1][2][3][4]. Citation counts are considered for awarding post-doctoral fellowships, assigning junior faculty positions and tenures [5][6][7][8][9][10]. However, it remains unclear whether citation-based indicators are appropriate measures to judge a scientist's future research quality [1,11].
In this study, we analyzed complete panel data on the careers of more than 150,000 scientists. Considering various metrics of research quality, we tested the assumption that citation counts are reliable predictors of future scientific success, as measured by future citations. Recent studies have partially measured the predictive power of several citation indicators for scientists' future citations [1,11,12,13]. However, because of the limited availability of data, these analyses were performed on small populations of scientists, and hence cannot establish with confidence the connection between past and future citations. "There have been few attempts to discover which of the popular citation measures is best and whether any are statistically reliable" and "existing databases such as the ISI can therefore actively help to improve the situation by compiling field-specific homogeneous data sets similar to what we have generated for SPIRES" [1].
We considered a range of bibliometric indicators to assess scientists' research quality. Productivity and impact are the two main dimensions of research quality [14][15][16][17]. Some indicators, such as the number of published papers and the mean annual number of publications, reflect only scientists' productivity. Citation-based indicators, on the other hand, are used to index impact both at the level of single publications [18][19][20] and over individuals' careers (for example a scientist's mean citations per paper, or total number of citations) [5,21,22,23,24]. However, the probability of an article being cited depends on various factors (e.g. time, field, journal, availability of the article, authors' social network) [24][25][26][27].
Hirsch proposed the widely used h index, which combines both productivity and impact [12]. A scientist's h index value is defined as the maximum natural number h for which the scientist has h papers with at least h citations each. This gives a lower bound of h² citations to the scientist. In comparison with the cumulative number of citations, the h index is not critically inflated by a small number of highly cited papers. In the same study, Hirsch defined the m index as a scientist's h index value divided by the time (in years) elapsed since the first publication of the focal scientist [12].
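The two definitions above can be made concrete with a minimal sketch. The function names `h_index` and `m_index` and their argument names are illustrative choices, not part of the original study, and the input is assumed to be a plain list of per-paper citation counts.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank          # the top `rank` papers all have >= rank citations
        else:
            break             # counts are sorted, so no larger h is possible
    return h


def m_index(citations, first_pub_year, current_year):
    """h index divided by the years elapsed since the first publication."""
    years = current_year - first_pub_year
    if years <= 0:
        raise ValueError("career length must be positive")
    return h_index(citations) / years


# Hypothetical scientist with five papers.
counts = [10, 8, 5, 4, 3]
h = h_index(counts)
print(h, sum(counts) >= h * h)      # the h**2 lower bound on total citations
print(m_index(counts, 2000, 2010))
```

Sorting once and scanning in rank order keeps the computation simple; the early `break` is safe because citation counts only decrease along the ranked list.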
The applicability of h to evaluate scientists has been heavily investigated in the literature [12,28,29]. High-profile scientists (e.g. Nobel laureates and members of the National Academy of Sciences) generally score higher h index values. Bornmann and Daniel tested its applicability to junior scientists and showed that the decision of a peer-review committee to award long-term fellowships favored those applicants with higher h index values [30].
A similar citation indicator that combines productivity and impact is the g index [31]. A scientist's g index value is the highest number g of papers that together received g² or more citations. By definition, g ≥ h for every scientist. The g index inherits some good properties of the h index [32], but it takes a very different value from the h index for those who published a few highly cited articles.
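As a rough illustration of how the g index departs from the h index, the sketch below computes both for a scientist with one highly cited paper. The function names are illustrative, and the sketch assumes that g is capped at the number of published papers, which is one common convention and is not stated in the text.

```python
from itertools import accumulate


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def g_index(citations):
    """Largest g such that the top g papers together have >= g**2 citations.

    Assumption: g cannot exceed the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    g = 0
    for rank, total in enumerate(accumulate(ranked), start=1):
        if total >= rank * rank:
            g = rank
    return g


# One highly cited paper lifts g but barely moves h.
skewed = [100, 1, 1]
print(h_index(skewed), g_index(skewed))

# A more even citation record keeps the two indices close.
even = [10, 8, 5, 4, 3]
print(h_index(even), g_index(even))
```

Because the g index works on cumulative citations of the top-ranked papers, a single outlier keeps contributing to every later rank, which is why the two indices diverge for skewed records.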

Results
We extracted citation information on the careers of ~150,000 scientists from the Thomson Reuters Web of Science dataset. The careers comprise about 2 million papers and around 25 million citations of the papers since 1899. The number of papers per decade and the number of starting careers per decade are shown in Fig. 1. We used the publication year, author list and list of references of the papers from the Thomson Reuters Web of Science dataset. Author names appeared as pairs of family name and initials (e.g. "S Genoud"). For some of the more recent journals, full first names of authors were also provided. With our dataset, we therefore faced the name ambiguity problem, i.e. an initial may refer to more than one unique author, and an author may have more than one initial. Name ambiguity is a big hurdle in analyzing individual careers, for which there exists no standard solution [33][34][35]. A method applicable to one dataset may not perform well for another.
In our study, instead of solving the complicated name ambiguity problem, we avoided it by discarding family names that appeared with different initials. For instance, because "A Smith" and "B Smith" both exist in our dataset of more than 124 million initials, we discarded the family name "Smith", whereas the family name "Ambonati" was selected because only one initial, "M", was associated with it. This removes not only frequent family names, but also authors whose initials are spelled in different ways (e.g. "A Smith", "AH Smith" and "HA Smith" may actually refer to the same author).
This procedure resulted in extracting more than one million family names associated with unique initials, for a total of about 3.6+ million entries. Nevertheless, a family name with a unique initial may still refer to at least two authors with different first names (e.g. both Marco Ambonati and Mario Ambonati have the initial M). By analyzing the papers for which full first names were also provided, we estimated the probability of such cases to be 2.5%. There is also a minuscule probability that a family name with a unique initial and a unique first name belongs to at least two different authors. However, estimating this probability is impossible with our current data. We performed our analysis on more than 150,000 scientists whose career length, calculated as the time gap between the first and the last paper, was longer than 5 years. Our results were not sensitive to the minimum career-length selection criterion.
The result of the ambiguity-removal procedure is shown in Fig. 2. The most ambiguous family name ("Wang") appeared in the author lists of about 640,000 papers, and obviously does not refer to a unique author (Fig. 2a). After the removal of ambiguous names, the maximum frequency of a family name with a unique initial is 969, for the name "S Oparil", as shown in Fig. 2b. Moreover, the general statistics of the selected papers, such as the mean number of authors per paper (5.2) or the mean number of references per paper (16.4), remained the same.
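The filtering rule described above can be sketched in a few lines. This is only an illustration under assumed inputs: author entries are taken to be `(initials, family_name)` pairs, and the function name `unambiguous_family_names` is hypothetical.

```python
from collections import defaultdict


def unambiguous_family_names(author_entries):
    """Keep family names that occur with exactly one distinct initials string.

    `author_entries` is an iterable of (initials, family_name) pairs,
    one per author occurrence in a paper's author list.
    Returns a dict mapping each retained family name to its unique initials.
    """
    initials_by_family = defaultdict(set)
    for initials, family in author_entries:
        initials_by_family[family].add(initials)
    return {
        family: next(iter(initials_set))
        for family, initials_set in initials_by_family.items()
        if len(initials_set) == 1
    }


entries = [
    ("A", "Smith"), ("B", "Smith"),        # two initials: discarded
    ("M", "Ambonati"), ("M", "Ambonati"),  # one initial: kept
    ("AH", "Jones"), ("HA", "Jones"),      # spelling variants: discarded
]
print(unambiguous_family_names(entries))
```

The rule is deliberately conservative: it throws away any family name with more than one initials string, including spelling variants of a single author, in exchange for not having to resolve identities.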
At every year y during a scientist's career, our goal is to estimate two quantities: a) the total citations received by her papers published until and including year y, in the k subsequent years [y+1, y+k], and b) the citations of her papers published in the w subsequent years [y+1, y+w], received in the k subsequent years [y+1, y+k]. For w = 1 and k = 2, for example, we estimated citations to the papers published in the year y+1 received in the two years y+1 and y+2. The time of prediction y varies between the publication years of her first and last papers (Fig. 3). Papers published up to and including the time of prediction were treated as past papers and papers published afterwards as future papers. Future citations may refer to both past papers and future papers. Because the information about past citations of past papers is available at the time of prediction, estimating future citations of past papers is easier.
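The two target quantities can be illustrated with a short sketch. The data layout is an assumption made for the example: each paper is a dict with a publication `year` and a `citations_by_year` mapping, and the function name `future_citation_targets` is hypothetical.

```python
def future_citation_targets(papers, y, w, k):
    """Return (citations to past papers, citations to future papers).

    Both totals count citations received in the years y+1 .. y+k.
    Past papers are those published up to and including year y;
    future papers are those published in the years y+1 .. y+w.
    """
    citing_years = range(y + 1, y + k + 1)
    past_total = 0
    future_total = 0
    for paper in papers:
        received = sum(paper["citations_by_year"].get(t, 0) for t in citing_years)
        if paper["year"] <= y:
            past_total += received
        elif paper["year"] <= y + w:
            future_total += received
        # Papers published after y + w fall outside the time window.
    return past_total, future_total


career = [
    {"year": 2000, "citations_by_year": {2001: 2, 2002: 3, 2003: 1}},
    {"year": 2002, "citations_by_year": {2002: 1, 2003: 4}},
    {"year": 2005, "citations_by_year": {2005: 7}},
]
# Prediction point y = 2001 with w = 1 and k = 2.
print(future_citation_targets(career, y=2001, w=1, k=2))
```

In this toy career the 2000 paper is a past paper, the 2002 paper falls inside the one-year publication window, and the 2005 paper is excluded, mirroring the role of paper p4 in the schematic of Fig. 3.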
The information that we used in our model is the value of 10 prominent citation indicators at the time of prediction, namely the number of papers, the total number of citations, the career length, the average number of published papers per year, the average annual citations, the annual citations at the time of prediction, the average citations per paper, the h index, the m index, and the g index.
The prediction points were time-lagged according to w and k. For w ≤ k, we added a prediction point every w+1 years. For w > k the problem reduces to the case w = k, because no paper published after the k-th year receives citations within the first k years. The earliest prediction point was 5 years after the publication year of the first paper. We therefore excluded the scholars with careers shorter than 5 years and the initial years (which may include graduate and PhD studies) of scholars with longer careers. This gave us between ~143,000 (for 10-year predictions of ~104,000 long careers) and ~706,000 (for 1-year predictions of all careers) prediction points.
For example, suppose a scientist's first paper was published in 1990 and her last paper was published in 2003. Due to the nested structure of data (within-person time observations), we used multi-level regression models with random effects at the individual level. We implemented the models in the "STATA" software using the "xtreg" command with the "mle" option. All variables were added in log scale.
More specifically, we estimated for scholar s the citations to a certain subset of her papers (selected by time-window w) in k subsequent years using citation indicators X = {x_k} as

$$c_i = \alpha_{s[i]} + \sum_k \beta_k x_k + \epsilon_i$$

where β_k is the coefficient of citation indicator x_k, α_{s[i]} is the intercept estimated for scholar s, and ε_i is the residual. Note that the intercepts of this model are independently estimated for individual scholars (varying intercept model) and that the number of data points differs across scholars. We then compare how well various sets of citation indicators X can estimate future citations c_i by comparing the explained variance R² of the regression models with the same time horizons as defined by w and k.

Figure 3. A schematic career for a scientist with 4 papers {p_1, p_2, p_3, p_4}. We consider her career from her first paper p_1. At prediction point y, we estimate the citations received in [y+1, y+k] of both past papers (p_1 and p_2), and of future papers published in [y+1, y+w] (p_3). Paper p_4 is a future paper which is not published in time-window w, and is therefore excluded for the time-windows as defined by w and k. doi:10.1371/journal.pone.0049246.g003

Table 1. Explained variance of future citations estimated by the average number of citations per paper N_c/N_p (1st column), the h index (2nd column), the annual citations at the time of prediction C_y (3rd column), and all the 10 indicators (4th column).

To estimate future citations, we considered the effectiveness of 10 prominent citation indicators, namely the number of papers, the total number of citations, the career length, the average number of published papers per year, the average annual citations, the annual citations at the time of prediction, the average citations per paper, the h index, the m index, and the g index. The future citations of past and future papers were estimated with multi-level regression models. We compared for various time horizons, the coefficient of determination between models with different predictors (citation indicators).
For various values of k and w, Table 1 compares how well the average citations per paper (N_c/N_p), the h index, the annual citations C_y received in the year of prediction y, and all the 10 indicators together can predict future citations.
First, we consistently found that the annual citations C_y at the time of prediction y was the best predictor of future citations among the indicators (Table 1), and that including the remaining 9 indicators increased the explained variance only by a small amount. The comparison between C_y as a single predictor and all the 10 indicators (including C_y) is illustrated in Fig. 4. The model parameter values for various values of k and w with the single predictor C_y are shown in Table 2.
Second, for past papers, C_y explained 80% of the variance of future citations in the following year [slope (β) = 0.89, LR χ²(df = 1) = 1.1 × 10^6, P < 0.001]. As its prediction power decayed over longer time horizons, C_y explained 65% of the variance of future citations of past papers for a 10-year prediction (β = 1.31, LR χ²(df = 1) = 146714.5, P < 0.001). When we added the remaining 9 indices, the explained variance increased from 80 to 83% for the 1-year prediction, and from 65 to 74% for the 10-year prediction. For short time horizons (k = 1), the future citations of past papers are much better estimated by C_y than by the h index or the average citations per paper (Table 1 for k = 1 and k = 3).
Third, the explained variance of future citations to future papers was very small in all the considered models. For the longest prediction horizon (w = 10, k = 10), where the citations received in [y+1, y+10] to papers published in the same period are estimated, not more than 26% of the variance was explained even when all the 10 indicators were included (see last row of Table 1). A similarly weak estimation (21% explained variance) was achieved when C_y was the single predictor in our model. Estimating citations for shorter time horizons was generally harder. For the shortest prediction horizon (w = 1, k = 1; third row in Table 1), for example, where the citations to papers published in year y+1 are estimated in the same year, only 10% of the variance was explained when all the 10 citation indicators were added to the model. Likewise, only 9% of the variance was explained by C_y. The other citation indicators perform even worse if used as the single predictor in our regression model.
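To make the reported slopes and explained-variance figures more tangible, the sketch below fits a single-predictor regression in log scale and computes its coefficient of determination. It is a deliberately simplified stand-in: it uses pooled ordinary least squares and ignores the per-scholar random intercepts of the multi-level model, it assumes strictly positive inputs so that logarithms are defined, and the function name `loglog_fit` and the toy data are hypothetical.

```python
import math


def loglog_fit(predictor, outcome):
    """Least-squares fit of log(outcome) on log(predictor).

    Returns (slope, intercept, r_squared). Inputs must be strictly positive.
    This pooled fit has no scholar-level intercepts (a simplification).
    """
    lx = [math.log(v) for v in predictor]
    ly = [math.log(v) for v in outcome]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy data: annual citations at prediction time vs. citations in a later window.
annual_citations = [1, 2, 4, 8, 16, 32]
later_citations = [2, 3, 9, 12, 35, 50]
slope, intercept, r_squared = loglog_fit(annual_citations, later_citations)
print(round(slope, 2), round(r_squared, 2))
```

In this simplified setting, the slope plays the role of the coefficient β of the single predictor, and the returned coefficient of determination corresponds to the share of variance in log future citations that the predictor accounts for.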

Discussion
There is disagreement in the literature over the predictive power of the h index and that of the average number of citations per paper [1,11]. In agreement with Hirsch's study [11], we found that the h index is a better predictor than the average number of citations per paper for the future citations of both published papers and future papers (Table 1). None of the studies, however, assessed C_y, which we found to be the most powerful predictor of future citations. Discipline-wise analysis would require difficult choices in terms of classifying scholars and papers into disciplines. This classification requires extensive technical justifications, and we therefore reserve it for a future paper.
Our results have shown that the existing citation indices do not predict citations of future work well, and hence should not be given significant weight in evaluating academic potential. Including various indicators and testing various prediction time horizons, our results are still in agreement with Hirsch's study: "past performance is not predictive of future performance" [11]. Even combining multiple citation indicators did not significantly improve the prediction: apart from citation indicators, no better predictor of the impact of future work exists.

Acknowledgments
It is a pleasure to acknowledge Thomson Reuters for the use of Web of Science data, and thank Thomas Chadefaux, Michael Mäs, Thomas Grund, Steve Genoud, Karsten Donnay, and George Kampis for helpful comments.

[Table caption fragment: (Model 1 and 2) and future papers (Model 3, 4, 5 and 6) at the time of prediction as estimated by the annual citations at the time of prediction.]