Table 1.
Comparison of gender proportions by using SSA data (with a 95% cut-off) versus Genni 2.0, aggregated by ethnicity.
U denotes the percentage of authorships labelled Unknown, %F denotes the percentage of female authorships among male and female authorships, and G = SSA denotes the percentage of male and female SSA predictions that match the Genni predictions.
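As an illustration of how the three quantities in this caption relate, the following minimal sketch computes them from two parallel lists of labels. The function name `gender_summary`, the label codes `"F"`, `"M"`, `"U"`, and the choice to count a Genni Unknown as a mismatch are assumptions made for the example; they are not taken from the study's code.

```python
def gender_summary(ssa, genni):
    """Summarise two equal-length lists of labels: 'F', 'M', or 'U' (unknown)."""

    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else float("nan")

    def unknown_and_female(labels):
        known = [g for g in labels if g in ("F", "M")]
        u = pct(len(labels) - len(known), len(labels))  # U: share labelled Unknown
        f = pct(known.count("F"), len(known))           # %F: female among M and F only
        return u, f

    # G = SSA: among authorships where SSA predicts M or F, the share
    # whose Genni label is identical (assumption: Genni 'U' is a mismatch).
    pairs = [(s, g) for s, g in zip(ssa, genni) if s in ("F", "M")]
    agreement = pct(sum(s == g for s, g in pairs), len(pairs))

    return {
        "ssa": unknown_and_female(ssa),
        "genni": unknown_and_female(genni),
        "agreement": agreement,
    }
```

For example, `gender_summary(["F", "M", "U"], ["F", "M", "F"])` reports one third of SSA labels as Unknown and full agreement on the two authorships that SSA does label.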
Table 2.
Descriptions of all the explanatory features.
Table 3.
Distribution (in percentage) of 41.6 million references (from 1.6 million articles with 2 or more authors published during 2002-2005) across select categorical features.
Fig 1.
Self-citation rates as functions of author age as measured by prior publication count (top panels).
The horizontal lines show the overall self-citation rates. The bottom panels show the cumulative distributions of author age.
Table 4.
Gender effects for selected journals using a simple model with author age (pub count) only.
Table 5.
Models of self-citation behavior of first and last authors based on 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005.
Fig 2.
Change in effect of gender at each model-fitting step.
The sub-figures show the contribution of gender at each step in the iterative process of fitting and evaluating combinations of factors; only the model at the final step is the best-fitting among them. In both models, confounding factors ultimately minimize the effect of gender on self-citation; the most influential of them is the author’s publication count (see Table 6). The Y-axis is on a log scale.
Table 6.
Fit statistics for individual and accretive models of self-citation based on 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005.
The best-performing model at each step is the one with the largest log-likelihood (LL); from step 2 onward, only the highest-ranking models are shown. Each model comprises the predictors of the best-performing models from all previous steps plus the newly added category, indicated by the plus sign (+). AUC (area under the receiver operating characteristic curve), given as a percentage, roughly measures the accuracy of the estimated probabilities. The number of terms in the model, excluding the intercept, is denoted by nf.
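The fit statistics and the step-wise procedure described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `log_likelihood`, `auc_percent`, and `accretive_selection` are hypothetical, the `score` callable stands in for fitting a model and returning its LL, and the loop simply continues until every candidate category has been added.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi else math.log(1.0 - pi) for yi, pi in zip(y, p))


def auc_percent(y, p):
    """AUC as a percentage: how often a positive outranks a negative (ties count half)."""
    pos = [pi for yi, pi in zip(y, p) if yi]
    neg = [pi for yi, pi in zip(y, p) if not yi]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return 100.0 * wins / (len(pos) * len(neg))


def accretive_selection(candidates, score):
    """Greedy forward selection: each step keeps the addition with the largest score.

    `score` is a hypothetical callable mapping a list of predictor
    categories to the log-likelihood of the model fitted on them.
    """
    chosen, remaining, steps = [], list(candidates), []
    while remaining:
        best = max(remaining, key=lambda c: score(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
        steps.append((list(chosen), score(chosen)))
    return steps
```

The pairwise AUC loop is quadratic in the number of observations, which is fine for a toy example but would need a rank-based formulation for data on the scale of millions of references.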
Fig 3.
Change in odds of self-citation, with respect to the mentioned values (in parentheses), for select predictors in the models of first and last authors.
Shaded regions indicate 95% confidence intervals. The Y-axis is on a log scale.
Fig 4.
Change in odds of self-citation, with respect to the mentioned values, for select predictors in the models of first and last authors.
Error bars indicate 95% confidence intervals. Among other interesting points, note that the likelihood of self-citation is lowest for last authors with a non-USA affiliation, implying that self-citing is customary among USA authors. The X-axis is on a log scale.
Table 7.
Comparison of full model (based on all 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005) with filtered models (26.2 million references for first authors, and 27.5 million for last authors).
Table 8.
Percentage of authorships, on the 1.6 million articles with 2 or more authors published during 2002-2005, by authors who (a) started, (b) ended, and (c) both started and ended their career during this period.
Note that career start and end years were determined based on the full 2009 Author-ity dataset.
Fig 5.
Author expertise as a function of prior publication count.
Expertise of an author on a given paper is measured by the proportion of subjects (MeSH; a paper typically has a dozen or so terms) on which the author has previously published. Expertise naturally grows with age but never reaches 100% because authors tend to publish on some topics that are new to them.
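The expertise measure described in this caption can be sketched as a simple set computation. The function name `expertise` and the input layout (one collection of MeSH terms per paper) are assumptions made for illustration only.

```python
def expertise(paper_mesh, prior_papers_mesh):
    """Proportion of a paper's MeSH terms already used on the author's earlier papers.

    paper_mesh: MeSH terms of the paper in question.
    prior_papers_mesh: one collection of MeSH terms per earlier paper by the author.
    """
    terms = set(paper_mesh)
    if not terms:
        return float("nan")  # undefined when the paper has no MeSH terms
    seen = set().union(*prior_papers_mesh)  # every term the author has published on
    return len(terms & seen) / len(terms)
```

An author with no prior papers scores 0 under this definition, and the score stays below 1 whenever the paper carries at least one term that is new to the author.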