It was recently reported that men self-cite >50% more often than women across a wide variety of disciplines in the bibliographic database JSTOR. Here, we replicate this finding in a sample of 1.6 million papers from Author-ity, a version of PubMed with computationally disambiguated author names. More importantly, we show that the gender effect largely disappears when accounting for prior publication count in a multidimensional statistical model. Gender has the weakest effect on the probability of self-citation among an extensive set of features tested, including byline position, affiliation, ethnicity, collaboration size, time lag, subject-matter novelty, reference/citation counts, publication type, language, and venue. We find that self-citation is the hallmark of productive authors, of any gender, who cite their novel journal publications early and in similar venues, and more often cross citation-barriers such as language and indexing. As a result, papers by authors with short, disrupted, or diverse careers miss out on the initial boost in visibility gained from self-citations. Our data further suggest that this disproportionately affects women because of attrition and not because of disciplinary under-specialization.
Citation: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773
Editor: Niels O. Schiller, Leiden University, NETHERLANDS
Received: September 23, 2016; Accepted: March 14, 2018; Published: September 26, 2018
Copyright: © 2018 Mishra et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Authors are unable to share the full dataset because it includes instances of citations (rows in the dataset) that were licensed from a third party: Thomson Reuters, now Clarivate Analytics. However, the authors have provided all the data (https://doi.org/10.13012/B2IDB-9665377_V1) and source code (https://github.com/napsternxg/PubMed_SelfCitationAnalysis) for analyzing a substantial portion of the full dataset based on citations in PubMedCentral which are freely available via the US National Library of Medicine’s E-Utilities. The authors also provided source code and instructions (https://github.com/napsternxg/PubMed_SelfCitationAnalysis) for recreating and analyzing the full dataset. This process involves the following 5 datasets available from the Illinois Data Bank and 2 third-party datasets:
1. Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
2. Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
3. Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
4. Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
5. Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
6. The 2015 MEDLINE/PubMed Baseline Distribution is available via ftp: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
7. Web of Science citations for MEDLINE records are available for license from Clarivate Analytics: Clovis, Jeffrey (IP&Science): jeff.clovis@Clarivate.com and Beynon, Ann (IP&Science): ann.kushmerick@Clarivate.com.
Funding: This work was made possible in part with funding to VIT from National Institute on Aging grant P01AG039347 (https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490) and the Directorate for Education & Human Resources of the NSF grant 1348742 (http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: Vetle Torvik is a co-inventor of the Author-ity dataset, which is subject to licensing to third-parties (for non-academic or for-profit use) by the University of Illinois, which owns this intellectual property. However, the Author-ity dataset is provided free-of-charge for non-profit academic research. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.
Citing one’s own work is common practice, an essential part of scientific communication that reflects the accumulative nature of research [2, 3], but it can be viewed negatively and is commonly discounted or penalized in impact metrics [4, 5]. The bibliometrics literature is rich in barriers and motivations for citation preferences [6–8], many of which overlap for self-citations and vary, e.g., by collaboration size. Yet, what encourages authors to self-cite is not well understood. A recent study in particular [11–13] has gained high-profile attention [14–17] in reporting that men self-cite >50% more often than women, an effect that is consistent across a wide variety of disciplines and has grown over time, reaching its peak of 70% in recent years. These effects are indeed jaw-dropping, align with gender stereotypes [18, 19], and imply that women may be severely disadvantaged because self-citations are not just additive but also attract other citations. King et al. offer a list of five possible explanations and “…consider it likely that several may contribute to the gender gap…” but are unable to test any of them for JSTOR. However, they confirm that the fourth explanation on their list (“Men publish more papers, particularly earlier in their career, and therefore have more to cite.”) explains the gender gap in its entirety for the Social Science Research Network (SSRN) pre-print repository, with the caveat that SSRN authors “…are not representative of academics, more generally.”
It is well known that gender is often subject to confounding, as in the pay gap [21, 22] and the productivity gap, and that gender distributions in science vary dramatically across subject matter, geography, career length, position and productivity, and time [27, 28]. The effect of gender on self-citation could become negligible or even reverse after accounting for confounding factors. The reversal of an effect would be evidence of Simpson’s paradox [29, 30], of which the study of gender bias in UC Berkeley’s graduate admissions serves as a classic example. A case study of archeology found that author age has a strong effect on self-citation while the effect of gender is weak, in a carefully crafted sample of 285 articles analyzed using linear regression. Furthermore, the methods for citation parsing, identifying self-citations, and imputing gender are nontrivial and introduce errors, if not bias, into the overall analysis. For example, King et al.’s analysis is US-centric (e.g., all English and Italian Andrea’s are labeled female while all Shubhanshu’s and Vetle’s are excluded), and Larivière et al. reported highly asymmetric error rates (≈ 13% errors for their female labels versus ≈ 1% for male labels) in an effort to cover names worldwide.
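The kind of reversal described above can be illustrated with a small numerical sketch. All counts below are hypothetical, invented only to show the mechanism, and are not drawn from this study or from any cited dataset. In this toy example women self-cite slightly more than men within each career stage, yet the pooled rate is far higher for men because most of the men's references come from the senior stratum.

```python
# Hypothetical counts, invented for illustration only:
# (self_citations, total_references) per gender and career stage.
counts = {
    "men":   {"junior": (20, 1000),  "senior": (900, 9000)},
    "women": {"junior": (189, 9000), "senior": (105, 1000)},
}

def rate(self_cites, total):
    """Self-citation rate as a fraction of all references."""
    return self_cites / total

def overall_rate(strata):
    """Pooled rate across strata, ignoring the stratifying variable."""
    self_cites = sum(s for s, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return self_cites / total

# Rates within each career stage (the "adjusted" view).
stratum_rates = {
    gender: {stage: rate(*pair) for stage, pair in strata.items()}
    for gender, strata in counts.items()
}
# Rates after pooling career stages (the "unadjusted" view).
overall = {gender: overall_rate(strata) for gender, strata in counts.items()}

for gender in counts:
    print(gender, stratum_rates[gender], round(overall[gender], 4))
```

The pooled comparison and the within-stratum comparison point in opposite directions, which is why an unadjusted gender gap cannot by itself be read as a behavioral difference.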
Here, we try to replicate this gender effect in PubMed, covering biomedicine broadly, while accounting for potential confounds and assessing the robustness of recently developed techniques for worldwide gender imputation [27, 34, 35]. We present a probabilistic model of self-citation based on over 1.6 million papers with 2 or more authors in PubMed during 2002-2005, with nearly 41.6 million citations of which 5.5 million (13.2%) are self-citations by at least one of the authors. The model captures the extent to which gender influences self-citations while controlling for other features of the papers and their disambiguated authors [36, 37].
Materials and methods
Two citation datasets with imputed gender: First and last authors
The contributions of each author on a multi-author paper systematically vary with their byline positions. Typically the first-listed author has less experience and does the most work, while the last-listed author has more experience and acts in a supervisory role, particularly in biomedicine. In order to account for this variability, we created two separate datasets: one for all first-listed authorships and one for all last-listed authorships. Each dataset is based on the 1.6 million PubMed papers with 2 or more authors published during 2002-2005 and with one or more references extracted from PubMed Central, Thomson Reuters’ Web of Science MEDLINE version, or Microsoft Academic Graph. The temporal restriction greatly improves gender imputation and author name disambiguation because the National Library of Medicine (NLM) started recording authors’ first names in MEDLINE in 2002. Overall, each dataset contains 41.6 million citation instances, of which 2.0 million (4.9%) are self-citations by the first author and 3.6 million (8.6%) are self-citations by the last author. There are 824 thousand unique first authors (with an average of 1.9 papers/author) and 539 thousand unique last authors (with an average of 3.0 papers/author) according to the Author-ity 2009 dataset, which has an overall disambiguation accuracy of 98% [36–38]. Disambiguation helps to distinguish self-citations from other citations (even when an author has published under variant names, and when different authors with the same name cite each other), to characterize each author’s publication history, and to improve the coverage of gender imputations, because an author’s first name need only be present on one of their papers. This paper focuses on first and last authors but includes a partial analysis of a third, slightly smaller dataset for middle authors, represented by all second-listed authors on the subset of papers with three or more authors, to confirm that the overall results do not differ significantly.
Gender is imputed as Male, Female, or Unknown using Genni 2.0 + Ethnea [27, 34, 35] which covers names worldwide and is ethnicity-sensitive. That is, it can make accurate predictions even for names that are rare in the USA but common elsewhere (such as Shubhanshu and Vetle) and it makes use of Ethnea’s ethnicity prediction to resolve genders for some names that are unisex worldwide but gender-specific regionally (e.g., English Andrea’s are labeled female and Italian Andrea’s are labeled male). Table 1 shows the improved coverage and potential bias-reduction compared to US Social Security Administration (SSA) data based gender assignment across a variety of ethnicities. Only the top 13 most common ethnicities (and UNKNOWN) are included while the remaining ethnicities were pooled and labeled OTHER. Overall, Genni and SSA predictions rarely disagree except for French and Italian names, where they disagree on more than 3% of authorships. As expected, Genni has higher coverage (e.g., 88.6% vs. 74.6% of all last authorships), more so for non-English names such as Nordic (96.3% vs. 65.7%) or Indian (68.3% vs. 45.0%) and less so for English names (95.4% vs. 91.7%). The lack of overall coverage is largely due to a large number of Chinese names that are difficult to classify (67.2% of 150k Chinese first authorships are labeled Unknown). Genni provides a slightly lower estimate of the overall Female proportion (33.0% of first authorships and 19.2% of last authorships) compared to SSA’s (34.8% of first authorships and 20.7% of last authorships). However, these proportions vary dramatically across byline position and ethnicities (ranging from 6.1% Japanese Female last authorships to 44.5% Slav Female first authorships). Taken together, this suggests that ethnicity and byline position are important covariates of gender in bibliometric studies broadly. 
In the spirit of data sharing and encouraging reproducibility, the Author-ity Exporter web interface was created to permit any user to search, browse, and export data from the annotated authors and papers in the Author-ity dataset. Furthermore, the subset of data based on PubMed Central is also shared.
U denotes the percentage of authorships labelled Unknown, %F denotes the percentage of female authorships among male and female authorships, and G = SSA denotes the percentage of male and female SSA predictions that match the Genni predictions.
Explanatory features modeled
Each observation (instance) in the datasets captures features about a particular citation and the paper in which it appears. More concretely, features include aspects of (a) a given paper, (b) the paper it cites, (c) similarity or nearness between the paper and the cited paper, and (d) the first or last author, namely professional age, gender, ethnicity, and country of affiliation (inferred using MapAffil [41, 42]). The features used for modeling are listed in Table 2. The distribution of instances in the two datasets for a select group of categorical features is shown in Table 3.
Several of these features capture known motivations for self-citation narrowly, and citation broadly: (a) prior citations: authors tend to cite papers that have previously received citations; (b) time: one cannot (typically) cite papers that do not yet exist, and self-citations might appear sooner; (c) publication count: an author cannot self-cite if they don’t have any published (or working) papers; (d) language: one is less likely to cite papers that one cannot understand; (e) disciplinary barriers and topical diversity (encoded by indicating whether the article and its reference were published in the same or similar journal, as captured by the exact name match and the implicit journal score, which is similar to author odds ratio): academic careers often depend on intra- vs. inter-disciplinary citations, and scientists who jump from one topic to another are less likely to cite themselves; (f) accessibility and discoverability: what one cites may be limited by how easy it is to find and obtain physically; (g) publication type: one may cite writing, not necessarily research; literature reviews tend to be cited more often; (h) novelty or topical narrowness: articles on young topics tend to be cited more often; (i) collaboration size: as the number of co-authors increases, the individual opportunity for self-citation may decrease.
Simple characterizations of age-normalized self-citation rates
Before presenting the full model, we start with simple characterizations of self-citation as a function of author age and gender using focused subsets of the datasets. First, the data are restricted to papers with 10 to 60 references, which will focus the characterization on primary research articles and exclude papers with unusual citation patterns such as short comments and letters, as well as long review papers. Second, the data are restricted to instances where the gender is known (labeled Male or Female). This leaves 26.2 million first author references, and 27.5 million last author references.
The bottom panels in Fig 1 show that women tend to be younger (as measured by prior publication count) both as first-listed and last-listed authors, and thus have fewer opportunities for self-citation. The top panels in Fig 1 show the overall relationships between self-citation rate and author age where the self-citation rate is modeled as a logistic regression function of gender and author age (pub count). The horizontal lines show that overall self-citation rates differ dramatically between women and men. Men self-cite 46% more often than women as first authors (5.79% vs. 3.95%), and 27% more often as last authors (9.93% vs. 7.83%). However, the logistic regression curves that account for authors’ prior publication counts are nearly identical. It appears that the age-normalized self-citation rate among men is about 1.9% (SE = 0.2%) higher as first authors, but 2.1% (SE = 0.2%) lower as last authors. However, these differences are within the error rates of the techniques used to disambiguate author names and predict gender, so we cannot say for sure that these differences are real. This preliminary characterization reveals that there are more important factors than gender that govern self-citation.
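The unadjusted gaps quoted above are simple ratios of the overall self-citation rates. The short sketch below recomputes them from the rounded rates given in the text; the function name is illustrative, and small departures from the quoted percentages are presumably due to rounding of the published rates.

```python
def relative_excess(rate_a, rate_b):
    """Percent by which rate_a exceeds rate_b."""
    return (rate_a / rate_b - 1.0) * 100.0

# Overall self-citation rates (in percent) quoted in the text: men vs. women.
first_author = relative_excess(5.79, 3.95)
last_author = relative_excess(9.93, 7.83)

print(f"first authors: {first_author:.1f}% more, last authors: {last_author:.1f}% more")
```

These ratios describe the raw gap only; they say nothing about how much of it remains once prior publication count is held fixed.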
The horizontal lines show the overall self-citation rates. The bottom panels show the cumulative distributions of author age.
S1 Fig shows that the gender effect for middle authors, as represented by the second author on papers with three or more authors, is similar to that of last authors. The age-normalized self-citation rate among men is about 2.2% (SE = 0.2%) lower than among women. Middle authors self-cite less overall (4.67% for men and 3.32% for women) compared to first and last authors. However, this effect can be explained, at least partly, by the fact that the first and last author datasets include two-author papers, where authors tend to self-cite more. Excluding the two-author papers reduced the dataset from 1.6 to 1.3 million papers and from 41.6 to 34.0 million references.
For a secondary part of this simple characterization, the most prolific journals were selected (each with at least 20,000 references). These journals capture variations within and between three broad subject areas: general science, biology, and medicine, as shown in Table 4. In general, the journal-subsetting yields bigger effects but much less statistical significance. It should also be noted that the sample sizes appear large because each reference is counted separately. The total number of articles is much smaller than the total number of references (a typical paper has about 30 references), and the number of unique female (or male) authors of a given age is a fraction of the number of articles. In other words, the degrees of freedom are inflated and the reported p-values are likely underestimated. The high variability between observed proportions and model expectations is illustrated in S2 Fig (science journals), S3 Fig (biology journals), and S4 Fig (medicine journals). After adjusting for age, men tend to self-cite more as first authors but less as last authors in the majority of the journals, a finding that is consistent with Fig 1. The majority of individual journal effects are statistically insignificant (with p-value > 0.05), particularly among medical journals. After adjusting for age, there is one journal where men self-cite more as both first and last authors (Biochem J), and one journal (Cancer Res) where women self-cite more as both first and last authors. These effects may be the result of additional confounding factors that are not considered in this preliminary analysis but are addressed next.
A probabilistic model of self-citation
The observed citations were modeled using logistic regression, $\log\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$ with $p = \Pr(y = 1 \mid x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ denote the explanatory features described in Table 2 and the binary outcome $y$ indicates whether or not an observation is a self-citation. The use of logistic regression as the modeling framework is justified by the fits shown in S5 Fig, particularly the linearity assumption. Local intercepts, indicators, and polynomials were introduced to capture non-linearities and facilitate a closer fit to the data. The complete models of self-citation for first authors and last authors are detailed in Table 5. Although the model is inadequate for predicting whether any given citation was a self-citation or not, given its low precision and recall (S6 Fig), prediction was not the aim of this study. Rather, the purpose is to measure the degree to which each of the encoded features influences self-citations, as captured by the effects (or weights) listed in this table. Note that one of the columns includes simple models of individual categories for first authors, to be compared to the full first author model so that the degree of confounding can be assessed.
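As a rough illustration of how a logistic model with an indicator and polynomial terms turns features into a self-citation probability, consider the sketch below. The encoding of publication count, the weights, and the intercept are all hypothetical choices made for this example; they are not the encoding or the fitted values reported in Table 5.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def encode_pub_count(pub_count):
    """Hypothetical encoding: a zero-count indicator plus a quadratic in log(1 + count)."""
    log_count = math.log1p(pub_count)
    return {
        "no_prior_papers": 1.0 if pub_count == 0 else 0.0,
        "log_pub_count": log_count,
        "log_pub_count_sq": log_count ** 2,
    }

# Hypothetical weights and intercept, NOT the fitted values from Table 5.
WEIGHTS = {"no_prior_papers": -1.5, "log_pub_count": 0.9, "log_pub_count_sq": -0.05}
INTERCEPT = -4.0

def self_citation_probability(pub_count):
    """Probability that a reference is a self-citation under the toy model."""
    features = encode_pub_count(pub_count)
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return sigmoid(z)

for n in (0, 1, 10, 100):
    print(n, round(self_citation_probability(n), 4))
```

The indicator gives authors with no prior papers their own local intercept, while the squared term lets the effect of additional papers flatten out, which is the general purpose such terms serve in a model of this kind.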
A few additional aspects of the data and its encoding as well as model specification and model interpretation deserve mention:
- Because language, ethnicity, and affiliation are tightly correlated, given the multi-ethnic composition of modern societies, our approach to language is simplistic. For example, we test only whether the language of the citing or cited paper is English, including translations. Thus, the results of this study cannot be the basis for a claim that one ethnicity self-cites more than another, although the results may offer clues about culture categorically. Among other features, journal similarity captures author links between journals; the higher the score, the more likely self-citation, reflecting a broad rather than narrow pattern of citation.
- The novelty of an article or reference is determined by the relative frequency of its associated MeSH terms [44, 45]. The novelty scores score and ref. score follow a similar encoding, where lower scores indicate more novel papers.
- Due to the indexing policies of the National Library of Medicine (NLM) about PubMed for the period considered, only the first author’s affiliation is retained and, for this study, assigned to the last author as well. Thus, for the model of last authors, affiliation (country) and ethnicity, which is dependent to a degree on affiliation, are less accurate and not as tightly coupled with each other and with language as for the model of first authors.
A gender effect does exist, but its magnitude and relative importance depend on what other factors a model includes (see Fig 2). For first authors, when considering gender only, males self-cite ≈ 54% more often than females, a percentage congruent with the findings of [11, 51]. However, when controlling for other factors, the difference becomes negligible (Table 5; exp(0.451 − 0.021) = 1.54 vs. exp(−0.04 + 0.03) = 0.99). Sign flips between the individual model of gender and the complete model provide evidence of extreme confounding (Simpson’s paradox), which means that conclusions drawn from analysis of the individual model are unreliable [29, 30].
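The two odds ratios quoted above can be checked directly. The sketch below reads each expression as the male coefficient minus the female coefficient, both taken relative to the same reference category; that reading, and the function name, are assumptions made for illustration.

```python
import math

def male_to_female_odds_ratio(male_coef, female_coef):
    """Odds ratio of male vs. female from two log-odds coefficients
    that share the same reference category."""
    return math.exp(male_coef - female_coef)

# Coefficients as they appear in the quoted expressions.
gender_only = male_to_female_odds_ratio(0.451, 0.021)   # gender-only model
full_model = male_to_female_odds_ratio(-0.04, -0.03)    # complete model

print(round(gender_only, 2), round(full_model, 2))
```

An odds ratio above 1 means higher odds of self-citation for men, and a value below 1 means higher odds for women, so the move from the first value to the second is the sign flip on the log-odds scale.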
The sub-figures show the contribution of gender at each step in the iterative process of fitting and evaluating combinations of factors; only the model at the final step is the best-fitting among them. In both models, confounding factors ultimately minimize the effect of gender in self-citation; the most influential of them is author’s publication count (note Table 6). Y-axis is on log scale.
In the presence of confounding factors, the models show that gender has little effect on self-citation after all. Gender is the least explanatory of all the features modeled for both first and last authors when factors are ranked by contribution (Table 6). For first authors, a plot of the self-citation odds ratio of male and female gender with respect to the unknown gender shows that the known gender’s influence effectively disappears after inclusion of (a) prior citation count of the reference before being cited by the paper, (b) author’s publication count before the source paper’s publication year, and (c) publication type; for last authors, time lag (the number of years between the source and the cited paper). Factoring in author’s publication count results in the largest reduction of the gender effect (Fig 2). Moreover, the relationship between male and female self-citing behavior is effectively identical when affiliation is added for first authors (step 7) and when journals is added for last authors (step 4); the relationship reverses and diverges thereafter, suggesting that females self-cite more often than males when all else is equal. However, the effect is negligible and probably below the level detectable using our methods.
The best-performing model at each step is the one with the largest log-likelihood (LL); only the highest-ranking of which are shown in steps 2 and following. Models comprise the predictors from the best-performing models in all previous steps along with the newly added category indicated by the plus sign (+). AUC (Area Under the receiver operating characteristic Curve), given as a percentage, roughly measures the accuracy of estimated probabilities. The number of terms in the model is denoted by nf, excluding intercept.
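The stepwise procedure described in this caption resembles greedy forward selection: at each step, every remaining feature category is tried in turn and the one giving the largest log-likelihood is kept. The following is a minimal sketch of that general idea, not the authors' code. The `log_likelihood` callable stands in for fitting a model, and the category names and gains are hypothetical values invented for the example.

```python
def forward_selection(categories, log_likelihood):
    """Greedy forward selection of feature categories.

    categories: iterable of category names.
    log_likelihood: callable taking a tuple of selected names and
        returning the log-likelihood of a model fitted on them.
    Returns a list of (added_category, log_likelihood), one per step.
    """
    selected = []
    remaining = list(categories)
    steps = []
    while remaining:
        # Score every candidate model that adds one more category.
        scored = [(log_likelihood(tuple(selected + [c])), c) for c in remaining]
        best_ll, best = max(scored)
        selected.append(best)
        remaining.remove(best)
        steps.append((best, best_ll))
    return steps

# Hypothetical stand-in for model fitting: each category adds a fixed gain.
HYPOTHETICAL_GAIN = {"pub_count": 500.0, "prior_citations": 300.0, "gender": 5.0}

def toy_log_likelihood(selected):
    return -10000.0 + sum(HYPOTHETICAL_GAIN[name] for name in selected)

steps = forward_selection(HYPOTHETICAL_GAIN, toy_log_likelihood)
for name, ll in steps:
    print(f"+ {name}: LL = {ll}")
```

With real data the gains are not additive, because correlated features share explanatory power, so the order in which categories enter depends on what has already been selected.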
Factors affecting self-citation
The features that explain the most self-citation in both first author and last author models have more to do with opportunity, accessibility, and visibility than gender and culture (ethnicity, language, affiliation). The former aspects are illustrated in Fig 3, while the latter are illustrated in Fig 4, using related coefficients from the complete models detailed in Table 5.
Shaded regions indicate 95% confidence intervals. Y-axis is on log scale.
Error bars indicate 95% confidence intervals. Among other interesting points, note that the likelihood of self-citation is least for last authors with non-USA affiliation, implying that self-citing is customary among USA authors. X-axis is on log scale.
Opportunity is a major factor driving self-citation. An author cannot legitimately self-cite without having produced work to cite, and a paper on a novel topic will have fewer papers to reference. The choice among citable papers is bounded. Thus, the more self-authored papers one has available, the more opportunity one has for self-citation. The model captures this effect. Going from none to one prior paper increases the odds of self-citation 4-fold, while the increase is 5-fold going from 1 to 10 prior papers (Fig 3 and Table 5). Note that it is possible to self-cite with no prior papers, e.g., when the cited paper is published simultaneously or in the same year. Although rare, some cited papers are even published in future years, e.g., due to extensive journal backlogs or review processes. When a paper has few references, each is more likely to be a self-citation compared to those in papers with a normal or high number of references. Furthermore, the odds of self-citation are highest for first and last authors on 2-author papers, suggesting some limits on self-citing in the presence of more authors on the source paper.
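Odds multipliers such as the 4-fold and 5-fold increases above do not translate one-to-one into probabilities. The sketch below shows the conversion for an assumed baseline: the 1% starting probability is a hypothetical value chosen for illustration, not an estimate from the model.

```python
def apply_odds_ratio(probability, odds_ratio):
    """Return the probability obtained after multiplying the odds by odds_ratio."""
    odds = probability / (1.0 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

BASELINE = 0.01  # hypothetical probability for an author with no prior papers

one_prior = apply_odds_ratio(BASELINE, 4.0)    # none -> 1 prior paper
ten_prior = apply_odds_ratio(one_prior, 5.0)   # 1 -> 10 prior papers

print(round(one_prior, 4), round(ten_prior, 4))
```

Because odds ratios multiply, the two steps together correspond to a 20-fold increase in odds, while the resulting probability grows by less than 20-fold as it moves away from zero.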
Accessibility and visibility are surely driving forces of citations generally, and as such, might counteract self-citations. After all, our model is designed to distinguish self-citations from citations by others. A paper of the journal publication type has 40% higher odds of self-citation than a review, while a reference to a journal article is 80% more likely to be a self-citation than a reference to a review article (see Fig 4). One is not likely to cite work that one cannot understand (due to language or discipline) or that is otherwise inaccessible (due to indexing issues or economic barriers such as pay-walls or firewalls). Culture and language both act as barriers in this regard, limiting what can be cited by a particular audience. Language also reflects social norms, which are much more difficult to discern but tightly correlated with ethnicity and geography. A non-English reference has 20% higher odds of being a self-citation, while an English paper has 90% higher odds of including a self-citation. The odds of self-citation for a first author in Japan are 30% higher than for one in the USA, while the odds for a first author in Italy are 10% lower. Similarly, a first author with a Japanese name has 10% lower odds of self-citation compared to an author with an English name, while an author with an Italian name has 6% lower odds. Authors with Nordic names self-cite more than those of any other ethnicity, regardless of gender, a behavior that would contravene Janteloven if construed as self-promotion or an attempt to “game the system”; but perhaps language is a barrier for the Nordic too.
Citations increase the visibility of a paper and presumably the visibility of its author(s). This is self-evident, and so the result that prior citations of the reference is the most influential feature in the first author model is unsurprising. For first authors, a reference paper published in the same year as the paper citing it increases the odds of self-citation by 70%; for last authors, the same publication year doubles the odds. The effect of a multi-year difference is strongly tempered, meaning only recent references are more likely to be self-citations. What makes these particular results interesting is that, for both first and last authors, references with any prior citations decrease the odds of self-citation, more for first authors than last authors. Thus, for both first and last authors, the odds of self-citation increase when the reference is recent and has few if any citations. This corroborates the finding of Aksnes et al., which states that most of the early citations of an article are self-citations, and might point towards an opportunistic tendency of authors to bootstrap the initial citations of their prior work through self-citations. The effect of gender is reduced when accounting for these additional factors, as evident from the dramatic change in odds of self-citation (Fig 2). This probabilistic model suggests that self-citation is more about opportunity, accessibility, and visibility than culture or gender. These results in the self-citation model vary only slightly when the analysis is conducted on a filtered data set where the country of affiliation is USA, ethnicity is ENGLISH, source language is ENGLISH, number of references is at least 10 (and at most 100 for first authors and 50 for last authors), and source publication type is JOURNAL. The coefficient for female gender remains the same, while the coefficient for male authors increases slightly for both first and last authors while becoming non-significant (p > 0.05). This indicates that the results are robust under the consideration of the largest portion of the data set (see Table 7).
The takeaways from the present study are threefold. First, models that lack sufficient controls jeopardize the conclusions drawn from them, potentially with adverse effects on public perception and public policy. Human social behavior is complex and, thus, unlikely to be explained adequately without a diverse set of controls. The concern becomes clear when we consider the dramatic change in gender’s effect in this study with the introduction of confounding factors. MacRoberts et al. asserted the following: “Today, in spite of an overwhelming body of evidence to the contrary, citation analysts continue to accept the traditional view of science as a privileged enterprise free of cultural bias and self-interest and accordingly continue to treat citations as if they were culture free measures.” If cultural bias and self-interest influence self-citations, they are perhaps more charitably regarded as aspects of opportunity, accessibility, and visibility. First and last author positions on a byline enhance the visibility of the authors who occupy them, even if the position itself indicates nothing about the relative contribution of the author. However, if we assume that authors in the most prominent positions on a byline have more influence over citations in the manuscript, we should not have difficulty imagining that particular positions offer increased opportunity for self-citation. From the results herein, the opportunities likely depend on a reference’s accessibility (physical, economic) as well as the degree to which the citation enhances the cited work’s visibility (in terms of the work’s topical relevance, despite specialization, or the author’s prominence).
Second, self-citation is the hallmark of highly productive authors, of any gender, who cite their novel journal publications early in similar venues. As a result, papers by authors with short, disrupted, or diverse careers lack the initial boost in visibility gained from self-citations. This disproportionately affects women, who tend to leave (and enter) science at higher rates than men (see Table 8) and have different career trajectories for a variety of reasons. Prior work has found that men specialize more than women and therefore can publish more. However, in our dataset we find that men and women have the same age-normalized expertise, on average (Fig 5). In other words, attrition is the most likely driver of overall differences in self-citation rates, not topical specialization or citation strategies. Reducing attrition and achieving a more gender-balanced scientific workforce is essential to improving scientific progress. It should help increase self-citation, a signal that more research is followed up by the scientists best situated to do so.
Note that career start and end years were determined based on the full 2009 Author-ity dataset.
Expertise of an author on a given paper is measured by the proportion of subjects (MeSH; a paper typically has a dozen or so terms) on which the author has previously published. Expertise naturally grows with age but never reaches 100% because authors tend to publish on some topics that are new to them.
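The expertise measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example MeSH terms, and the choice to return None for a paper without MeSH terms are all assumptions made here.

```python
def expertise(paper_mesh_terms, prior_mesh_terms):
    """Proportion of a paper's MeSH terms on which the author has previously published."""
    paper_terms = set(paper_mesh_terms)
    if not paper_terms:
        # Assumption: treat expertise as undefined when the paper has no MeSH terms.
        return None
    return len(paper_terms & set(prior_mesh_terms)) / len(paper_terms)


def expertise_over_career(papers):
    """Expertise on each paper, given an author's papers in chronological order.

    `papers` is a list of MeSH-term collections, earliest paper first.
    """
    seen = set()  # MeSH terms from the author's earlier papers
    scores = []
    for terms in papers:
        scores.append(expertise(terms, seen))
        seen.update(terms)
    return scores


# Hypothetical career of three papers with made-up MeSH term sets.
career = [
    {"Humans", "Neoplasms"},
    {"Humans", "Neoplasms", "Mice"},
    {"Humans", "Genomics"},
]
print(expertise_over_career(career))
```

In this toy career the first paper has expertise 0 because no terms have been seen before, and later papers score below 1 whenever they introduce a term that is new to the author, which matches the observation that expertise grows with age without reaching 100%.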
Third, the scientometric practice of discounting self-citations has the adverse effects of penalizing so-called lines of research and reducing the impact of some important papers already suffering from lower discoverability and accessibility because of non-English language, lack of bibliographic indexing, and so on. Authors most likely to be affected by such a penalty are perhaps the most dependent on sponsorship and patronage in science [58, 59]. Citations reflect potentially many different authorial attitudes: to credit the source of inspiration; to aid the understanding of the reader; to assert authority in a field; but self-citations acknowledge an individual’s line of research. One might worry about “salami-slicing” and disciplinary “siloing” but one should also hope that scholars conduct research in the way that is effective for them individually. Tempering labels like being “your own favorite expert” would be a good start. After all, a persistent lack of (self-)citations likely reflects dead-ends and orphans not even nurtured by the scholars with a vested interest.
S1 Fig. Self-citation rates as functions of author age as measured by prior publication count (top panels) for the second-listed authors on papers with three or more authors.
The horizontal lines show the overall self-citation rates. The bottom panels show the cumulative distributions of author age.
S2 Fig. Self-citation rates as functions of author age for journals from science category.
The horizontal lines show the overall self-citation rates.
S3 Fig. Self-citation rates as functions of author age for journals from biology category.
The horizontal lines show the overall self-citation rates.
S4 Fig. Self-citation rates as functions of author age for journals from medicine category.
The horizontal lines show the overall self-citation rates.
S5 Fig. Plots of empirical vs. fitted values for select predictors from the complete models for (a) first and (b) last author data.
For each plot, empirical data are represented by bubbles, the sizes of which are proportional to the number of data points each contains; bubble color reflects the number of actual self-citations, as denoted in the accompanying legend. Red lines show the fit for a predictor given all terms in the complete model. The alignment of bubbles and lines provides evidence that the chosen modeling framework (logistic regression, a linear model) is appropriate.
S6 Fig. Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curves for complete first and last author models.
A model that fit the data perfectly would hug the upper left and upper right corners of the ROC and PRC plots, respectively. A model no better than random would hug the thin gray diagonal line.
Research reported in this publication was supported in part by the National Institute on Aging of the NIH (Award Number P01AG039347) and the Directorate for Education & Human Resources of the NSF (Award Number 1348742). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF. The authors would also like to thank Daniel Maliniak, Jevin West, and other anonymous reviewers who gave valuable feedback on the earlier drafts of this work.
- 1. Glänzel W, Thijs B, Schlemmer B. A bibliometric approach to the role of author self-citations in scientific communication. Scientometrics. 2004;59(1):63–77.
- 2. Aksnes DW. A macro study of self-citation. Scientometrics. 2003;56(2):235–246.
- 3. Costas R, van Leeuwen TN, Bordons M. Self-citations at the meso and individual levels: Effects of different calculation methods. Scientometrics. 2010;82(3):517–537. pmid:20234766
- 4. Hirsch JE. An index to quantify an individual’s scientific research output. Proc Natl Acad Sci U S A. 2005;102(46).
- 5. Ioannidis JPA. A generalized view of self-citation: Direct, co-author, collaborative, and coercive induced self-citation. Journal of Psychosomatic Research. 2015;78(1):7–11. pmid:25466321
- 6. Smith LC. Citation analysis. Library Trends. 1981;30(1):83–106.
- 7. Garfield E. Citation indexing: Its theory and application in science, technology, and humanities. vol. 8. Wiley New York; 1979.
- 8. Tahamtan I, Afshar AS, Ahamdzadeh K. Factors affecting number of citations: A comprehensive review of the literature. Scientometrics. 2016;107(3):1195–1225.
- 9. Hartley J. To cite or not to cite: Author self-citations and the impact factor. Scientometrics. 2012;92(2):313–317.
- 10. Glänzel W, Debackere K, Thijs B, Schubert A. A concise review on the role of author self-citations in information science, bibliometrics and science policy. Scientometrics. 2006;67(2):263–277.
- 11. King MM, Correll SJ, Jacquet J, Bergstrom CT, West JD. Men set their own cites high: Gender and self-citation across fields and over time. In: American Sociological Association Annual Meeting; 2015. Available from: http://convention2.allacademic.com/one/asa/asa/.
- 12. King MM, Bergstrom CT, Correll SJ, Jacquet J, West JD. Men set their own cites high: Gender and self-citation across fields and over time; 2016. Available from: http://arxiv.org/abs/1607.00376.
- 13. King MM, Bergstrom CT, Correll SJ, Jacquet J, West JD. Men Set Their Own Cites High: Gender and Self-citation across Fields and over Time. Socius: Sociological Research for a Dynamic World. 2017;3:237802311773890.
- 14. Wilson R. Lowered Cites. Chronicle of Higher Education. 2014;60(27):A24–A28.
- 15. Benderly BL. Men have a greater tendency to cite themselves, study says. Science. 2015.
- 16. Chawla DS. Men cite themselves more than women do. Nature. 2016;535. pmid:27414239
- 17. Ingraham C. New study finds that men are often their own favorite experts on any given subject. The Washington Post. 2016.
- 18. Rudman LA. Self-promotion as a risk factor for women: the costs and benefits of counterstereotypical impression management. Journal of personality and social psychology. 1998;74(3):629. pmid:9523410
- 19. Moss-Racusin CA, Rudman LA. Disruptions in women’s self-promotion: the backlash avoidance model. Psychology of Women Quarterly. 2010;34(2):186–202.
- 20. Fowler JH, Aksnes DW. Does self-citation pay? Scientometrics. 2007;72(3):427–437.
- 21. Blau FD, Kahn LM. The Gender Wage Gap: Extent, Trends, and Explanations. National Bureau of Economic Research; 2016. 21913. Available from: http://www.nber.org/papers/w21913.
- 22. Buffington C, Cerf B, Jones C, Weinberg BA. STEM Training and Early Career Outcomes of Female and Male Graduate Students: Evidence from UMETRICS Data Linked to the 2010 Census. The American economic review. 2016;106(5):333–338. pmid:27231399
- 23. Cameron EZ, White AM, Gray ME. Solving the Productivity and Impact Puzzle: Do Men Outperform Women, or are Metrics Biased? BioScience. 2016;66(3). https://doi.org/10.1093/biosci/biv173
- 24. West JD, Jacquet J, King MM, Correll SJ, Bergstrom CT. The role of gender in scholarly authorship. PloS one. 2013;8(7):e66212. pmid:23894278
- 25. Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR. Bibliometrics: Global gender disparities in science. Nature. 2013;504(7479):211–213. pmid:24350369
- 26. Rørstad K, Aksnes DW. Publication rate expressed by age, gender and academic position—A large-scale analysis of Norwegian academic staff. Journal of Informetrics. 2015;9(2):317–333.
- 27. Smith BN, Singh M, Torvik VI. A Search Engine Approach to Estimating Temporal Changes in Gender Orientation of First Names. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL’13; 2013. p. 199–208.
- 28. Sugimoto CR, Ni C, West JD, Larivière V. The academic advantage: Gender disparities in patenting. PloS one. 2015;10(5):e0128000. pmid:26017626
- 29. Simpson EH. The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society Series B (Methodological). 1951;13(2):238–241.
- 30. Blyth CR. On Simpson’s Paradox and the Sure-Thing Principle. Journal of the American Statistical Association. 1972;67(338):364–366.
- 31. Bickel PJ, Hammel EA, O’Connell JW. Sex Bias in Graduate Admissions: Data from Berkeley. Science. 1975;187(4175):398–404. pmid:17835295
- 32. Hutson SR. Self-Citation in Archaeology: Age, Gender, Prestige, and the Self. Journal of Archaeological Method and Theory. 2006;13(1):1–18.
- 33. Councill IG, Giles CL, Kan MY. ParsCit: an Open-source CRF Reference String Parsing Package. In: LREC. vol. 8; 2008. p. 661–667.
- 34. Torvik VI, Agarwal S. Ethnea—an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database; 2016. Available from: http://hdl.handle.net/2142/88927.
- 35. Torvik VI. Genni + Ethnea for the Author-ity 2009 dataset; 2018. Available from: https://doi.org/10.13012/B2IDB-9087546_V1.
- 36. Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data. 2009;3(3):1–29.
- 37. Torvik VI, Smalheiser NR. Author-ity 2009—PubMed author name disambiguated dataset; 2018. Available from: https://doi.org/10.13012/B2IDB-4222651_V1.
- 38. Torvik VI, Weeber M, Swanson DR, Smalheiser NR. A probabilistic similarity metric for MEDLINE records: A model for author name disambiguation. J Am Soc Inf Sci Technol. 2005;56(2):140–158.
- 39. Tuomela MS, Fegley BD, Torvik VI. Introducing the Author-ity Exporter, and a case study of geo-temporal movement of authors. In: METRICS Workshop ASIST Annual Meeting 2016; 2016. Available from: http://hdl.handle.net/2142/91612.
- 40. Mishra S, Fegley BD, Diesner J, Torvik VI. Self-citation analysis data based on PubMed Central subset (2002-2005); 2018. Available from: https://doi.org/10.13012/B2IDB-9665377_V1.
- 41. Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine. 2015;21(11/12). pmid:27170830
- 42. Torvik VI. MapAffil 2016 dataset—PubMed author affiliations mapped to cities and their geocodes worldwide; 2018. Available from: https://doi.org/10.13012/B2IDB-4354331_V1.
- 43. MEDLINE/PubMed Data Element (Field) Descriptions. Available from: https://www.nlm.nih.gov/bsd/mms/medlineelements.html.
- 44. Mishra S, Torvik VI. Quantifying conceptual novelty in the biomedical literature. D-Lib Magazine. 2016;22(11/12). pmid:27942200
- 45. Mishra S, Torvik VI. Conceptual novelty scores for PubMed articles; 2018. Available from: https://doi.org/10.13012/B2IDB-5060298_V1.
- 46. Torvik VI. Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009; 2018. Available from: https://doi.org/10.13012/B2IDB-4742014_V1.
- 47. D’Souza JL, Smalheiser NR. Three journal similarity metrics and their application to biomedical journals. PLoS ONE. 2014;9(12).
- 48. Bonzi S, Snyder HW. Motivations for citation: A comparison of self citation and citation to others. Scientometrics. 1991;21(2):245–254.
- 49. Garfield E. English–An International language for science. The Information Scientist, Dec. 1967;76:19–20.
- 50. Evans JA. Electronic Publication and the Narrowing of Science and Scholarship. Science. 2008;321(5887):395–399. pmid:18635800
- 51. Maliniak D, Powers R, Walter BF. The Gender Citation Gap in International Relations. International Organization. 2013;67:889–922.
- 52. Booth M. The almost nearly perfect people: The truth about the Nordic miracle. London, UK: Jonathan Cape; 2014.
- 53. MacRoberts MH, MacRoberts BR. Problems of citation analysis. Scientometrics. 1996;36(3):435–444.
- 54. Zuckerman HA. Patterns of name ordering among authors of scientific papers: A study of social symbolism and its ambiguity. American Journal of Sociology. 1968;74:276–291.
- 55. Ceci SJ, Ginther DK, Kahn S, Williams WM. Women in academic science A changing landscape. Psychological Science in the Public Interest. 2014;15(3):75–141. pmid:26172066
- 56. Leahey E. Gender differences in productivity: Research specialization as a missing link. Gender & Society. 2006;20(6):754–780.
- 57. Smith KA, Arlotta P, Watt FM, Solomon SL, Arlotta P, Bargmann C, et al. Seven Actionable Strategies for Advancing Women in Science, Engineering, and Medicine. Cell Stem Cell. 2015;16(3):221–224. pmid:25748929
- 58. Lorber J. Women physicians: Careers, status, and power. New York, NY: Tavistock; 1984.
- 59. Leahey E. Not by Productivity Alone: How Visibility and Specialization Contribute to Academic Earnings. American Sociological Review. 2007;72(4):533–561.
- 60. Norman G. Data dredging, salami-slicing, and other successful strategies to ensure rejection: twelve tips on how to not get your paper published. Advances in Health Sciences Education. 2014;19:1–5. pmid:24473751
- 61. Valenta A, Brooks I, Laureto R, Ramaprasad A. Breaking the silo. Using informatics to support clinical and translational science. Journal of healthcare information management: JHIM. 2006;21(4):15–17.