Self-citation is the hallmark of productive authors, of any gender

It was recently reported that men self-cite >50% more often than women across a wide variety of disciplines in the bibliographic database JSTOR. Here, we replicate this finding in a sample of 1.6 million papers from Author-ity, a version of PubMed with computationally disambiguated author names. More importantly, we show that the gender effect largely disappears when accounting for prior publication count in a multidimensional statistical model. Gender has the weakest effect on the probability of self-citation among an extensive set of features tested, including byline position, affiliation, ethnicity, collaboration size, time lag, subject-matter novelty, reference/citation counts, publication type, language, and venue. We find that self-citation is the hallmark of productive authors, of any gender, who cite their novel journal publications early and in similar venues, and more often cross citation-barriers such as language and indexing. As a result, papers by authors with short, disrupted, or diverse careers miss out on the initial boost in visibility gained from self-citations. Our data further suggest that this disproportionately affects women because of attrition and not because of disciplinary under-specialization.


Introduction
Citing one's own work is common practice, an essential part of scientific communication [1] that reflects the accumulative nature of research [2,3], but it can be viewed negatively and is commonly discounted or penalized in impact metrics [4,5]. The bibliometrics literature is rich in barriers and motivations for citation preferences [6][7][8], many of which overlap for self-citations [9] and vary, e.g., by collaboration size [10]. Yet, what encourages authors to self-cite is not well understood. A recent study in particular [11][12][13] has gained high-profile attention [14][15][16][17] in reporting that men self-cite >50% more often than women, an effect that is consistent across a wide variety of disciplines and has grown over time, reaching its peak of 70% in recent years. These effects are indeed jaw-dropping, align with gender stereotypes [18,19], and imply that women may be severely disadvantaged because self-citations are not just additive but also attract other citations [20]. King et al. [13] offer a list of five possible explanations and ". . .consider it likely that several may contribute to the gender gap. . ." but are unable to test
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: Vetle Torvik is a coinventor of the Author-ity dataset, which is subject to licensing to third parties (for non-academic or for-profit use) by the University of Illinois, which owns this intellectual property. However, the Author-ity dataset is provided free of charge for non-profit academic research. This does not alter
includes a partial analysis of a third, slightly smaller dataset for middle authors, represented by all second-listed authors on the subset of papers with three or more authors, to confirm that the overall results do not differ significantly.
Gender is imputed as Male, Female, or Unknown using Genni 2.0 + Ethnea [27,34,35], which covers names worldwide and is ethnicity-sensitive. That is, it can make accurate predictions even for names that are rare in the USA but common elsewhere (such as Shubhanshu and Vetle), and it makes use of Ethnea's ethnicity prediction to resolve genders for some names that are unisex worldwide but gender-specific regionally (e.g., Andrea is labeled female when the name is English and male when it is Italian). Table 1 shows the improved coverage and potential bias reduction compared to gender assignment based on US Social Security Administration (SSA) data across a variety of ethnicities. Only the top 13 most common ethnicities (and UNKNOWN) are included, while the remaining ethnicities were pooled and labeled OTHER. Overall, Genni and SSA predictions rarely disagree except for French and Italian names, where they disagree on more than 3% of authorships. As expected, Genni has higher coverage (e.g., 88.6% vs. 74.6% of all last authorships), more so for non-English names such as Nordic (96.3% vs. 65.7%) or Indian (68.3% vs. 45.0%) and less so for English names (95.4% vs. 91.7%). The lack of overall coverage is largely due to a large number of Chinese names that are difficult to classify (67.2% of 150k Chinese first authorships are labeled Unknown). Genni provides a slightly lower estimate of the overall Female proportion (33.0% of first authorships and 19.2% of last authorships) compared to SSA's (34.8% of first authorships and 20.7% of last authorships). However, these proportions vary dramatically across byline position and ethnicities (ranging from 6.1% Japanese Female last authorships to 44.5% Slav Female first authorships). Taken together, this suggests that ethnicity and byline position are important covariates of gender in bibliometric studies broadly.
In the spirit of data sharing and encouraging reproducibility, the Author-ity Exporter [39] web-interface was created to permit any user to search, browse, and export data from the annotated authors and papers in the Author-ity dataset. Furthermore, the subset of data based on PubMed Central is also shared [40].

Explanatory features modeled
Each observation (instance) in the datasets captures features about a particular citation and the paper in which it appears. More concretely, features include aspects of (a) a given paper, (b) the paper it cites, (c) similarity or nearness between the paper and the cited paper, and (d) the first or last author, namely professional age, gender, ethnicity, and country of affiliation (inferred using MapAffil [41,42]). The features used for modeling are listed in Table 2. The distribution of instances in the two datasets for a select group of categorical features is shown in Table 3. Several of these features capture known motivations for self-citation narrowly, and citation broadly: (a) prior citations: authors tend to cite papers that have previously received citations; (b) time: one cannot (typically) cite papers that do not yet exist and self-citations might appear sooner [48]; (c) publication count: an author cannot self-cite if they don't have any published (or working) papers; (d) language [43]: one is less likely to cite papers that one cannot understand [49]; (e) disciplinary barriers and topical diversity (encoded by indicating whether the article and its reference were published in the same or similar journal as captured by the exact name match and the implicit journal score [46], which is similar to author odds ratio [47]): academic careers often depend on intra- vs. inter-disciplinary citations and scientists who jump from one topic to another are less likely to cite themselves; (f) accessibility and discoverability: what one cites may be limited by how easy it is to find and obtain physically [50]; (g) publication type [43]: one may cite writing, not necessarily research; literature reviews tend to be cited more often; (h) novelty or topical narrowness: articles on young topics tend to be cited more often [44]; (i) collaboration size: as the number of co-authors increases, the individual opportunity for self-citation may decrease.
country is the country of affiliation of the first-listed author on the article in question, as inferred by MapAffil [41,42]. Each article is assigned one of the following countries: Australia, Canada, China, France, Germany, India, Italy, Japan, Netherlands, Other, Spain, Sweden, UK, Unknown, or USA.
collaboration size is the number of authors on the article in question, capped at 20.
language is 1 if the article was written in English as tagged in MEDLINE [43], 0 otherwise.
reference count is the total number of references listed on the article in question.
MeSH count is the number of MeSH terms (and all their unique ancestors in the MeSH tree structure) as assigned in MEDLINE.
novelty score is the number of prior papers for the youngest MeSH term assigned to the article in question (the so-called "volume novelty" in [44,45]).
pub type is the publication type(s) of the referenced article as tagged in MEDLINE [43]. Each article can have one or more of the following: "journal article", "case report", "review article", and "letter/editorial/comment".
venue is encoded by indicating whether the article and its reference were published in the same or similar journal as captured by the exact name match and the implicit journal score [46], which is similar to author odds ratio [47].
ref. novelty score is the number of prior papers for the youngest MeSH term assigned to the referenced article (the so-called "volume novelty" in [44,45]).
ref. pub type is the publication type of the referenced article as tagged in MEDLINE [43]. Each article can have one or more of the following: "journal article", "case report", "review article", and "letter/editorial/comment".
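To make the encoding of individual features concrete, the following is a minimal sketch, assuming a hypothetical raw record per citation. The field names, the "eng" language code, and the helper `encode` are illustrative assumptions, not the actual pipeline; the venue feature is reduced here to the exact journal name match only, and the implicit journal score [46] is omitted.

```python
from dataclasses import dataclass

MAX_COLLAB_SIZE = 20  # the feature list caps collaboration size at 20


@dataclass
class Citation:
    """Hypothetical raw record for one (article, reference) pair."""
    n_authors: int      # number of authors on the citing article
    language: str       # language tag of the citing article (assumed code, e.g. "eng")
    n_references: int   # total references listed on the citing article
    journal: str        # journal of the citing article
    ref_journal: str    # journal of the referenced article


def encode(c: Citation) -> dict:
    """Encode a raw record into a few of the model features described above."""
    return {
        "collaboration_size": min(c.n_authors, MAX_COLLAB_SIZE),
        "language": 1 if c.language == "eng" else 0,
        "reference_count": c.n_references,
        # Exact journal name match only; journal similarity is not modeled here.
        "same_journal": 1 if c.journal == c.ref_journal else 0,
    }


example = Citation(n_authors=27, language="fre", n_references=35,
                   journal="Biochem J", ref_journal="Biochem J")
features = encode(example)
print(features)
```

In this sketch a 27-author paper is encoded with a collaboration size of 20, and a non-English paper receives a language value of 0.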

Simple characterizations of age-normalized self-citation rates
Before presenting the full model, we start with simple characterizations of self-citation as a function of author age and gender using focused subsets of the datasets. First, the data are restricted to papers with 10 to 60 references, which will focus the characterization on primary research articles and exclude papers with unusual citation patterns such as short comments and letters, as well as long review papers. Second, the data are restricted to instances where the gender is known (labeled Male or Female). This leaves 26.2 million first author references, and 27.5 million last author references. The bottom panels in Fig 1 show that women tend to be younger (as measured by prior publication count) both as first-listed and last-listed authors, and thus have fewer opportunities for self-citation. The top panels in Fig 1 show the overall relationships between self-citation rate and author age where the self-citation rate is modeled as a logistic regression function of gender and author age (pub count). The horizontal lines show that overall self-citation rates differ dramatically between women and men. Men self-cite 46% more often than women as first authors (5.79% vs. 3.95%), and 27% more often as last authors (9.93% vs. 7.83%). However, the logistic regression curves that account for authors' prior publication counts are nearly identical. It appears that the age-normalized self-citation rate among men is about 1.9% (SE = 0.2%) higher as first authors, but 2.1% (SE = 0.2%) lower as last authors. However, these differences are within the error rates of the techniques used to disambiguate author names and predict gender, so we cannot say for sure that these differences are real. This preliminary characterization reveals that there are more important factors than gender that govern self-citation. S1 Fig shows that the gender effect for middle authors, as represented by the second author on papers with three or more authors, is similar to that of last authors. 
The age-normalized self-citation rate among men is about 2.2% (SE = 0.2%) lower than among women. Middle authors self-cite less overall (4.67% for men and 3.32% for women) compared to first and last authors. However, this effect can be explained, at least partly, by the fact that the first and last author datasets include two-author papers, where authors tend to self-cite more. Excluding the two-author papers reduced the dataset from 1.6 to 1.3 million papers and from 41.6 to 34.0 million references.
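The unadjusted "more often" percentages above are relative differences between two raw rates. As a small worked check, the sketch below recomputes them from the rates quoted in the text; the function name `relative_excess` is an illustrative choice.

```python
def relative_excess(rate_a: float, rate_b: float) -> float:
    """Percent by which rate_a exceeds rate_b."""
    return 100.0 * (rate_a / rate_b - 1.0)


# Raw self-citation rates (percent of references) quoted in the text: (men, women).
raw_rates = {
    "first": (5.79, 3.95),
    "last": (9.93, 7.83),
    "middle": (4.67, 3.32),
}

excess = {position: relative_excess(men, women)
          for position, (men, women) in raw_rates.items()}

for position, value in excess.items():
    print(f"{position} authors: men self-cite {value:.1f}% more often (unadjusted)")
```

The first and last author values come out at roughly 46 to 47% and 27%, in line with the figures reported above. These are raw ratios only and say nothing about the age-normalized comparison.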
For a secondary part of this simple characterization, the most prolific journals were selected (each with at least 20,000 references). These journals capture variations within and between three broad subject areas: general science, biology, and medicine, as shown in Table 4. In general, the journal-subsetting yields bigger effects but much less statistical significance. It should also be noted that the sample sizes appear large because each reference is counted
After adjusting for age, men tend to self-cite more as first authors but less as last authors in the majority of the journals, a finding that is consistent with Fig 1. The majority of individual journal effects are statistically insignificant (with p-value > 0.05), particularly among medical journals. After adjusting for age, there is one journal where men self-cite more as both first and last authors (Biochem J), and one journal (Cancer Res) where women self-cite more as both first and last authors. These effects may be the result of additional confounding factors that are not considered in this preliminary analysis but are presented next.

A probabilistic model of self-citation
The observed citations were modeled using logistic regression

\Pr(\text{is self-citation} = 1) = \frac{1}{1 + e^{-\beta_0 - \beta_1 x_1 - \beta_2 x_2 - \cdots - \beta_n x_n}}

where x_1, x_2, ..., x_n denote the explanatory features described in Table 2 and the binary outcome indicates whether or not an observation is a self-citation. The use of logistic regression as the modeling framework is justified by the fits shown in S5 Fig, particularly the linearity assumption. Local intercepts, indicators, and polynomials were introduced to capture non-linearities and facilitate a closer fit to the data. The complete models of self-citation for first authors and last authors are detailed in Table 5. Although the model is inadequate for predicting whether any given citation was a self-citation or not, given its low precision and recall (S6 Fig), prediction was not the aim of this study. Rather, the purpose is to measure the degree to which each of the encoded features influences self-citations, as captured by the effects (or weights) listed in this table. Note that one of the columns includes simple models of individual categories for first authors, to be compared to the full first author model so that the degree of confounding can be assessed.
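The following sketch illustrates how a weight in such a logistic model is read as an odds ratio: changing one feature by one unit, with the others held fixed, multiplies the odds of self-citation by the exponential of that feature's weight. The intercept and weights below are hypothetical values chosen for illustration and are not taken from Table 5.

```python
import math


def self_citation_probability(beta0: float, betas: list, xs: list) -> float:
    """Logistic model: 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i)))."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Hypothetical intercept and weights for two features (illustration only).
beta0 = -3.0
betas = [0.5, -0.2]

p_without = self_citation_probability(beta0, betas, [0, 1])
p_with = self_citation_probability(beta0, betas, [1, 1])

# The ratio of odds equals exp(betas[0]), whatever the value of the other feature.
odds_ratio = odds(p_with) / odds(p_without)
print(odds_ratio, math.exp(betas[0]))
```

This is why effects in a logistic model are naturally reported as "x% higher odds" rather than as differences in probability.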
A few additional aspects of the data and its encoding as well as model specification and model interpretation deserve mention:
• Because language, ethnicity, and affiliation are tightly correlated, given the multi-ethnic composition of modern societies, our approach to language is simplistic. For example, we test only whether the language of the citing or cited paper is English, including translations. Thus, the results of this study cannot be the basis for a claim that one ethnicity self-cites more than another, although the results may offer clues about culture categorically. Among other features, journal similarity captures author links between journals; the higher the score, the more likely self-citation, reflecting a broad rather than narrow pattern of citation.
• The novelty of an article or reference is determined by the relative frequency of its associated MeSH terms [44,45]. The features novelty score and ref. novelty score follow a similar encoding, where lower scores indicate more novel papers.
• Due to the indexing policies of the National Library of Medicine (NLM) for PubMed during the period considered, only the first author's affiliation is retained and, for this study, assigned to the last author as well. Thus, for the model of last authors, affiliation (country) and ethnicity, which depends to a degree on affiliation, are less accurate and not as tightly coupled with ethnicity and language as for the model of first authors.

Gender effect
A gender effect does exist, but its magnitude and relative importance depend on what other factors a model includes (see Fig 2). For first authors, when considering gender only, males self-cite approximately 54% more often than females, a percentage congruent with the findings of [11,51]. However, when controlling for other factors, the difference becomes negligible (Table 5;    paradox), which means that conclusions drawn from analysis of the individual model are unreliable [29,30].
Fig 2. Only the model at the final step is the best-fitting among them. In both models, confounding factors ultimately minimize the effect of gender in self-citation; the most influential of them is author's publication count (note Table 6). Y-axis is on log scale. https://doi.org/10.1371/journal.pone.0195773.g002
Table 6. Fit statistics for individual and accretive models of self-citation based on 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005. The best-performing model at each step is the one with the largest log-likelihood (LL); only the highest-ranking of which are shown in steps 2 and following. Models comprise the predictors from the best-performing models in all previous steps along with the newly added category indicated by the plus sign (+). AUC (Area Under the receiver operating characteristic Curve), given as a percentage, roughly measures the accuracy of estimated probabilities. The number of terms in the model is denoted by nf, excluding intercept.
In the presence of confounding factors, the models show that gender has little effect on self-citation after all. Gender is the least explanatory of all the features modeled for both first and last authors when factors are ranked by contribution (Table 6). For first authors, a plot of the self-citation odds ratio of male and female gender with respect to the unknown gender shows that the known gender's influence effectively disappears after inclusion of (a) prior citation count of the reference before being cited by the paper, (b) author's publication count before source paper's published year, and (c) publication type; for last authors, time lag (the number of years between the source and the cited paper). Factoring in author's publication count results in the largest reduction of the gender effect (Fig 2). Moreover, the relationship between male and female self-citing behavior is effectively identical when affiliation is added for first authors (step 7) and when journals is added for last authors (step 4); the relationship reverses and diverges thereafter, suggesting that females self-cite more often than males when all else is equal. However, the effect is negligible and probably below the level detectable using our methods.
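The way a confounder such as prior publication count can produce a large crude gender gap with no gap at all within experience levels can be illustrated with a toy calculation. The counts below are entirely synthetic, chosen only to show the mechanism (a pattern often discussed as Simpson's paradox), and are not drawn from the Author-ity data.

```python
# Synthetic (references, self_citations) counts per experience stratum and gender.
# Within each stratum, men (M) and women (F) self-cite at exactly the same rate;
# women are simply more concentrated in the low-experience stratum.
counts = {
    "low":  {"M": (4000, 80),  "F": (7000, 140)},   # 2% self-citation rate for both
    "high": {"M": (6000, 600), "F": (3000, 300)},   # 10% self-citation rate for both
}


def rate(references: int, self_citations: int) -> float:
    """Self-citation rate as a fraction of references."""
    return self_citations / references


def crude_rate(gender: str) -> float:
    """Overall rate for a gender, ignoring the experience stratum."""
    references = sum(counts[s][gender][0] for s in counts)
    self_citations = sum(counts[s][gender][1] for s in counts)
    return self_citations / references


crude_ratio = crude_rate("M") / crude_rate("F")
stratum_ratios = {s: rate(*counts[s]["M"]) / rate(*counts[s]["F"]) for s in counts}

print(f"crude M/F ratio: {crude_ratio:.2f}")
print("within-stratum M/F ratios:", stratum_ratios)
```

In this synthetic example the crude ratio is well above 1 while every within-stratum ratio is exactly 1, which is the qualitative pattern the full model reveals once author publication count is included.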

Factors affecting self-citation
The features that explain the most self-citation in both first author and last author models have more to do with opportunity, accessibility, and visibility than gender and culture (ethnicity, language, affiliation). The former aspects are illustrated in Fig 3, while the latter are illustrated in Fig 4, using related coefficients from the complete models detailed in Table 5.
Opportunity is a major factor driving self-citation. An author cannot legitimately self-cite without having produced work to cite; and a paper on a novel topic will have fewer papers to reference. The choice among citeable papers is bounded. Thus, the more self-authored papers one has available, the more opportunity one has for self-citation. The model captures this effect. Going from none to one prior paper increases the odds of self-citation 4-fold, while the increase is 5-fold going from 1 to 10 prior papers (Fig 3 and Table 5). Note that it is possible to self-cite with no prior papers, e.g., when the cited paper is published simultaneously or in the same year. Although rare, some cited papers are even published in future years, e.g., due to extensive journal backlogs or review processes. When a paper has few references, each is more likely to be a self-citation compared to those with a normal or high number of references. Furthermore, the odds of self-citation are highest for first and last authors on 2-author papers, suggesting some limits in self-citing in the presence of more authors on the source paper.
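Because these effects are multipliers on odds rather than on probabilities, a 4-fold increase in odds does not mean a 4-fold increase in the self-citation rate. The sketch below converts the multipliers to probabilities under two assumptions that are not from the paper: a hypothetical baseline probability of 1% for an author with no prior papers, and a reading of the 5-fold increase as a further multiplication applied on top of the 4-fold one.

```python
def apply_odds_multiplier(p: float, k: float) -> float:
    """Return the probability obtained by multiplying the odds of p by k."""
    new_odds = k * p / (1.0 - p)
    return new_odds / (1.0 + new_odds)


baseline = 0.01  # hypothetical self-citation probability with no prior papers

p_one_paper = apply_odds_multiplier(baseline, 4)      # none -> one prior paper
p_ten_papers = apply_odds_multiplier(p_one_paper, 5)  # 1 -> 10 prior papers (assumed cumulative)

print(f"baseline: {baseline:.3f}")
print(f"one prior paper: {p_one_paper:.3f}")
print(f"ten prior papers: {p_ten_papers:.3f}")
```

Under these assumptions the probability rises by slightly less than the odds multiplier at each step, and the gap between the two widens as the probability grows.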
Accessibility and visibility are surely driving forces of citations generally, and as such, might counteract self-citations. After all, our model is designed to distinguish self-citations from citations by others. A journal article has 40% higher odds of self-citation than a review, while a reference to a journal article is 80% more likely to be a self-citation than a reference to a review article (see Fig 4). One is not likely to cite work that one cannot understand (due to language or discipline) or that is otherwise inaccessible (due to indexing issues or economic barriers such as paywalls or firewalls). Culture and language both act as barriers in this regard, limiting what can be cited by a particular audience. Language also reflects social norms, which are much more difficult to discern but tightly correlated with ethnicity and geography. A non-English reference has 20% higher odds of being a self-citation, while an English paper has 90% higher odds of including a self-citation. The odds of self-citation for a first author in Japan are 30% higher than for one in the USA, while the odds for a first author in Italy are 10% lower. Similarly, a first author with a Japanese name has 10% lower odds of self-citation compared to an author with an English name, while an author with an Italian name has 6% lower odds. Authors with Nordic names self-cite more than any other ethnicity, regardless of gender, a behavior that would contravene Janteloven [52] if construed as self-promotion or an attempt to "game the system"; but perhaps language is a barrier for the Nordic too.
Fig 4. Error bars indicate 95% confidence intervals. Among other interesting points, note that the likelihood of self-citation is least for last authors with non-USA affiliation, implying that self-citing is customary among USA authors. X-axis is on log scale. https://doi.org/10.1371/journal.pone.0195773.g004
Citations increase the visibility of a paper and presumably the visibility of its author(s). This is self-evident, so the result that prior citations of the reference is the most influential feature in the first author model is unsurprising. For first authors, a reference paper published in the same year as the paper citing it increases the odds of self-citation by 70%; for last authors, the same publication year doubles the odds. The effect of a multi-year difference is strongly tempered, meaning only recent references are more likely to be self-citations. What makes these particular results interesting is that, for both first and last authors, references with any prior citations decrease the odds of self-citation, more for first authors than last authors. Thus, for both first and last authors, the odds of self-citation increase when the reference is recent and has few if any citations. This corroborates the finding of Aksnes et al. [2] that most of the early citations of an article are self-citations, and might point to authors opportunistically bootstrapping the initial citations of their prior work through self-citations. The effect of gender is reduced when accounting for these additional factors, as evident from the dramatic change in odds of self-citation (Fig 2). This probabilistic model suggests that self-citation is more about opportunity, accessibility, and visibility than culture or gender. These results in the self-citation model vary only slightly when the analysis is conducted on a filtered data set where the country of affiliation is USA, ethnicity is ENGLISH, source language is ENGLISH, number of references is at least 10 (and at most 100 for first authors and 50 for last authors), and source publication type is JOURNAL. The coefficient for female gender remains the same, while the coefficient for male authors increases slightly for both first and last authors while becoming non-significant (p > 0.05). This indicates that the results are robust under the consideration of the largest portion of the data set (see Table 7).

Discussion
The takeaways from the present study are threefold. First, models that lack sufficient controls jeopardize the conclusions drawn from them, potentially with adverse effects on public perception and public policy. Human social behavior is complex and, thus, unlikely to be explained adequately without a diverse set of controls. The concern becomes clear when we consider the dramatic change in gender's effect in this study with the introduction of confounding factors. MacRoberts et al. [53] asserted the following: "Today, in spite of an overwhelming body of evidence to the contrary, citation analysts continue to accept the traditional view of science as a privileged enterprise free of cultural bias and self-interest and accordingly continue to treat citations as if they were culture free measures." If cultural bias and self-interest influence self-citations, they are perhaps more charitably regarded as aspects of opportunity, accessibility, and visibility. First and last author positions on a byline enhance the visibility of the authors who occupy them, even if the position itself indicates nothing about the relative contribution of the author [54]. However, if we assume that authors in the most prominent positions on a byline have more influence over citations in the manuscript, we should not have difficulty imagining that particular positions offer increased opportunity for self-citation. From the results herein, the opportunities likely depend on a reference's accessibility (physical, economic) as well as the degree to which the citation enhances the cited's visibility (in terms of the work's topical relevance, despite specialization, or the author's prominence).
Second, self-citation is the hallmark of highly productive authors, of any gender, who cite their novel journal publications early in similar venues. As a result, papers by authors with short, disrupted, or diverse careers lack the initial boost in visibility gained from self-citations.
This disproportionately affects women, who tend to leave (and enter) science at higher rates than men (see Table 8) and have different career trajectories for a variety of reasons [55]. Prior work has found that men specialize more than women and therefore can publish more [56]. However, in our dataset we find that men and women have the same age-normalized expertise, on average (Fig 5). In other words, attrition is the most likely driver of overall differences in self-citation rates, not topical specialization or citation strategies. Reducing attrition and achieving a more gender-balanced scientific workforce [57] is essential to improving scientific progress. It should help increase self-citation, a signal that more research is followed up by the scientists best situated to do so.
Third, the scientometric practice of discounting self-citations has the adverse effects of penalizing so-called lines of research and reducing the impact of some important papers already suffering from lower discoverability and accessibility because of non-English language, lack of bibliographic indexing, and so on. Authors most likely to be affected by such a penalty are perhaps the most dependent on sponsorship and patronage in science [58,59]. Citations reflect potentially many different authorial attitudes: to credit the source of inspiration; to aid the understanding of the reader; to assert authority in a field [48]; but self-citations acknowledge an individual's line of research. One might worry about "salami-slicing" [60] and disciplinary "siloing" [61], but one should also hope that scholars conduct research in the way that is effective for them individually. Tempering labels like being "your own favorite expert" [17] would be a good start. After all, a persistent lack of (self-)citations likely reflects dead-ends and orphans not even nurtured by the scholars with a vested interest.