Gender disparity in computational biology research publications

While women are generally underrepresented in STEM fields, there are noticeable differences between fields. For instance, the gender ratio in biology is more balanced than in computer science. We were interested in how this difference is reflected in the interdisciplinary field of computational/quantitative biology. To this end, we examined the proportion of female authors in publications from the PubMed and arXiv databases. There are fewer female authors on research papers in computational biology, as compared to biology in general. This is true across authorship position, year, and journal impact factor. A comparison with arXiv shows that quantitative biology papers have a higher ratio of female authors than computer science papers, placing computational biology in between its two parent fields in terms of gender representation. Both in biology and in computational biology, a female last author increases the probability of other authors on the paper being female, pointing to a potential role of female PIs in influencing the gender balance.


Introduction
There is ample literature on the underrepresentation of women in STEM fields and the biases contributing to it. Those biases, though often subtle, are pervasive in several ways: they are often held and perpetuated by both men and women, and they are apparent across all aspects of academic and scientific practice.
scientists is more likely to be attributed to a male colleague [11], and biographies of successful female scientists perpetuate gender stereotypes [12]. Finally, the way in which evidence for gender bias is received is in itself biased: male scientists are less likely to accept studies that point to the existence of gender bias than are their female colleagues [13].

Although gender imbalance seems to be universal across all aspects of the scientific enterprise, there are also more nuanced effects. In particular, not all disciplines are equally affected. For instance, in the biosciences over half of PhD recipients are now women, while in computer science, it is less than 20% [14]. This raises an intriguing question, namely how do the effects of gender persist in interdisciplinary fields where the parent fields are discordant for female representation?

To this end, we are interested in the gender balance in computational biology and how it compares to other areas of biology, since computational biology is a relatively young field at the disciplinary intersection between biology and computer science. We examined authorship on papers from PubMed published between 1997 and 2014 and compared computational biology to biology in general. We found that in computational biology, there is a smaller proportion of female authors overall, and a lower proportion of female authors in first and last authorship positions than in all biological fields combined. This is true across all years, though the gender gap has been narrowing, both in computational biology and in biology overall. A comparison to computer science papers shows that computational biology stands between biology and computer science in terms of gender equality.

In order to determine if there is a difference in the gender of authors in computational biology compared to biology as a whole, we used data from PubMed, a database of biology and biomedical publications administered by the US National Library of Medicine. PubMed uses Medical Subject Heading (MeSH) terms to classify individual papers by subject. The MeSH term "Computational Biology" is a subset of "Biology" and was introduced in 1997, so we restricted our analysis to primary articles published after this date (see S1

To determine the gender of authors, we used the web service Gender-API.com, which curates a database of first names and associated genders from government records as well as social media profiles. Gender-API searches provide information on the likely gender as well as confidence in the estimate based on the number of times a name appears in the database. We used bootstrap analysis to estimate the probability (P_female) that an author in a particular dataset is female as well as a 95% confidence interval (see Materials and Methods).

We validated this method by comparing it to a set of 2155 known author:gender pairs from the biomedical literature provided by Filardo et al. [15]. Filardo and colleagues manually determined the genders of the first authors for over 3000 papers by searching for authors' photographs on institutional web pages or social media profiles like LinkedIn. We compared the results obtained from our method of computational inference of gender for a subset of this data (see Materials and Methods) to the known gender composition of this author set. Inferring author gender using Gender-API data suggested that P_female = 0.373 ± 0.023 (Supplementary Fig 1C, black bar). Because the actual gender of each of these authors is known, we could also calculate the actual P_female.
Using the same bootstrap method on actual gender (known female authors were assigned P_female = 1, known male authors were assigned P_female = 0), we determined that the real P_female = 0.360 ± 0.018 (S1 Fig C, Table 1). We observed the same trend in papers labeled with the computational biology (comp) MeSH term, though the P_female at every author position was 4-6 percentage points lower. An analysis of publications by year suggests that the gender gaps in both biology and computational biology are narrowing, but by less than 1 percentage point per year (for bio, change in P_female = 0.0035 ± 0.0005/year; for comp, change in P_female = 0.0049 ± 0.0008/year). However, the discrepancy between biology and computational biology has been consistent over time (Fig 1B).

Fig 1.
A: Mean probability that an author in a given position is female for primary articles indexed in PubMed with the MeSH term Biology (black) or Computational Biology (grey). The bio dataset is inclusive of papers in the comp dataset. Error bars represent 95% confidence intervals. B: Mean probability that an author is female for publications in a given year. Error bars represent 95% confidence intervals. C: Mean probability that the first (F), second (S), penultimate (P) or other (O) author is female for publications where the last author is male (P_female < 0.2) or female (P_female > 0.8). Papers where the gender of the last author was uncertain or could not be determined were excluded. Error bars represent 95% confidence intervals.
One possible explanation for the difference in male and female authorship position might be a difference in role models or mentors. If true, we would expect studies with a female principal investigator to be more likely to attract female collaborators. Conventionally in biology, the last author on a publication is the principal investigator on the project. Therefore, we looked at two subsets of our data: publications with a female last author (P_female > 0.8) and those with a male last author (P_female < 0.2). We found that women were substantially more likely to be authors at every other position if the paper had a female last author than if the last author was male (Fig 1C, Table 2). It is possible that female trainees are more likely to pursue computational biology if they have a mentor who is also female. Since women are less likely to be senior authors, this might reduce the proportion of women overall. However, we cannot determine if the effect we observe is instead due to a tendency for women who pursue computational biology to select female mentors.

Though MeSH terms enable sorting a large number of papers regardless of where they are published, the assignment of these terms is a manual process and may not be comprehensive for all publications. As another way to qualitatively examine gender differences in publishing, we examined different journals, since some journals specialize in computational papers, while others are more general. We looked at the 123 journals that had at least 1000 authors in our bio dataset, and determined P_female for each journal separately (Fig 2A). Of these journals, 21 (14%) have titles indicative of computational biology or bioinformatics, and these journals have substantially lower representation of female authors. The 3 journals with the lowest female

Fig 2.
A: Mean probability that an author is female for every journal that had at least 1000 authors in our dataset.
Grey bars represent journals that have the words "Bioinformatics," "Computational," "Computer," "System(s)," or "-omic(s)" in their title. Vertical line represents the median for female author representation. See also S1 Table. B: Mean probability that an author is female for articles in the "Bio" dataset (black dot) or in the "Comp" dataset (open square) for each journal that had at least 1000 authors plotted against the journals' 2014 impact factor. Journals that had computational biology articles are included in both datasets. An ordinary least squares regression was performed for each dataset. Bio: m = −0.00264, P_{Z>|z|} = 0.0022. Comp: m = −0.000789, P_{Z>|z|} = 0.568.
One possible explanation might be that women are less likely to publish in high-impact journals, so we considered whether the differences in author gender that we observe could result from differences in impact factor between biology and computational biology publications. We compared the P_female of authors in each journal with that journal's 2014 impact factor (Fig 2B). There is a small but significant negative relationship (slope m = −0.00264, P_{Z>|z|} = 0.0022) between impact factor and P_female for the biology dataset. This is in contrast to previous studies from engineering that have found that women tend to publish in higher-impact journals [2]. It is, however, consistent with a previous study from mathematics [1]. By contrast, there is no significant correlation (P_{Z>|z|} = 0.568) between impact factor and P_female in computational biology publications. Further, for journals that have articles labeled with the computational biology MeSH term, the P_female for those articles is the same or lower than that for all biology publications in the same journal.

We also examined whether computational biology or biology articles tend to have higher impact factors. Bootstrap analysis of authors in each dataset suggests that computational biology publications tend to be published in journals with a higher impact factor (ĪF = 7.25 ± 0.04) than publications in biology as a whole (ĪF = 6.5 ± 0.02). However, given the magnitude of the correlation between IF and P_female, this difference is unlikely to explain the differences in P_female observed between our computational biology and biology datasets. Taken together, these data suggest that the authors of computational biology papers are less likely to be women than the authors of biology papers generally.

We turned next to an investigation of biological fields relative to computer science. Since PubMed does not index computer science publications, we cannot compare the computational biology dataset to computer science research papers directly. Instead, we investigated the gender balance of authors of manuscripts submitted to arXiv, a preprint repository for academic papers used frequently by quantitative fields like mathematics and physics. These preprint records cannot be compared to peer-reviewed publications indexed on PubMed, but a "quantitative biology" (qb) section was added to arXiv in 2003. Quantitative biology is not necessarily equivalent to computational biology, and analysis of arXiv-qb papers that have been published and indexed on PubMed suggests that only a fraction of them are labeled with the "computational biology" MeSH term. However, this does allow us to make an apples-to-apples comparison between a field of biology and computer science. There are relatively few preprints prior to 2007, so we compared preprints in "quantitative biology" to those in "computer science" from 2007 to 2016.

Women were more likely to be authors in quantitative biology manuscripts than in computer science manuscripts in first, second, and middle author positions (Fig 3A, Table 3).
We found no significant difference in the frequency of female authors in the last or penultimate author positions in these two datasets, though the conventions for determining author order are not necessarily the same in computer science as in biology. Nevertheless, women had higher representation in quantitative biology than in computer science for all years except 2009 (Fig 3B). Interestingly, there is a slight but significant (0.0052/year, P_{Z>|z|} < 0.005) increase in the proportion of female authors over time in quantitative biology, while there is no significant increase in female representation in computer science preprints.

Fig 3.
A: Mean probability that an author in a given position is female for all preprints in the arXiv quantitative biology (black) or computer science (grey) categories between 2007 and 2014. Error bars represent 95% confidence intervals. B: Mean probability of authors being female in arXiv preprints in a given year. Error bars represent 95% confidence intervals. Slopes were determined using ordinary least squares regression. The slope for q-bio is slightly positive (p < 0.05), but the slope for cs is not.
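The per-year slopes reported here were determined using ordinary least squares regression. As a minimal sketch of such a fit, assuming one mean P_female value per year, the closed-form slope and intercept can be computed as below. The function name `ols_fit` and the example values are hypothetical and are not taken from the data in this study; the significance test behind the reported P values is not covered.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic illustration only (not study data): an exact line that rises
# by 0.005 per year from an arbitrary baseline of 0.20 in 2007.
years = list(range(2007, 2017))
p_female = [0.20 + 0.005 * (year - 2007) for year in years]
slope, intercept = ols_fit(years, p_female)
```

With real yearly means, the slope is the estimated change in P_female per year, which is the quantity compared between the q-bio and cs preprints.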
Taken together, our results suggest that computational biology lies between biology in general and computer science when it comes to gender representation in publications. This is perhaps not surprising given the interdisciplinary nature of computational biology. Compared to biology in general, computational biology papers have fewer female authors, and this is consistent across all authorship positions. Importantly, this difference is not due to a difference in impact factor between computational biology and general biology papers.

Articles with a female last author tend to have more female authors in other positions and this is true for both biology in general and computational biology. Since the last author position is most often occupied by the principal investigator of the study, this suggests that having a woman as principal investigator has a studied the nature of authorship contribution by gender in PLoS publications [3]. They found that if the corresponding author of a paper was female, then there was also a greater proportion of women across almost all authorship roles (data analysis, experimental design, performing experiments, and writing the paper). In contrast, if the corresponding author was male, then men dominated all authorship roles except for performing experiments, which remained female-dominated. The reasons for this are difficult to ascertain. It could be the case that female PIs tend to work in more female-dominated sub-fields and therefore naturally have more female co-authors. It is also possible that female PIs are more likely to recognise contributions by female staff members, or that they are more likely to attract female co-workers and collaborators. Our publication data cannot differentiate between these (and other) explanations, but point to the important role that women in senior positions may play as role models for trainees.

Since biology attracts more women than computer science, we suspect that many women initially decide to study biology and later become interested in computational biology. If this is the case, understanding what factors influence the field of study will provide useful insight when designing interventions to help narrow the gender gap in computer science and computational biology.

Mean gender probabilities were determined using bootstrap analysis. Briefly, for each dataset, authors were randomly sampled with replacement to generate a new dataset of the same size. The mean P_female for each sample was determined excluding names for which no gender information was available (26.6% of authors). The reported P_female represents the mean of means for 1000 samples. Error bars in figures represent 95% confidence intervals.

Supporting Information

Grey represents the known proportion of female authors when excluding names for which the gender could not be computationally inferred. Error bars represent 95% confidence intervals.
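The bootstrap procedure described in Materials and Methods (sampling authors with replacement, averaging P_female over names with gender information, and taking the mean of 1000 sample means) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the analysis: the function name, the use of `None` for names without gender information, the fixed random seed, and the percentile method for the 95% interval are all assumptions made for this sketch.

```python
import random
from statistics import mean


def bootstrap_p_female(probs, n_samples=1000, seed=0):
    """Bootstrap the mean P_female of one dataset.

    probs: one entry per author, a float in [0, 1], or None when no
           gender information is available for the name (assumed encoding).
    Returns (mean_of_means, (lower, upper)), where the interval is a
    95% percentile interval over the bootstrap means (assumed method).
    """
    rng = random.Random(seed)
    n = len(probs)
    means = []
    for _ in range(n_samples):
        # Resample authors with replacement to a dataset of the same size
        sample = rng.choices(probs, k=n)
        # Exclude names without gender information before averaging
        known = [p for p in sample if p is not None]
        if known:
            means.append(mean(known))
    if not means:
        raise ValueError("no author in the dataset has gender information")
    means.sort()
    lower = means[int(0.025 * len(means))]
    upper = means[int(0.975 * len(means)) - 1]
    return mean(means), (lower, upper)


# Synthetic illustration only (not study data): 36 known female authors,
# 64 known male authors, and 30 names without gender information.
example = [1.0] * 36 + [0.0] * 64 + [None] * 30
estimate, interval = bootstrap_p_female(example)
```

Coding known genders as 1.0 and 0.0, as in the synthetic example, mirrors how the validation set with known author genders was treated.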

S2 Fig. A: Mean probability that an author in a given position is female for primary articles indexed in PubMed with the MeSH term Biology (black), Computational Biology (grey) or for those articles with Biology but not Computational Biology (white). Error bars represent 95% confidence intervals. B: Mean probability that an author is female for articles in the "Bio" dataset (black), in the "Comp" dataset (white), or for articles in Bio but not Comp (grey) for each journal that had at least 1000 authors plotted against the journals' 2014 impact factor. Excluding computational publications from the biology dataset does not substantially alter the correlation between impact factor and P_female.

S1 Table. P_female for each journal with at least 1000 authors in the bio dataset. Journals identified as primarily computational are shaded grey.

