A large-scale analysis of bioinformatics code on GitHub

In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.


Repository names
Throughout the Supplemental Information, the term "repository name" is used to refer to a GitHub username and repository separated by a forward slash, e.g., "pamelarussell/github-bioinformatics".

Authentication
In order to reproduce the data and results in the paper using the Perl pipeline and associated code, authentication is required for Google as well as the GitHub API. Anyone wishing to reproduce the data and results in the paper needs a Google account and a GitHub account.
Users will need to set up a project on BigQuery in order for dataset creation to work. The Google Sheets containing the repository lists need to be saved in the user's Google Drive account. The Google Sheets containing the repository lists formatted for import into BigQuery also need to be saved in the Google Drive account, and BigQuery tables need to be set up that contain the data in these sheets. This can be done by first linking a BigQuery table to the sheet, then querying for the entire contents of the table and saving as a regular BigQuery table for simpler authentication. Google credentials provide access to BigQuery (where the data tables are written and stored) as well as Google Sheets (where the repository lists are stored). Google credentials are stored in a JSON key. GitHub credentials are stored as an OAuth token. For the pipeline, Google and GitHub credentials are defined in the config file, not included in the public GitHub repository for the project. R code uses the "bigrquery" package [1], which prompts for Google login credentials and saves an OAuth token. The GitHub API is not accessed from any of the R code.

Reproducibility
All data extracted from the GitHub API, with modifications described in the next paragraphs, are available at https://doi.org/10.17605/OSF.IO/UWHX8. Although all of the repositories studied are public on GitHub and were announced in published articles, many do not include explicit open source licenses. Therefore, the actual contents of the repositories cannot be included as a public dataset along with this paper. Instead, we have recorded the specific Git version of each file so that the exact dataset can be regenerated by other individuals. This table, provided as "file_info_main.csv" and "file_info_high_profile.csv", contains the Git URL pointing to the specific record for each file. Readers who wish to reproduce the exact dataset for the paper can copy each file info table to a BigQuery table (in a respective BigQuery project for the main or high-profile dataset) named "repos:file_info" and continue with file content extraction in the Perl pipeline by setting "generate_file_contents" to true in the config file (skipping "generate_file_info", which would create the "repos:file_info" table for current versions of repository contents).
Similarly, the GitHub Terms of Service prohibit sharing of personal identifying information including names and e-mail addresses. (We used names for the gender analysis.) We have published commit records as "commits_main.csv" and "commits_high_profile.csv" with identifying information removed. Each published record includes an API reference for the commit so the full record can be reconstructed from the GitHub API if needed.
All of the other data extracted from the GitHub API are also included at https://doi.org/10.17605/OSF.IO/UWHX8; these can similarly be uploaded to BigQuery and the data extraction steps in the Perl pipeline can be skipped. Readers who wish to generate a dataset but do not require the exact dataset used for this paper can follow the entire Perl pipeline, which will capture the current state of repositories.

Supplemental Section 2: Identification of bioinformatics repositories on GitHub
We identified GitHub repositories containing bioinformatics code by (1) identifying published journal articles containing mentions of GitHub, (2) manually selecting bioinformatics articles from among these results, (3) automatically extracting names of GitHub repositories within the articles, and (4) manually curating the GitHub repositories using surrounding context in the articles.

Motivation for search strategy
We settled on this strategy after several previous iterations brought various issues to light. We initially considered searching GitHub itself for the term "bioinformatics", but quickly realized that (1) the resulting set of repositories shared no common benchmark making them comparable to one another, such as having reached the point of a publication, and (2) many or most bioinformatics repositories do not explicitly mention the term "bioinformatics". Because of these issues, we decided to use repositories that had been prepared and theoretically vetted along with peer-reviewed publications; additionally, this choice allowed us to analyze article metadata as well.
After deciding to use repositories that had been published along with papers, we attempted several automatic search strategies. A PubMed search for the term "GitHub" revealed the fact that a large proportion of the search results would not be considered "bioinformatics" articles by our community. To give a few very brief examples, non-bioinformatics topics we discovered included radiomics, pure statistics, mathematics, and articles about GitHub having no connection to biology. We decided that it would be unacceptable to include these non-bioinformatics articles. We then attempted to use machine learning to automatically classify article abstracts as bioinformatics or not, using a hand-labeled set of hundreds of abstracts. The best classifier we trained was approximately 90% accurate and still admitted sufficiently many non-bioinformatics articles to be unacceptable. We therefore decided that it would be necessary to hand-label each article as bioinformatics or not. We recognized the compromise of this decision: we would have to limit the size of the dataset to accommodate hand labeling, but believed that this would be preferable to including many irrelevant repositories that could bias the analysis.
We designed a literature search strategy (described in detail in "Literature Search" below) that would be as unbiased as possible while returning a number of results that could be reasonably hand-labeled by two members of our group. We identified 2,679 articles containing the term "GitHub" in the full text, and labeled 1,950 of these as "bioinformatics".
We automatically extracted the names of 3,188 GitHub repositories mentioned in the 2,679 articles (described in "Automatic extraction of repository names from articles" below). Again, we quickly realized that many of these repositories should not be included in our dataset (even if the article was a bioinformatics article), the most common reason being that authors would mention repositories they had had no part in developing. (An important goal of our analysis was to analyze repositories along with the paper in which they were first published.) We therefore manually labeled each GitHub repository as being part of the work described in the paper or not.
Finally, as briefly mentioned above, part of our analysis depended on identifying the single paper that first announced each repository. Therefore, in the small number of cases where a repository was mentioned in multiple distinct papers, we used the original article announcing the repository (described in "Manual curation of repository names" below). Table S4 lists the labels associated with each step of this decision process for each article and repository name.

Literature search
The goal of the literature search was to identify and collect articles that mention the term "GitHub" in the title, abstract or anywhere in the full text. Because the primary biomedical databases (PubMed, Embase) only contain article metadata (title, abstract, etc.) and not full text, we took a two-pronged approach. We designed a search strategy that looked for the term "GitHub" in the title/abstract, but also included computer programming terminology that could indicate the use of "GitHub" in the full text. To identify the relevant proxy language to search for, we used three approaches. First, we looked at citations that did indeed mention "GitHub" and harvested relevant associated or surrounding language. Second, we ran the search "GitHub" in PubReMiner to look for other text or MeSH terms that frequently appeared in citations that mention "GitHub". Finally, we relied on our content knowledge to identify other relevant terms.
Because of the laborious nature of locating and processing large batches of full-text PDFs, the number of search results needed to remain manageable. The search performed in Embase was the following: "((algorithm:ti,ab OR toolkit:ti,ab) AND (code:ti,ab OR software:ti,ab)) OR (analysis:ti,ab AND next:ti,ab AND (framework:ti,ab OR pipeline:ti,ab OR software:ti,ab OR tool:ti,ab)) OR 'command line':ti,ab OR ((framework OR 'freely available' OR pipeline OR 'publicly available' OR workflow) NEAR/4 (code OR software)):ti,ab OR github*:ti,ab OR 'open source':ti,ab OR 'programming language':ti,ab OR (software:ti,ab AND next:ti,ab AND (application:ti,ab OR framework:ti,ab OR package:ti,ab OR pipeline:ti,ab OR program:ti,ab OR suite:ti,ab OR tool:ti,ab)) OR 'source code':ti,ab OR 'web app*':ti,ab AND [2008-2017]/py". Due to a searcher error, not identified until after citation processing and analysis, the intended adjacency operator "next" was treated as a text word by Embase and "AND"-ed together with other search terms within the respective scope. This error did not affect the search for "GitHub" in the title/abstract. Rather, it required the word "next" to appear with the associated computer/software terms. This limited overall search results, but kept the search within the manageable 30,000 citation range.
All citations were exported to EndNote X7. EndNote's full-text harvesting tool was used to batch harvest PDFs. No manual harvesting of PDFs was performed. 18,764 full-text PDFs were located by EndNote. All citations containing "GitHub" in the title/abstract and all located PDFs were exported for external programmatic analysis. We identified 2,679 articles containing the case-insensitive term "GitHub" somewhere in the full text.

Definition of bioinformatics
A detailed list of bioinformatics topic categories was compiled. First, the published scope of the journal Bioinformatics was downloaded on 25 June 2017 from [2]. Each category in the journal scope, along with its detailed description, was included. Second, a few additional categories were taken from the Wikipedia article on "Bioinformatics" on 25 June 2017 [3] (stable URL).
Finally, an additional topic "Pipelines, wrappers, extensions, and utilities" was included to capture these software papers. Descriptions of each category taken from their sources are in Table S1.

Manual curation of bioinformatics articles
Each article identified in the literature search that contained the term "GitHub" in the full text was manually evaluated to determine if its contents pertained to bioinformatics topics. The set of articles was divided into two subsets and each subset was evaluated by one person (R.J. and P.R.) due to the large time commitment involved. For each article, the title and abstract were examined. The article was classified as "bioinformatics" if the title or abstract treated at least one of the topics in the definition of bioinformatics. The results of the manual classification are presented in Table S2.

Automatic extraction of repository names from articles
Repository names were automatically extracted from all articles identified in the literature search, including those not identified as "bioinformatics". The script was run through the Perl pipeline by setting "extract_repos_from_lit_search" to true in the config file. Briefly, the operation of the script is as follows. The XML files of article metadata exported from EndNote were parsed and metadata for all articles were extracted. For each article, first, an attempt was made to identify repository names in the abstract by searching for and parsing matches to one of a set of regular expressions. If repository names were found in the abstract, these were returned and no attempt was made to analyze the full text. If no repository names were found in the abstract, the full text PDF was analyzed using the Python package pdfminer [4]; matches to the same regular expressions were identified and parsed to extract repository names. The script saved the results to a table on BigQuery and this table was saved as a Google Sheet. The table is available as Table S3.
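The abstract-first extraction logic can be illustrated with a minimal Python sketch. The regular expression and the function names "extract_repo_names" and "repos_for_article" are assumptions made for illustration; the expressions used by the actual script are not reproduced here.

```python
import re

# Hypothetical pattern for "github.com/<user>/<repo>"; NOT the pipeline's own expression.
GITHUB_REPO_PATTERN = re.compile(
    r"github\.com/([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def extract_repo_names(text):
    """Return unique "user/repo" names found in text, in order of appearance."""
    names = []
    for user, repo in GITHUB_REPO_PATTERN.findall(text):
        repo = repo.rstrip(".")  # drop a sentence-final period captured by the pattern
        if repo.endswith(".git"):
            repo = repo[:-4]  # drop a clone-URL suffix
        name = f"{user}/{repo}"
        if repo and name not in names:
            names.append(name)
    return names


def repos_for_article(abstract, full_text):
    """Search the abstract first; fall back to the full text only if nothing is found."""
    found = extract_repo_names(abstract)
    return found if found else extract_repo_names(full_text)
```

A pattern this simple would be prone to the same kinds of errors noted during manual curation, such as names broken by hyphenated line breaks in PDF text.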

Manual curation of repository names
Spreadsheet for manual curation of repository names.
The final curation is presented as Table S4. To create this table, the manual curation of bioinformatics articles was joined to the automatic extraction of repository names from articles to identify automatically extracted repository names contained in bioinformatics articles. A Google spreadsheet was created containing the join. From that point, this spreadsheet was manually adjusted (Table S4 contains a sheet "field_definitions" that defines the columns and explains which columns have been manually modified). The column "use_repo" in the spreadsheet contains the ultimate directive of whether the repository was to be included in the final dataset or not, and could be manually set for various reasons described below. The complete definition of the logic in this column is provided in "field_definitions".

Manual deduplication of repository names.
Duplicate repository names were identified. Duplicates could occur when the same article was returned multiple times by the literature search, leading to multiple EndNote records. This could also happen if the same repository was mentioned in multiple different articles. In these cases, records were manually deduplicated. If the same article was returned from multiple databases, the PubMed record was kept and the other records were deleted (it was always possible to keep a PubMed record). If the same article was in the same database with two different dates, the earlier record was kept in "use_repo" and the later record was not used. In more complex cases, such as multiple distinct articles mentioning the same repository, the articles were manually examined and at most one article was set to be used in "use_repo"; the article chosen was the one originally announcing the repository.

Manual checking and correction of repository names.
For each repository name in a bioinformatics article, the surrounding context of the abstract or article was manually examined to determine if the repository contained code for the article, as opposed to the article mentioning an outside repository. This determination was manually entered in the column "repo_from_pdf_is_code_for_bioinf_paper" of the spreadsheet provided as Table S4. If a repository name had been discovered from the article abstract, only the abstract was examined during this manual process; the PDF was not examined. If a repository name had been discovered from the full text PDF, the entire PDF was examined. In some cases, errors in repository names were discovered during this manual curation process; these were manually fixed where possible. (For example, errors in repository names could be caused by ambiguously hyphenated line breaks or special formatting such as indented bullets in the PDF, or missing spaces after the repository name in the abstract downloaded from a literature database.)

Special issues for PDFs.
For PDFs in which at least one repository name was automatically detected, additional repositories not identified by the automatic process were added by searching for the string "github"; these could have been missed by the automatic script due to the issues mentioned previously. PDFs for which no repository names were automatically identified were not manually examined and therefore did not contribute any repository names to the final dataset. Thus, for any repository names to be included from a PDF, at least one repository name needed to have been identified automatically.

Special situations during manual curation.
• In some cases, if multiple repositories were mentioned in an article and it was impossible to tell from context whether the repositories were developed by the article authors, we viewed the repositories on github.com to evaluate contributors to the repositories.
• Articles published in the journal F1000 often include pointers to two code repositories: one containing stable frozen code at the time of publication and another containing the development version. In these cases, the development version was used and the stable version at the time of publication was not used, for consistency with other repositories that are all theoretically development versions.
• BioJava [5] and BioJS [6] are open source projects that collect multiple components from different contributors under a single parent GitHub repository (biojava/biojava and biojs/biojs, respectively). Components of these projects were not used because our analysis is performed at the repository level, and components of these projects are subdirectories under a common repository.

Checking validity of repository names
After the manual curation of repository names, a script was run to verify the current existence of repositories marked to be used in the final dataset. The script was run in the Perl pipeline by setting "check_repo_existence" to true in the config file. The script prints a list of repositories whose existence could not be verified through the GitHub API. Most of these turned out to have moved or changed names; these repository names were manually corrected. This step also revealed more repository names containing errors due to the automatic parsing of the abstract or PDF; these were manually corrected. Repositories that could not be found at all were not used. After the manual modifications in this step, no issues with repository names were identified by the script.
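A minimal sketch of such an existence check is shown below. It uses the Python standard library rather than the PycURL library used by the pipeline, and the names "API_ROOT", "http_status", and "unverified_repos" are hypothetical. Note that urllib follows redirects, so a renamed repository may be reported as existing here; the actual script's redirect handling is not described in this text.

```python
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com/repos/"


def http_status(url, token=None):
    """Return the HTTP status code of a GET request to url."""
    request = urllib.request.Request(url)
    if token:
        # OAuth token sent in the Authorization header
        request.add_header("Authorization", f"token {token}")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def unverified_repos(repo_names, get_status=http_status):
    """Return repository names whose API endpoint does not answer with status 200."""
    return [name for name in repo_names if get_status(API_ROOT + name) != 200]
```

Passing the status function as a parameter keeps the filtering logic testable without network access.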

Identification of high-profile bioinformatics repositories
In addition to the repositories identified through the literature search, we curated a set of "high-profile" projects: highly respected and well-known tools in the bioinformatics community.
Most of these projects were not identified in the literature search. In many cases, high-profile projects were not hosted on GitHub at the time of publication. These projects also could have been absent from the literature search because the papers did not mention GitHub, because the papers did not match the heuristics used in the search, or because the code is not publicly available.
To avoid subjective judgements or omissions of popular tools, we chose to define high-profile projects as those generating a high volume of discussion in the leading online forum for discussion of bioinformatics topics, Biostars [7]. We accessed Biostars on 10 February 2018, compiling a list of standalone software tools that had been tagged in posts at least 100 times.
We chose to draw the boundary at standalone tools because this provided a clean criterion we could use to judge the sometimes ambiguous Biostars tags, but acknowledge that our chosen criterion excludes a few popular libraries and conglomerations such as Bioconductor and Galaxy. The list included 27 tools. Through a manual web search, we were able to identify a primary GitHub repository hosting the code for 21 of these tools. Four tools do not appear to be hosted publicly on GitHub, while two tools are included under another repository already in the set of 21. In one case, Samtools [8], the project was spread across multiple GitHub repositories and we curated three repositories containing the main code for the project, bringing the number of repositories to 23. Three high-profile repositories (alexdobin/STAR, bcgsc/abyss, and chrchang/plink-ng) are also in the dataset curated from the literature search; they are included in both sets for analysis. We performed a manual search to identify the original publications describing each project; we were able to find publications for 21 of the 23 repos, while two remain unpublished. Details are presented in Table S5. This set of 23 repositories is referred to as the "high-profile" dataset, while the set identified through the literature search is referred to as the "main" dataset.

Extraction of article metadata
Metadata for articles associated with each repository were extracted from NCBI databases using the RISmed R package [9] with the script src/R/ncbi/paper_metadata_eutils.R. The script was run through the Perl pipeline separately for the main and high-profile datasets by setting "query_eutils_article_metadata" to true in the config file. Metadata retrieved include database IDs, journal information, funding information, relevant dates, article abstract, and number of citations in PubMed Central, the National Library of Medicine's archive of open access full-text biomedical and life sciences articles.

Supplemental Section 3: Extraction of repository data from GitHub API
Several types of data were extracted from the GitHub REST API v3 [10]; each is described in a subsection below. These scripts were run separately for the main and high-profile datasets; each dataset was stored in a separate BigQuery project. The BigQuery projects and empty datasets within each project were created manually in the BigQuery web interface. Data for the high-profile dataset were extracted approximately four months after the main dataset.

Workflow components common to all data types
Python scripts were used to obtain each type of data. All scripts use the gspread library [11] to read the list of repository names from the Google Sheet containing the manual curation of repositories (Table S4). All scripts make Curl requests to the GitHub API using the PycURL library [12] and parse the JSON responses to convert information to flat records. All scripts push data to tables in Google BigQuery using the BigQuery-Python library [13].

Repository-level metrics
Repository-level metrics were extracted from the GitHub Repositories API [10] and pushed to a BigQuery table by the script src/python/gh_api_repo_metrics.py. The script was run through the Perl pipeline by setting "generate_repo_metrics" to true in the config file. Repository-level metrics include (1) repository name, (2) GitHub API URL for the repository,

File information
Information on individual files contained in each repository was extracted from the GitHub Contents API [10] and pushed to a BigQuery table by the script src/python/gh_api_file_info.py.
The script was run through the Perl pipeline by setting "generate_file_info" to true in the config file. Recursive requests were constructed in order to access the entire directory structure of each repository. Information for regular files and symbolic links was retrieved. Submodules were not included because these often contain code not developed by the authors of the main repository. Information retrieved for each file includes (1) repository name, (2) file name, (3) file path, (4) file SHA-1 hash, (5) file size, (6) GitHub API URL for the file, (7) HTML URL for the file, (8) Git URL for the file, (9) download URL for the file, (10) file type, (11) SHA-1 hash of the most recent commit reference to the master branch, and (12) time at which the information was accessed.
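The recursive traversal can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: "list_files" and the caller-supplied "fetch_json" function are hypothetical, and only a subset of the retrieved fields is shown. The GitHub documentation notes that directory listings can report submodules with type "file", so a real implementation would need an additional check that this sketch omits.

```python
def list_files(repo_name, fetch_json, path=""):
    """Recursively collect records for regular files and symbolic links.

    fetch_json is a caller-supplied (hypothetical) function that takes a
    Contents API URL and returns the parsed JSON listing for a directory.
    """
    url = f"https://api.github.com/repos/{repo_name}/contents/{path}"
    records = []
    for entry in fetch_json(url):
        if entry["type"] == "dir":
            # Descend into subdirectories to cover the whole tree
            records.extend(list_files(repo_name, fetch_json, entry["path"]))
        elif entry["type"] in ("file", "symlink"):
            records.append({
                "repo_name": repo_name,
                "file_name": entry["name"],
                "path": entry["path"],
                "sha": entry["sha"],
                "size": entry["size"],
                "git_url": entry["git_url"],
            })
        # Any other entry type (e.g. "submodule") is skipped
    return records
```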

File creation dates
Initial commit timestamps for each file were extracted from the GitHub Repositories API [10] and pushed to a BigQuery table by the script src/python/gh_api_file_init_commit.py. The script was run through the Perl pipeline by setting "generate_file_init_commits" to true in the config file.
Commits affecting each file were accessed via the repository name and path as stored in the file information table; the oldest time at which a committer committed the file was stored.
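Selecting the oldest committer timestamp from a list of commit records might look like the sketch below. It assumes each record has the nested "commit", "committer", "date" structure returned by the GitHub commits endpoint, with timestamps in UTC ending in "Z"; the function name "initial_commit_time" is hypothetical.

```python
from datetime import datetime


def initial_commit_time(commits):
    """Return the earliest committer timestamp among commit records, or None if empty."""
    times = [
        datetime.strptime(c["commit"]["committer"]["date"], "%Y-%m-%dT%H:%M:%SZ")
        for c in commits
    ]
    return min(times) if times else None
```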

File contents
Contents of individual files were extracted from the GitHub Repositories API [10] and pushed to a BigQuery table by the script src/python/gh_api_file_contents.py. The script was run through the Perl pipeline by setting "generate_file_contents" to true in the config file. File contents were accessed via their Git URL as stored in the file information table, so that records in the two tables refer to exactly the same versions of each file. This was important due to the duration of time needed to extract all the file contents. Information retrieved for each file includes (1) repository name, (2) file name, (3) file path, (4) file SHA-1 hash, (5) Git URL for the file, (6) file contents, and (7) time at which the information was accessed. File contents were decoded from the Base64 encoding returned by the GitHub API. Files whose contents exceed 999KB in size were included in the results table but contents were marked as "null" due to the 1MB row size limit in BigQuery and also the fact that almost none of these files contain source code.
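The decoding and size-limit handling can be sketched as follows. The byte value chosen for the "999KB" threshold, the replacement of undecodable bytes, and the function name "decode_contents" are assumptions made for illustration; the record is assumed to carry the "size" and Base64 "content" fields of a Git blob response.

```python
import base64

# Assumed reading of "999KB"; the exact byte threshold used is not specified.
MAX_CONTENT_BYTES = 999 * 1000


def decode_contents(blob):
    """Return the decoded text of a blob record, or None if the file is too large."""
    if blob["size"] > MAX_CONTENT_BYTES:
        return None  # stored as null to respect the BigQuery row size limit
    raw = base64.b64decode(blob["content"])  # embedded newlines are ignored by default
    return raw.decode("utf-8", errors="replace")
```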

Licenses
Repository licenses were extracted from the GitHub Repositories API [10] and pushed to a BigQuery table by the script src/python/gh_api_licenses.py. The script was run through the Perl pipeline by setting "generate_licenses" to true in the config file. For each repository, information extracted included (1) repository name, (2) license, (3) SHA-1 hash of the most recent commit reference to the master branch, and (4) time at which the information was accessed. License information is returned by the API when it can be detected from the repository's license file.
Repositories without a detectable license were recorded as "null" in the BigQuery table.

Note on iterating through files
We needed to pull down the contents of each file from our contents table in BigQuery and save it to a local file in order to analyze it with cloc (Supplemental Section 5). Although the Google Cloud API supports iterating through records in a BigQuery table, there is a limit on record size that was exceeded by many of our contents records. Therefore, we exported the contents table to multiple CSV files on Google Cloud Storage; our analysis script downloaded these CSV files locally one at a time to analyze the subset of files contained therein. Therefore, people utilizing our analysis code would need to replicate the process of saving the contents table to multiple CSV files in Google Cloud Storage.
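Once the CSV files have been downloaded, iterating over them one at a time could be sketched as below. The download from Google Cloud Storage is not shown, and the file naming pattern, column layout, and function name "iter_content_records" are hypothetical.

```python
import csv
import glob
import os


def iter_content_records(directory, pattern="contents-*.csv"):
    """Yield rows (as dicts) from each exported CSV file, one file at a time."""
    # File contents can exceed the csv module's default field size limit
    csv.field_size_limit(10**9)
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)
```

Because the function is a generator, only one row is held in memory at a time, which matters when individual records are large.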

Supplemental Section 4: Topic modeling of article abstracts
We used machine learning to infer topics for abstracts of the articles announcing each repository in the main dataset. Abstracts for the single curated article for each repository were obtained from the EndNote metadata (see Supplemental Section 2). Treating each abstract as a document, we created a latent Dirichlet allocation (LDA) model [14] using the "topicmodels" R package [15] and following the workflow in [16]. In LDA, the symbol β refers to the probability of a given term being generated from a given topic, and γ is the probability that a given document comes from a given topic. From the LDA model, we identified terms whose β value for their top topic was at least four times larger than the second highest topic. We manually examined the top terms for each topic from this list of topic-specialized terms. We tried several values for k in the model (the number of topics) and chose k = 8 for further analysis due to its maximal coherence of concepts within the top terms. We manually assigned a label to each of the eight topics that captures a summary of the top terms. We classified each article abstract into one or more topics by taking all pairs of abstracts and topics with γ equal to at least 0.25. The topic modeling analysis and figures (Fig 2, Fig A, Fig B, Fig C, Fig D) were generated in paper/scripts/topics.Rmd.
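The two thresholds applied to the fitted model can be illustrated in Python, although the analysis itself was performed in R. The sketch assumes β and γ have already been estimated and are available as dictionaries of per-topic probabilities; the function names and data layout are hypothetical.

```python
def specialized_terms(beta, min_ratio=4.0):
    """Map each topic-specialized term to the index of its top topic.

    beta maps term -> list of per-topic probabilities. A term qualifies when
    its highest beta is at least min_ratio times its second highest beta.
    """
    result = {}
    for term, probs in beta.items():
        ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
        top, second = ranked[0], ranked[1]
        if probs[top] >= min_ratio * probs[second]:
            result[term] = top
    return result


def assign_topics(gamma, min_gamma=0.25):
    """Map each document to the topic indices whose gamma is at least min_gamma.

    gamma maps document -> list of per-topic probabilities.
    """
    return {
        doc: [k for k, g in enumerate(probs) if g >= min_gamma]
        for doc, probs in gamma.items()
    }
```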

Supplemental Section 5: Programming languages
We attempted to identify a programming language, count lines of code and comment, and extract comment-stripped source code for each file. The script src/python/cloc_and_strip_comments.py calls the tool cloc (version 1.72) [17] to analyze the contents of each file in each repository. The script was run through the Perl pipeline by setting "run_cloc" to true in the config file. For each file, cloc attempts to identify the programming language, number of lines of code, number of comment lines, number of blank lines, and comment-stripped source code. Files with extensions indicating they did not contain source contents were duplicated in the dataset, usually appearing multiple times in the same repository with different paths and/or file extensions. In cases where cloc identified different language or line counts for these duplicate files (probably due to file extension heuristics used in cloc), all copies of the file were skipped. A similar filtering was performed on the comment-stripped code results from cloc. Results from cloc were saved to tables in BigQuery. This information was joined to other file metadata with the script src/python/run_bq_queries_analysis.py by setting "run_bq_analysis_queries" to true in the config file for the Perl pipeline.
Language execution modes were obtained from [18]. Type systems were obtained from [19], and due to the absence of the popular language R from this table, R was manually added and labeled as "dynamic" and "unsafe". In order for the information to match the programming languages assigned to our data by cloc, in some cases language information records were copied to match the language names returned by cloc. These tables, provided as Table S6 and Table S7, were saved as Google Sheets. In order to reproduce the results in the paper, the tables must be copied to tables in BigQuery using the same procedure described in Supplemental Section 1.

Supplemental Section 6: Developer communities
We identified the number of commit authors and outside contributors (commit authors who are never committers) in the commit records for each repository. For commit authors, we attempted to count unique people by collapsing users with the same name or login, as individuals can contribute to a repository under multiple aliases (for example, from multiple devices with different default name settings). For outside contributors, we counted commit authors whose author ID is never a committer ID for the repository. Counts of commit authors and outside contributors were calculated in paper/scripts/repo_features.R. The counts of forks, subscribers and stargazers were extracted directly from the GitHub API in src/python/gh_api_repo_metrics.py by setting "generate_repo_metrics" to true in the config file for the Perl pipeline.
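The definition of outside contributors reduces to a set difference, sketched below in Python for illustration; the counts in the paper were calculated in R. The field names "author_id" and "committer_id" and the function name are hypothetical, and the collapsing of aliases for commit authors is not shown.

```python
def count_outside_contributors(commits):
    """Count author IDs that never appear as a committer ID in the commit records."""
    author_ids = {c["author_id"] for c in commits if c["author_id"] is not None}
    committer_ids = {c["committer_id"] for c in commits if c["committer_id"] is not None}
    return len(author_ids - committer_ids)
```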

Inferring genders
The script src/R/gender/infer_gender.R attempts to infer a gender for each commit author, committer, and paper author in the dataset, then pushes the results to a BigQuery table. The script was run for the main and high-profile datasets through the Perl pipeline by setting "infer_gender" to true in the config file. We used the Genderize.io API [20], which is a paid service above a certain usage rate; an API key is required for the script to function. Genderize accepts a first name and optional language and country, and returns a gender call along with the estimated probability of correctness. Although many GitHub users provide their geographic location as a free-form text field and articles include academic affiliations for authors, we chose not to use this information because (1) many developers and researchers do not live in their home country, making this information potentially misleading, and (2) it is challenging to convert free-form text to uniform country codes. The result of this decision is that we lack gender calls for some ambiguous names that could possibly be resolved by adding accurate geographic information. We note that we were only able to obtain author lists for 1,573 articles for the main dataset (covering 1,658 repositories) and 18 articles for the high-profile dataset (covering 21 repositories) (see Extraction of article metadata), and that some author lists were not usable for gender analysis because they list first initials only. We did not use the original EndNote citations for author gender because they included first initials only.
We submitted first names (the first word before whitespace) to Genderize and accepted gender calls with a worldwide probability of 0.8 or higher. The main dataset contains 13,425 unique strings in the "author name" and "committer name" fields of the commit records and the author names of articles. Several cleaning steps reduced this to 9,286 strings that were likely to represent full names as opposed to other information such as usernames or e-mail addresses.
Of these, we were able to confidently infer a gender for 7,747 unique names. Similarly, the high-profile dataset contains 1,145 unique names, 881 after cleaning, and 775 for which we were able to infer a gender. We note that, based on manual observation, there may be a slight bias against identifying genders for non-anglophone names. We also note that a few individuals appear to be in the dataset more than once with different ways of writing their name, but these are very rare. We were able to confidently infer a gender for 83.4% of cleaned names in the main dataset and 88.0% of cleaned names in the high-profile dataset.
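The acceptance rule and the reported inference rates can be illustrated with a short sketch. It does not call the Genderize service; it assumes a response has already been parsed into a dictionary with `gender` and `probability` fields, which is our understanding of the API's response shape and should be checked against its documentation. The function names are hypothetical.

```python
def first_name(full_name):
    """Return the first whitespace-delimited word of a name, or None if empty."""
    parts = full_name.split()
    return parts[0] if parts else None


def accept_gender_call(response, min_probability=0.8):
    """Return the gender call if its probability meets the threshold, else None.

    response: dict assumed to contain 'gender' (may be None) and 'probability'.
    """
    gender = response.get("gender")
    probability = response.get("probability") or 0.0
    if gender is None or probability < min_probability:
        return None
    return gender


def inference_rate(n_inferred, n_cleaned):
    """Percentage of cleaned names with an accepted gender call."""
    return round(100.0 * n_inferred / n_cleaned, 1)


name = first_name("Maria  Lopez Garcia")
accepted = accept_gender_call({"name": "maria", "gender": "female", "probability": 0.98})
rejected = accept_gender_call({"name": "robin", "gender": "male", "probability": 0.62})
main_rate = inference_rate(7747, 9286)
high_profile_rate = inference_rate(775, 881)
```

The last two values reproduce the percentages reported above from the counts of cleaned and inferred names in each dataset.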

Analysis of developer and author gender
Code for this analysis is in paper/scripts/gender.Rmd, which also created Fig 5.

Developers, commits, and paper authors by gender
For the gender breakdown of developers, we counted unique full names of authors and committers, collapsing people with the same name or login, and ignoring other identifying information such as email address. Although we could theoretically be falsely collapsing multiple individuals with the same name, we find that it is much more common for the same individual to exist in the dataset with multiple aliases. For commits, we joined commit records to genders by the full name of the commit author, and counted individual commits. For paper authors, we counted individual authorships on papers announcing the repositories.

Team composition
We analyzed team composition for the 504 projects in the main dataset for which we could infer a gender for at least 75% of developers (collapsing developers with the same name or login) and 75% of paper authors. We analyzed diversity for the 602 repositories in the main dataset for which we could infer a gender for at least 75% of developers. We defined team types as "solo female" if the team consisted of one woman, "solo male" if the team consisted of one man, "all female" if no individuals were identified as male (individuals with no gender call may actually be male), "all male" if no individuals were identified as female, "majority female" if more individuals were identified as female than male, "majority male" if more individuals were identified as male than female, and "equal" if the same number of female and male individuals were identified.
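The team type definitions can be written as a small decision function. This is a sketch under stated assumptions, not the code in paper/scripts/gender.Rmd: the order in which the rules are applied, the treatment of a "solo" team as one with exactly one member in total, and the `None` result for teams with no gender calls are all choices made for the example.

```python
def team_type(n_female, n_male, n_unknown=0):
    """Classify a team from counts of female, male and uncalled individuals.

    Rule precedence (solo, then all-one-gender, then majority, then equal)
    is an assumption; the text lists the categories without an order.
    """
    total = n_female + n_male + n_unknown
    if total == 1 and n_female == 1:
        return "solo female"
    if total == 1 and n_male == 1:
        return "solo male"
    if n_female == 0 and n_male == 0:
        return None  # no gender calls at all; not covered by the definitions
    if n_male == 0:
        return "all female"  # uncalled individuals may in fact be male
    if n_female == 0:
        return "all male"    # uncalled individuals may in fact be female
    if n_female > n_male:
        return "majority female"
    if n_male > n_female:
        return "majority male"
    return "equal"


examples = {
    "one woman": team_type(1, 0),
    "three men, one uncalled": team_type(0, 3, 1),
    "two women, one man": team_type(2, 1),
    "two and two": team_type(2, 2),
}
```

Note that individuals without a gender call affect only the solo categories here; the remaining categories depend solely on the counts of identified women and men, as in the definitions above.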

Gender diversity
We quantified gender diversity using the Shannon index [21]. The Shannon index was originally developed to quantify entropy in information theory and has been widely used across a variety of scientific disciplines to measure diversity of categories within a set or population, including being used to quantify gender diversity in the social sciences [22,23]. We calculated the Shannon index for gender diversity within developer teams (defined as the set of unique individuals contributing to a particular repo) and within commits (defined as the gender of the author of each individual commit to a repo, where individual authors are counted once per commit).
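For reference, the Shannon index of a set of category labels is H = -Σ p_i log(p_i), where p_i is the proportion of items in category i. The sketch below computes it for a team and for the same team's commits. The natural logarithm and the example labels are assumptions; the text does not state the logarithm base or how individuals without a gender call were handled.

```python
import math
from collections import Counter


def shannon_index(labels):
    """Shannon index of an iterable of category labels (natural log assumed)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Team diversity: each unique developer is counted once.
team = ["female", "male"]
team_diversity = shannon_index(team)

# Commit diversity: each commit contributes the gender of its author,
# so a developer who commits more often carries more weight.
commit_authors = ["male", "male", "male", "female"]
commit_diversity = shannon_index(commit_authors)
```

In this example the team is evenly split, giving the maximum index for two categories (log 2), while the commit-level index is lower because most commits come from one developer.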

Supplemental Section 8: Commit dynamics
We defined project duration as the time span between the first and last commit timestamps (author commit date) for the repo at the time we extracted the data. This was accomplished by the script src/python/run_bq_queries_analysis.py. The script was run through the Perl pipeline by setting "run_bq_analysis_queries" to true in the config file. We identified the initial commit time for each file by taking the earliest timestamp of all commits touching the file; this was accomplished with the script src/python/gh_api_file_init_commit.py by setting "generate_file_init_commits" to true in the config file. Metrics describing monthly activity (mean commits per month, max consecutive months with and without commits, mean new files per month) are with respect to the number of months in the project duration. These were calculated in paper/scripts/repo_features.R. Fig 6 was created in paper/scripts/analysis.Rmd.
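The monthly activity metrics can be sketched as follows. This is an illustration rather than the code in paper/scripts/repo_features.R. In particular, counting the months in the project duration as calendar months from the first to the last commit, inclusive, is an assumption made for the example.

```python
from datetime import datetime


def month_index(ts):
    """Map a timestamp to a consecutive integer per calendar month."""
    return ts.year * 12 + (ts.month - 1)


def monthly_activity(commit_times):
    """Summarize commit activity for a non-empty list of commit timestamps."""
    times = sorted(commit_times)
    first, last = times[0], times[-1]
    # Assumed definition: calendar months spanned, inclusive of both ends.
    n_months = month_index(last) - month_index(first) + 1
    active_months = {month_index(t) for t in times}

    longest_with = longest_without = 0
    run_with = run_without = 0
    for month in range(month_index(first), month_index(last) + 1):
        if month in active_months:
            run_with += 1
            run_without = 0
        else:
            run_without += 1
            run_with = 0
        longest_with = max(longest_with, run_with)
        longest_without = max(longest_without, run_without)

    return {
        "duration": last - first,
        "months": n_months,
        "mean_commits_per_month": len(times) / n_months,
        "max_consecutive_months_with_commits": longest_with,
        "max_consecutive_months_without_commits": longest_without,
    }


summary = monthly_activity([
    datetime(2017, 1, 5),
    datetime(2017, 1, 20),
    datetime(2017, 2, 1),
    datetime(2017, 5, 10),
])
```

For these four commits the project spans five calendar months, with two consecutive active months (January and February) followed by two inactive ones (March and April).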

Supplemental Section 9: Proxy for project impact
We defined the variable "commits after publication" to be true if the latest commit timestamp at the time we accessed the data was after the day the associated article appeared in PubMed.

[14]. β represents the probability of a given term being generated from a given topic. The figure shows top terms that are sufficiently exclusive to each topic. For each topic, the listed terms have the top ten β values such that β is at least four times the β value of the second highest topic for the term. (For example, the term "data", which has high β values for several topics, is excluded.) The reported number of repositories for each topic is the number of articles whose abstract has a γ value (probability of coming from the topic) of at least 0.25; articles may be associated with more than one topic. The topic labels were designated manually after examining the top terms. The figure was created in paper/scripts/topics.Rmd.
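The rule for selecting exclusive top terms can be sketched as follows. This is one reading of the caption, not the code in paper/scripts/topics.Rmd: a term is kept for its highest-β topic only if that β is at least four times the term's second highest β across topics, and each topic then lists up to ten such terms ranked by β. The nested-dictionary input and the example values are hypothetical.

```python
def exclusive_top_terms(beta, ratio=4.0, top_n=10):
    """Select up to top_n terms per topic that are exclusive to that topic.

    beta: dict mapping topic -> dict mapping term -> probability.
    Requires at least two topics. Terms missing from a topic count as 0.
    """
    topics = list(beta)
    terms = set().union(*(beta[t].keys() for t in topics))
    selected = {t: [] for t in topics}

    for term in terms:
        ranked = sorted(
            ((beta[t].get(term, 0.0), t) for t in topics), reverse=True
        )
        best_beta, best_topic = ranked[0]
        second_beta = ranked[1][0]
        # Keep the term only if it is sufficiently exclusive to its best topic.
        if best_beta > 0 and best_beta >= ratio * second_beta:
            selected[best_topic].append((best_beta, term))

    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:top_n]]
        for topic, pairs in selected.items()
    }


# Hypothetical beta values for two topics.
beta = {
    "assembly": {"genome": 0.08, "contig": 0.03, "data": 0.05},
    "networks": {"genome": 0.01, "pathway": 0.06, "data": 0.04},
}
top_terms = exclusive_top_terms(beta)
```

In this example "data" is excluded from both topics because its β values are similar across them, mirroring the exclusion described in the caption.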

Fig B. Programming languages and article topics in the main dataset.
Each repository is associated with the article that announced it. We ran topic modeling on article abstracts; see Supplemental Section 4. The size of each dot represents the total number of bytes of code in repositories in the main dataset whose corresponding article is associated with the given topic. Only languages included in at least 50 main repositories are displayed. Articles can be associated with more than one topic; in that case, the code is counted separately for each topic. The figure was created in paper/scripts/topics.Rmd.

Fig C. Article topics and journals in the main dataset.
The size of each dot represents the number of articles published in the given journal that are associated with the given topic. Only the ten most common journals are included. Articles can be associated with more than one topic; in that case, the journal is counted for each topic. The figure was created in paper/scripts/topics.Rmd.

Repositories with no detectable license are counted under "NA". The figure was created in paper/scripts/analysis.Rmd.

Fig K. Commit message content in the main dataset.
We evaluate whether commit messages contain error-related keywords as defined in [25]. Commits are presented according to their relative timing with respect to the publication of the associated article (negative times are before article publication). Each dot represents all commits across the entire dataset for a 10-day interval with respect to the publication date. The figure shows an increase in overall commits approaching paper publication, but no disproportionate increase in bug fix commits as defined in [25].
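The binning behind this figure can be sketched as follows. This is an illustrative outline only: the keyword tuple is a placeholder, since the actual error-related keywords are those defined in [25], and the input layout and function name are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Placeholder keywords for illustration; the real list is defined in [25].
PLACEHOLDER_KEYWORDS = ("bug", "fix", "error")


def bin_commits(commits, publication_date, bin_days=10,
                keywords=PLACEHOLDER_KEYWORDS):
    """Group commits into bin_days-wide intervals relative to publication.

    commits: iterable of (commit_date, message) pairs for one article.
    Returns a dict mapping bin index -> counts; negative bins precede
    publication (bin -1 covers the 10 days immediately before it).
    """
    bins = defaultdict(lambda: {"total": 0, "error_related": 0})
    for commit_date, message in commits:
        offset_days = (commit_date - publication_date).days
        bin_index = offset_days // bin_days  # floor division handles negatives
        bins[bin_index]["total"] += 1
        if any(keyword in message.lower() for keyword in keywords):
            bins[bin_index]["error_related"] += 1
    return dict(bins)


binned = bin_commits(
    [
        (date(2017, 5, 25), "Fix crash on empty input"),
        (date(2017, 5, 28), "Add README"),
        (date(2017, 6, 3), "Add docs"),
        (date(2017, 6, 12), "Bug in parser"),
    ],
    publication_date=date(2017, 6, 1),
)
```

Comparing the error-related count with the total in each bin gives the proportion needed to judge whether bug fix commits rise disproportionately near publication.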