Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO)

Abstract Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.

Introduction
Biocuration captures information from the primary literature in a computationally accessible fashion. The biocuration process generates annotations connecting experimental data with unique identifiers representing precisely defined ontology terms and logical relationships. While the majority of existing annotations are computational predictions built on knowledge from human biocuration, manually curated annotations from published experimental data are still the gold standard for functional annotations [1]. Universal access to well-curated databases, such as UniProt and those maintained by model organism consortia, allows scientists worldwide to leverage computational approaches to solve pressing biological problems. Assessing relationships in curated data can yield new insights into complex cellular processes such as autophagy, cell polarity, and division [2][3][4]. The Gene Ontology (GO; http://geneontology.org/) is an evolving biocuration resource that provides the framework for capturing attributes of gene products within 3 aspects or main branches: biological process, molecular function, and cellular component [5,6]. Importantly, comprehensive GO coverage allows connections to be made between model organism genes and human genes [7]. Additionally, GO data can be used to generate testable hypotheses in areas with little direct experimentation [8][9][10]. Application to high-throughput and systems biology, for instance, has led to insights and better methods for identification and analysis of the genes involved in cardiac and Alzheimer disease [11,12].
Without question, GO is a critical scientific resource, but manual annotation is an extremely labor-intensive process [13,14]. The pace at which the information is generated in the literature exceeds the capacity of professional biocurators to perform manual curation and the willingness of funding agencies to pay for a larger biocurator labor force [15]. Although the general Swiss-Prot protein database (https://www.uniprot.org/) model is one example of a scalable process for targeted manual and automated annotations from the (annotatable) literature, most fields are limited by low numbers of trained personnel and minimal participation from trained scientists [16,17]. The problem is most severe for communities studying organisms without a funded model organism database. Nevertheless, curation of the experimental literature from as many species as possible strengthens inference of function when there is substantial evolutionary conservation [18,19]. Several groups are developing tools to facilitate community engagement, such as the Gene Ontology Normal Usage Tracking System (GONUTS) site described here. These efforts stem from the realization that, while most scientists acknowledge the importance of data curation, it is hard to motivate individuals to volunteer their knowledge [20,21]. Spectacular crowdsourcing successes include the analysis of Shiga toxin-producing Escherichia coli [22], the solution of the structure of an HIV protease by the FoldIt player community [23], and science content within Wikipedia [24][25][26][27]. In other cases, high-profile community annotation efforts have been less successful [28], which we attribute to the disconnect between traditional incentives for funding and promotion in academia [29].
Here, we describe the successful implementation over nearly a decade of a university instruction-based model resulting in nearly 5,000 high-quality community annotations added to the GO database. This effort was motivated by the clear parallels between the foundational skills used by professional biocurators and the well-defined goals for undergraduate training [30]. A professional GO biocurator creates gene annotations by finding relevant primary literature, extracting information about normal gene function from it, and entering that information using the controlled GO vocabulary into online databases [31]. We demonstrate that university students, guided by their instructors, could accomplish similar tasks and perform community GO annotation while developing strong critical reading skills in a templated annotation task requiring rigorous reading of primary scientific literature.

Sustainable community member contribution via an online intercollegiate competition
To address the need for broader participation and expansion beyond model organism databases, we initiated an intercollegiate competition based at Texas A&M University mainly for undergraduate students, called Community Assessment of Community Annotation with Ontologies (CACAO). Here, we limit the discussion to details of the competition that are relevant to annotation as the specifics of teaching practice were previously reported [32]. Leveraging the GONUTS wiki platform (https://gowiki.tamu.edu/), a framework for experts not familiar with GO to annotate from literature in their field, teams of students (competitors) participate in the CACAO competition (Fig 1) [33]. Instructors (also called judges) assess all annotations entered by competitors for accuracy and completeness, then give feedback. Peer review by the competitors is incentivized by awarding points for challenges that correct an entry. Teams earn points only for correct annotations and challenges. The team with the highest points accumulated over the competition period wins. Vetted, high-quality annotations are then submitted to the GO Consortium database. CACAO quickly expanded, hosting 39 competitions over 8 years including 23 colleges and universities, with 792 community annotators and 50 judges. After reading 2,879 peer-reviewed journal articles, community members submitted 11,123 annotations to GONUTS (Fig 1). Following careful review through 2018, 4,913 diverse annotations were added to the GO Consortium database (Fig 1). Those annotations are maintained as mandated by updates or changes in the ontology.
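The scoring rule described above (points only for entries judged correct, highest total wins) can be sketched in a few lines. This is a minimal illustration, not the GONUTS implementation: the point values, team names, and the `score_teams` and `winning_team` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical point values; the actual CACAO point scheme is not stated here.
POINTS = {"annotation": 1, "challenge": 1}


def score_teams(entries):
    """Sum points per team, counting only entries the judges marked correct."""
    totals = defaultdict(int)
    for team, kind, judged_correct in entries:
        # Incorrect entries add zero, but the team still appears in the totals.
        totals[team] += POINTS[kind] if judged_correct else 0
    return dict(totals)


def winning_team(totals):
    """Return the team with the highest accumulated points."""
    return max(totals, key=totals.get)


# Made-up entries: (team, entry type, judged correct?)
entries = [
    ("team_a", "annotation", True),
    ("team_a", "annotation", False),
    ("team_b", "annotation", True),
    ("team_b", "challenge", True),
]
totals = score_teams(entries)
```

Because incorrect entries earn nothing, a team gains more from a careful challenge that corrects an entry than from submitting many unvetted annotations.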

Annotations generated through CACAO are diverse, novel, and specific
The 4,913 annotations contributed through GONUTS have spanned all domains of life plus viruses, with the majority being skewed toward eukaryotes, in particular model organisms among the chordates (human, mouse, rat, etc.), Streptophyta (plants including Arabidopsis), and Ascomycota (such as budding yeast) (Fig 2A). As only unique annotations are accepted, this demonstrates that community members can help fill gaps left by professional biocurators working for model organism databases. CACAO annotations also go beyond model organisms. The 614 annotations to viral genes made by CACAO participants represented 285 eukaryotic viruses and 384 viruses that infect bacteria (bacteriophages). Nearly half of the approximately 1,000 annotations listed for bacteriophages in QuickGO list CACAO as the source. Annotations for bacterial proteins make up only 5% of total GO annotations, but 30% of CACAO annotations. At the order level, the top 5 bacterial categories (Enterobacterales, Bacillales, Lactobacillales, Pseudomonadales, and Vibrionales) are heavily studied Gram-negative and Gram-positive organisms of importance to microbiology research and the medical community. The microbial (virus and bacteria) entities described here represent high genetic diversity and often serve as the basis for significant automated propagation to eukaryotic gene products. Thus, we conclude that CACAO annotators not only fill gaps for model organisms, but also expand coverage to a wide array of otherwise poorly curated species.
CACAO participants annotate to a wide variety of specific terms (Fig 2B). The top 3 most used terms within each aspect account for only a small proportion of the total for that branch: approximately 5% for biological process and molecular function, and approximately 24% for cellular component. While the cellular component terms used are relatively general (e.g., nucleus), the top process and function terms are near leaf-level and more specific, having few to no child terms. To better understand the level of detail captured in annotations made by CACAO users, we used GOATOOLS, a Python package for representing where terms fall within the ontology hierarchical graph [34]. Based on the variety of annotation types in our set (e.g., aspects and species), we selected a measure that counts the number of descendants (dcnt), or child terms, for each entry. Higher-level terms have a larger dcnt and are considered general or global. More descriptive terms with no descendants, or leaf-level terms, are more precise or detailed and receive the lowest dcnt value. The dcnt analysis quantitatively demonstrates that CACAO annotations are made to specific terms (Fig 2C). That pattern is consistent with the way annotations were reviewed, where only the most specific term that could be chosen based on the details reported in the paper was counted as correct. For comparison, we performed the same dcnt analysis on all manual GO annotations available through 2019. The distributions of dcnt values for GO annotations are broader and statistically different from CACAO within each aspect (Fig 2C). These data demonstrate that community users can contribute high-quality, precise, and scientifically relevant annotations to GO.
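To make the dcnt measure concrete, the sketch below counts descendants in a tiny toy hierarchy using only the Python standard library. It is an illustration of the idea, not the GOATOOLS code used for the analysis: the term names, the `children` mapping, and the `descendant_counts` helper are hypothetical.

```python
def descendant_counts(children):
    """Map each term to the number of distinct terms anywhere below it."""

    def collect(term, seen):
        # Depth-first walk; the set guards against double counting a term
        # that is reachable through more than one parent.
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                collect(child, seen)
        return seen

    return {term: len(collect(term, set())) for term in children}


# Toy hierarchy (hypothetical term names): D has two parents, B and C.
children = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}
dcnt = descendant_counts(children)
```

In this toy graph the root term `A` has the largest count (4) and the leaf terms `D` and `E` have a count of 0, which mirrors the reading used in the text: a low dcnt marks a specific term and a high dcnt marks a general one.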

CACAO community curators enrich ontology development
The GO changes over time to reflect research progress and improve the representation of biological knowledge [35]. The GO Consortium tracks requests to change the ontology via their GitHub repository accessible on the Helpdesk (http://help.geneontology.org/). CACAO users have submitted >50 tickets via this system, resulting in the creation of 49 new GO terms, many of which now have child terms added by others. Given the diverse literature areas read by community curators, many of these terms are breaking new ground in the ontology. At time of writing, the new terms added based on CACAO feedback had been used >650 times by curators. In addition, at least 14 nonterm changes, such as clarified definitions and relationships for current terms, have also occurred. A beneficial, unintended consequence of CACAO is that curators are compelled to resolve issues within the ontology and incorporate new knowledge from areas that are not traditionally covered by model organism databases.
Fig 2. (B) The distribution of GO terms used for CACAO annotations is graphed by aspect within the ontology. The top 3 terms within each aspect are labeled on the outer ring. For clarity, "activity" was dropped from each function term, and the process terms were abbreviated from "positive/negative regulation of transcription, DNA-templated" to "transcript. reg., positive or negative." The number of GO annotations for each term is indicated in brackets. (C) The descendant counts, corresponding to depth within the ontology, for CACAO annotations (n = 4,913) and all other manual GO annotations in UniProt through 2019 (n = 255,958) are graphed. Significant differences measured by the Mann-Whitney test with p<0.001 are marked with an asterisk (*). CACAO, Community Assessment of Community Annotation with Ontologies; GO, Gene Ontology.
https://doi.org/10.1371/journal.pcbi.1009463.g002

Community member annotations through CACAO add long-term value to GO
The GO resource is among the computational tools most cited by biologists [6]. Automatically inferred annotations, those made without curator intervention, are temporary but make up a significant dynamic proportion of the total GO annotations at any given time. However, the quality of computationally assigned annotations relies on a solid undergirding of manual annotations where the data are reviewed and then annotations are created by a dedicated biocuration community [36]. Of the >6 million manual annotations in the July 2021 release of GO files, approximately one-sixth of the human-curated manual annotations come from traditional experimental evidence, and most of the rest of them come from sequence similarity and phylogenetic similarity evidence [19]. The efforts described here are not meant to rival the volume produced by dedicated biocurators, nor to replace that organized effort. Instead, we demonstrate how small contributions from many individual community members over time accumulate into a unique and valuable resource. By virtue of its decoupling from the traditional funding model, community curation supplements professional biocuration, especially in underfunded areas [17].

Targeted crowdsourcing with attribution makes CACAO annotation sustainable
Recognizing the need to pull expertise from diverse bench scientists, various other initiatives have been implemented to encourage community participation at lower cost [37,38]. For example, the PomBase community curation project called Canto has garnered an impressive response rate of up to 50% for co-annotation from authors within their community [38]. Another natural by-product of crowdsourcing is the diversification of the biocuration workforce. Such introduction of new expertise and perspectives is analogous to the workplace observation that diverse teams innovate and produce more than homogeneous ones [39]. While the majority in the "crowd" may be unlikely to participate [40], the CACAO implementation of GONUTS is a sustainable model for community contribution of vetted GO annotations in areas of current interest because it caters to a nonrandom crowd, primarily students in an academic course setting.
In a resource-limited environment, the need to incentivize data curation has been creatively approached with different methods such as the micropublication format [41][42][43]. Yet, motivating researchers to weigh in on ontology structure is a long-standing challenge [20]. Recognizing the need to credit individuals for their annotation efforts, UniProt now offers a portal for submitting literature-based curation linked to an ORCID (https://community.uniprot.org/bbsub/bbsub.html) [44], as does the new Generic Online Annotation Tool built for the plant community (http://goat.phoenixbioinformatics.org/). Importantly, the GONUTS wiki provides a web-based public record of CACAO contributions, allowing individuals to cite their efforts.

CACAO contributions are valuable because they are unique
On the one hand, community curators can spend the time to read and extract information from redundant papers (those with information highly similar to already curated literature and conclusions) to enhance model organism annotation depth and increase confidence in existing annotations. On the other hand, community curators sample from a vast literature space outside the typical biocurator's expertise, expanding overall organism coverage, such as shown for microbial organisms here [45][46][47][48][49]. Because microbial genomes are typically smaller, groups of students can make a major contribution. A significant instance is adding approximately 50% of all phage GO annotations available in the GO annotation files. CACAO has also spurred updates to ontology relationships. For example, a large rearrangement of biofilm GO terms occurred after CACAO users initiated discussion about their parentage and definitions.

Community curation through CACAO meets modern open-source research and education goals
With online education thrust to the forefront during the global Coronavirus Disease 2019 (COVID-19) pandemic, sustainable and authentic education-driven engagement solutions are critically needed [30,50,51]. Community-driven skills-based classroom research in any number of formats (e.g., CACAO, genome annotation [52][53][54]) serves the scientific community. From an educational perspective, the competition aspect is an engaging format that models real-world scientific skill development with regard to critical reading, iterative editing of a product, and peer review. We hypothesize that this mini biocurator experience may have benefits for recruitment, retention, and graduation similar to those observed with undergraduate research [55,56]. The biocuration model is highly applicable to scientists and trainees worldwide and complies with Findable, Accessible, Interoperable, and Reusable (FAIR) [57] data principles, making its results accessible to all. GO annotation for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and its infection of human cells was immediately pursued to aid strategic planning of the pandemic response (http://geneontology.org/covid-19.html). We appeal to scientists to participate in biocuration efforts through GONUTS, UniProt, or a model organism database/the Alliance of Genome Resources, where users can contribute from the comfort of any computer [58].

Materials and methods
CACAO competitions for intercollegiate teams are hosted on GONUTS (https://gowiki.tamu.edu/). Raw data for all users and every annotation history are maintained by custom extensions to the MediaWiki software used by GONUTS [33]. Additional information about competition rules can be found at https://gowiki.tamu.edu/wiki/index.php/Category:CACAO. The data presented here encompass annotations generated from 2010 to 2018, with expanded taxon information retrieved using the UniProt application programming interface (API) as well as the ETE (v3.1.1) module and various tools from BioPython (v1.74) [59,60]. Summary statistics for CACAO annotations given in Fig 1 were mined from our local database storage.
Fully correct annotation data are transferred from GONUTS regularly via the current Gene Association File (GAF) or Gene Product Association Data (GPAD) file format, as outlined in GO requirements, directly to the European Bioinformatics Institute's Protein2GO for incorporation into the complete GO annotation files. All currently included annotations are accessible on GONUTS or via the search engine QuickGO (https://www.ebi.ac.uk/QuickGO/annotations) by setting the "assigned by" filter to CACAO and are also provided as a Supporting information dataset in GPAD format (S1 File) [61].
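The same "assigned by" filter can be applied to a downloaded GAF. The sketch below assumes the tab-delimited GAF 2.x layout, in which the GO ID, evidence code, and "assigned by" fields are columns 5, 7, and 15. The `read_gaf`, `split_by_curator`, and `make_row` helpers and all sample records are made up for illustration and are not part of the authors' pipeline.

```python
# 0-based positions of the GAF 2.x columns used below
# (GO ID = column 5, evidence code = column 7, assigned by = column 15).
GO_ID, EVIDENCE, ASSIGNED_BY = 4, 6, 14


def read_gaf(lines):
    """Yield the tab-separated fields of each annotation row, skipping '!' headers."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")


def split_by_curator(lines, curator="CACAO"):
    """Return (GO IDs assigned by `curator`, GO IDs from everyone else), ignoring IEA rows."""
    mine, others = [], []
    for row in read_gaf(lines):
        if row[EVIDENCE] == "IEA":  # automatically inferred, not manual
            continue
        (mine if row[ASSIGNED_BY] == curator else others).append(row[GO_ID])
    return mine, others


def make_row(go_id, evidence, assigned_by):
    """Build a made-up 17-column GAF row; every value here is a placeholder."""
    cols = ["UniProtKB", "X00000", "geneX", "", go_id, "PMID:0", evidence,
            "", "P", "", "", "protein", "taxon:0", "20200101", assigned_by, "", ""]
    return "\t".join(cols)


sample = [
    "!gaf-version: 2.1",
    make_row("GO:0000001", "IDA", "CACAO"),
    make_row("GO:0000002", "IMP", "UniProt"),
    make_row("GO:0000003", "IEA", "UniProt"),
]
cacao_terms, other_terms = split_by_curator(sample)
```

Splitting the rows this way gives two lists of GO IDs, one per source, which is the shape needed to compare term specificity between CACAO and all other manual annotations.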
The 01-01-2020 non-IEA GAF (goa_uniprot_all_noiea.gaf.gz) and ontology file (go.obo) were downloaded from http://release.geneontology.org/ for the dcnt analysis. Values for dcnt were calculated with GOATOOLS for all manual annotations not assigned by CACAO [34]. The Mann-Whitney test with a 2-sided p-value was used to compare GO and CACAO dcnt distributions within each aspect using SciPy [62,63].
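For readers unfamiliar with the test, the sketch below spells out a 2-sided Mann-Whitney comparison of two dcnt samples. The analysis itself used SciPy; this version instead uses only the standard library and the normal approximation with tie and continuity corrections, so its p-values are approximate for small samples. The `mann_whitney_u` helper and the sample values are hypothetical.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U statistic for x, 2-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    combined = sorted(list(x) + list(y))

    # Assign midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j

    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    # Tie correction shrinks the variance when many values are equal.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u1, 1.0
    z = max((abs(u1 - mean_u) - 0.5) / sigma, 0.0)  # continuity correction
    return u1, math.erfc(z / math.sqrt(2))


# Made-up dcnt samples: a "specific" set versus a "broader" set.
specific = [0, 0, 1, 2]
broad = [5, 10, 20, 40]
u_stat, p_value = mann_whitney_u(specific, broad)
```

A rank-based test suits dcnt values because they are heavily skewed counts with many ties at zero, so comparing ranks is safer than comparing means.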
For the phage analyses, the GAF was filtered into a subset using the following TaxIDs from the NCBI Taxonomy
Changes to the ontology initiated by CACAO users were tallied by searching through the GO issue tracker at GitHub (https://github.com/geneontology/go-ontology/issues) for user handles: @jimhu-tamu, @suzialeksander, @sandyl27, @jrr-cpt, @ivanerill, and/or the query text "CACAO" for open and closed issues, then manually reviewed for accuracy. Matplotlib (v3.1.1) and Seaborn (v0.9.0) were used to generate pie charts, box plots, and bar graphs [64,65].