Improving big citizen science data: Moving beyond haphazard sampling

Citizen science is mainstream: millions of people contribute data to a growing array of citizen science projects annually, forming massive datasets that will drive research for years to come. Many citizen science projects implement a “leaderboard” framework, ranking the contributions based on number of records or species, encouraging further participation. But is every data point equally “valuable?” Citizen scientists collect data with distinct spatial and temporal biases, leading to unfortunate gaps and redundancies, which create statistical and informational problems for downstream analyses. Up to this point, the haphazard structure of the data has been seen as an unfortunate but unchangeable aspect of citizen science data. However, we argue here that this issue can actually be addressed: we provide a very simple, tractable framework that could be adapted by broadscale citizen science projects to allow citizen scientists to optimize the marginal value of their efforts, increasing the overall collective knowledge.


Introduction
In October 2018, Corey traveled to Malaita, Solomon Islands, with the Australian Museum to conduct a biodiversity assessment with the local Kwaio people (https://australianmuseum.net.au/blog/amri-news/solomon-islands-ornithology/). While there, he submitted 66 eBird checklists, comprising 650 bird observations. He submitted the very first record of Malaita Dwarf-Kingfisher to the growing database, which holds >600 million observations comprising approximately 99% of the world's bird species. Traveling to a remote part of the world to survey birds was truly a "once-in-a-lifetime" opportunity. But submitting eBird checklists from his smartphone (one example of a citizen science contribution) was simply part of his daily routine.
He is not alone. Citizen science is now mainstream, with hundreds of thousands of participants worldwide contributing observations of the natural world to various citizen science projects daily. Among the most popular projects, accumulating millions of observations annually, are those in which citizen scientists record the numbers and types of organisms observed [1,2].
But what is the difference between submitting an eBird checklist from a remote part of the world and submitting an eBird checklist while walking the dog near his home in Sydney, as Corey does most days? Is one inherently more "valuable" to the database than the other? In this paper, we examine this question, highlighting that not all citizen science observations are created equal. We argue that there is room for improvement in large-scale citizen science collection schemes and provide a conceptual framework to assign value to citizen science observations.

Citizen science is mainstream
Citizen science projects (scientific research conducted in whole or in part by people for whom science is not their profession) are incredibly valuable for society [3], with their importance to scientific research growing each year [4]. These projects provide educational opportunities [5], increase scientific knowledge [6], and collect immense amounts of raw information about biodiversity in the world [7,8]. Citizen science data allow larger spatial and temporal scales for key research questions in many fields, including environmental toxicology, medicine, nutrition, astronomy, and biodiversity research [2,9]. In this essay, we focus on citizen science projects in which the main intent is to collect broadscale biodiversity data, but the arguments apply to any citizen science projects that sample in space and time.
Each citizen science project aimed at collecting broadscale biodiversity data falls along a continuum, from unstructured to structured, based on the objectives, survey design, flexibility, rigorousness, and detail collected about the observation process [10,11]. Projects with clear objectives, clearly planned data analysis, and rigorous protocols, for instance, are classified as structured projects. Conversely, projects with open and flexible recruitment and a general lack of protocols are classified as unstructured projects [10]. Many projects fall between these extremes and are thus classified as semistructured [10,11]. Examples of such projects, and their associated level of structure, include iNaturalist (unstructured; [12]), eBird (semistructured; [7]), eButterfly (semistructured; [13]), FrogID (semistructured; [14]), and the UK Butterfly Monitoring Scheme (structured; [15]). Regardless of their level of structure in data collection, each of these projects has a specific aim: to collect observations of a unique taxon along with spatial and temporal data. We define one of these observations as a biodiversity sampling event (BSE).
A well-known feature of the data from these broadscale citizen science projects is the patchy distribution of BSEs across space and time [2,16,17], differing based on the level of structure of a project [10]. This leaves global citizen science datasets with spatial and temporal gaps and redundancies [18,19]. Other biases associated with citizen science projects include interobserver skill differences [20] and taxonomic biases [21,22], influencing the data validation/quality of a specific citizen science project [23]. Data collection biases can, in some cases, be minimized with certain statistical techniques [18,24-26]. For example, interobserver skill differences can be accounted for in species distribution models [24]. Alternatively, sampling strategies and protocols can be enhanced [18], whereby citizen science projects transform from unstructured to more structured projects throughout the life of the project [10]. Additionally, data can be filtered or subsampled to deal with error and uneven effort [27,28], pooled among species [29], and augmented with data with a known sampling effort [30]. More complicated machine learning and hierarchical clustering techniques also exist, allowing for investigation of the relative importance of a large number of explanatory variables [31-33]. Generally, sophisticated methods are preferred to simple methods when accounting for biases in citizen science data [25,34]. This is especially true for unstructured and semistructured citizen science projects, which collect some information on potential biases that can then be accounted for. Conversely, fewer biases need to be accounted for when using data from structured citizen science projects, and thus, simpler statistical techniques can be appropriate. But importantly, although some approaches can deal with biases for particular questions, none of these approaches can ultimately increase the information content in the data. This can only occur in the data collection process.
Different citizen science projects, based on the level of structure associated with data collection, necessitate different statistical approaches to minimize bias that arises from patchy biodiversity data. As an example, continental- and hemispheric-scale species distribution models derived using eBird data account for spatial bias by aggregating data in grids [35,36] while modeling differences in individual observer skill levels [20,24] of those who collect the data. These data are now being used to produce species-specific range maps with estimates of abundance (e.g., https://ebird.org/science/status-and-trends/). Data from the UK Butterfly Monitoring Scheme have produced reliable trend estimates for 62 butterfly species and accounted for sampling intensity by using a subsampling analysis [15]. There are a variety of methods to account for biases in citizen science projects at various parts of the data collection protocol. Biases have been minimized at the time of data collection by providing very structured protocols for projects that target specific monitoring areas and times (e.g., seagrass research [37]). Biodiversity data have been crowdsourced using an incentivized "reputation system" to motivate and reward participants who identify species, and critically, these data were groundtruthed by professionals, showing a 92% accuracy rate [38]. Hidden Markov models have been used to identify insect recordings in real time [39]. And many projects generally use active encouragement to collect large amounts of volunteered geographic information [40].

Characterizing the value of biodiversity sampling events
Given the vast potential of citizen science monitoring schemes [2,41,42], methods to decrease patchiness and increase information in the data are crucial. Ultimately, this will help improve the confidence in downstream analyses. We foresee the following points to be critical in order to improve citizen science sampling for broadscale biodiversity projects:
• shift away from taxa-specific approaches and begin to incentivize looking in space and time, rather than finding
• implement a conceptual framework and associated algorithms that suggest high-marginal-value sampling sites to participants
• provide participants with incentive to contribute in the most meaningful manner

Optimal sampling of biodiversity in space and time?
To maximize the value of each citizen scientist's effort, we first have to answer a key question: Is a one-off trip to a remote part of the world more valuable than daily observations while walking a dog? Or, in general terms, what is the marginal value of each event to the overall project? This is specific to the questions researchers will ask using these data. Are the intended outcomes of the citizen science project aimed at producing reliable species distribution models? Or do the outcomes revolve around producing reliable population trends for a given management area? If the former, then a preference may be placed on homogeneous or stratified sampling in space, but if the latter, a preference might include less spatial sampling but longer time series at fewer sites. Inevitably, there are inherent trade-offs in spatial and temporal sampling, depending on the questions of interest. Projects with high spatial resolution of BSEs are beneficial for species distribution models [43,44], niche breadth [45], biodiversity measurements [9,46], and phylogeographical research [47]. Conversely, projects investigating detection probabilities [48], full-annual-cycle research [49,50], invasive species detection [51,52], and population trends [53,54] benefit from high-temporal-resolution BSEs.
Regardless of potential questions that will be asked by researchers, or intended goals of a citizen science project, there are some general principles in sampling design, relating to sampling in space and time, that can be applied to improve the structure of the data for many, if not all, future questions. We will first explore the relatively simple case of spatial resolution, followed by the more complicated instance of temporal resolution, before treating them both simultaneously.

Spatial resolution
The simplest scenario for sampling global biodiversity would be to distribute BSEs homogeneously around the globe. Given species-area relationships [55,56] and the scale dependence of sampling biodiversity [57], the value of a BSE, given a preexisting set of BSEs, should be directly proportional to the distance between it and the nearest BSE. In other words, the information content that a given BSE adds to a collective dataset increases with its distance from all other BSEs. But biodiversity is not homogeneous around the globe, and thus, BSEs should be stratified by habitat/biome, relative to overall biodiversity. Furthermore, organisms within and among taxa are not detected equally [58,59], making multiple BSEs at a given site (i.e., temporal replication) necessary for understanding local biodiversity [48,60]. Habitat/biome definitions are also debatable. Together, these points suggest that systematic sampling in space is neither achievable nor desirable.
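As a minimal sketch of the distance principle above, the Python snippet below scores candidate sites by their great-circle distance to the nearest existing BSE and ranks the most isolated candidate first. The function names, the spherical-Earth haversine approximation, and the approximate example coordinates are illustrative assumptions rather than part of S1 Text, and the sketch deliberately ignores the habitat stratification and detectability caveats just described.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def spatial_value(candidate, existing_bses):
    """Hypothetical spatial value: distance (km) to the nearest existing BSE."""
    return min(haversine_km(candidate, bse) for bse in existing_bses)


def rank_candidates(candidates, existing_bses):
    """Return (name, value) pairs sorted from most to least isolated."""
    scored = [(name, spatial_value(point, existing_bses))
              for name, point in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative, approximate coordinates (assumptions, not project data).
existing = [(-33.87, 151.21)]                      # a well-sampled area near Sydney
candidates = {"suburban park": (-33.88, 151.22),   # close to existing BSEs
              "Malaita": (-9.0, 161.0)}            # far from existing BSEs
ranking = rank_candidates(candidates, existing)
print(ranking)
```

A real project would probably replace the linear scan with a spatial index and cap or transform the distance so that a handful of extremely remote sites do not dominate the ranking.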

Temporal resolution
Temporal resolution, by necessity, may be thought of as analogous to an additional spatial dimension: temporal replication has to take place at a particular site. We do not provide a specific, rigid definition for site, as the definition will be highly dependent on the specific citizen science project. At the finest resolution, site could be equated to a particular BSE (i.e., unique latitude and longitude coordinates), or it could be a management unit of some spatial relevance (e.g., an urban greenspace, national park, county, state). If the latter, then spatial sampling would likely need to be applied within a specific "site" (i.e., multiple BSEs within a national park would be necessary). For better understanding of biodiversity changes, we should aim to increase the temporal replication of BSEs at a site. The sampling of every site can be visualized as a distribution that represents the sampling interval between BSEs. Wide variation will exist among sites, but the ultimate goal is to achieve a specific desired sampling interval between BSEs, left-shifting a particular site's distribution of sampling intervals or, in other words, decreasing the median and mean sampling interval for a site. Instead of many participants (or a single participant) visiting a single, well-sampled site (i.e., pseudoreplication), the visitation of a site can be optimized so that the tail of the distribution of all sites is left-shifted. Thus, the value of a BSE at a site would be related to the desired sampling interval and the time since the last sample. In other words, a BSE at a site that hasn't been sampled in a month has a higher marginal value than a BSE at a site that was sampled the previous day, dependent on the desired sampling interval at a site. Marginal values are dynamic, as new BSEs are continuously submitted to a citizen science project.
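The temporal quantities described above can be sketched in a few lines of Python. The snippet computes the sampling intervals at a site and a hypothetical temporal value, taken here (as an assumption) to be the time since the last BSE divided by the desired sampling interval; the text only states that value is related to these two quantities, so other functional forms are equally plausible. The dates and the 30-day target are invented for illustration.

```python
from datetime import date
from statistics import median


def sampling_intervals(bse_dates):
    """Days between consecutive BSEs at one site."""
    ordered = sorted(bse_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def temporal_value(bse_dates, today, desired_interval_days):
    """Hypothetical temporal value: days since the last BSE,
    expressed in units of the desired sampling interval."""
    days_since_last = (today - max(bse_dates)).days
    return days_since_last / desired_interval_days


# Invented example data (assumptions, not project data).
today = date(2019, 2, 21)
neglected_site = [date(2019, 1, 1), date(2019, 1, 8), date(2019, 1, 22)]
busy_site = [date(2019, 2, 18), date(2019, 2, 19), date(2019, 2, 20)]

print(median(sampling_intervals(neglected_site)))
print(temporal_value(neglected_site, today, desired_interval_days=30))
print(temporal_value(busy_site, today, desired_interval_days=30))
```

Because the value depends on `today`, rerunning the same calculation after each new submission gives the dynamic behavior described in the text: a site's value drops when it is sampled and rises again as time passes.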

Spatial and temporal resolution
The ultimate goal in the future of broadscale biodiversity citizen science projects should be to increase spatial resolution while simultaneously increasing temporal replication at sites, balancing inherent trade-offs in spatial and temporal resolution. Ultimately, ecological and conservation outcomes that combine both spatial and temporal data can be achieved [9,61]. And information at both spatial and temporal scales is necessary for broadscale conservation prioritization.

Forecasting the value of future BSEs
Here, we provide a simple but general framework to forecast the marginal value of future BSEs. This framework requires a desired outcome: defining which specific information about a species or group of species is important for conservation or basic science (e.g., species distribution models or trend detection). There could be many desired questions, dependent on specific management goals. The management goals can also define the species or species pools of special interest (for example, all migratory birds in a national park or one specific species that is highly threatened). That goal defines a statistical model, and then, within that model, the contribution of each individual BSE can be quantified using the statistical concept of leverage [62]. High-leverage BSEs are useful for the desired outcome in that they are very important (i.e., influential) observations for the model, whereas low-leverage BSEs are less useful. The goal of future sampling can then be defined precisely: encourage a shift from low-leverage, low-value BSEs to high-leverage, high-value BSEs.
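To make the idea of leverage concrete, the sketch below assumes the simplest possible statistical model, a linear regression of a response on one predictor with an intercept, and takes leverage in its usual regression sense, the diagonal of the hat matrix: h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The choice of model, the function name, and the predictor values are assumptions for illustration; the model actually used would follow from the project's desired outcome.

```python
def leverages(x):
    """Hat-matrix diagonal for the simple linear regression y ~ 1 + x.

    x holds one predictor value per BSE (for example, a position along
    an environmental gradient; a hypothetical choice).
    """
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor must vary across BSEs")
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Four BSEs clustered at similar predictor values and one far from the rest.
predictor = [1.0, 2.0, 3.0, 4.0, 10.0]
h = leverages(predictor)

for value, lev in zip(predictor, h):
    print(f"x = {value:5.1f}  leverage = {lev:.2f}")
```

The leverages sum to the number of fitted parameters (two here), so a BSE that sits far from the bulk of the data takes a disproportionate share; this is the sense in which an isolated observation is more influential than one more observation in a well-sampled region.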
Of course, the distribution of biodiversity in time and space is not deterministic, and so we cannot predict the exact leverage of future BSEs. We can, however, predict the "expected leverage." To find this, we look at the past: because the desired outcome is an improved statistical model, then for past data, it is simple to calculate the leverage for each BSE. To determine the effect of space and time on statistical leverage in the past, leverage values can be regressed against a suite of potential spatial and temporal variables, which are likely to influence the outcomes of the intended statistical model. Examples of forward-looking parameters we find important in space and time include (1) whether the site was sampled, (2) the distance to the nearest sampled site, (3) the median sampling interval of BSEs, (4) the median sampling interval of the site's nearest neighbor, and (5) days since the last BSE at the site. A number of other variables could also be included in this framework, including observer skill, time of day, and weather, but we focus on the variables generalized across varied structured to unstructured citizen science projects.
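The five space and time variables listed above could be derived from a table of past BSEs roughly as follows. This is only a sketch under stated assumptions: sites are given planar coordinates in km, "nearest sampled site" is read as the nearest other site with at least one BSE, and the data structure, field names, and example values are hypothetical. S1 Text describes the authors' own stepwise calculation, which this does not attempt to reproduce.

```python
import math
from datetime import date
from statistics import median


def median_interval(bse_dates):
    """Median days between consecutive BSEs, or None with fewer than two BSEs."""
    ordered = sorted(bse_dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


def site_predictors(name, sites, today):
    """Compute the five predictors for one site.

    sites maps a site name to {"xy": (x_km, y_km), "dates": [date, ...]}.
    """
    me = sites[name]
    sampled_others = [n for n, s in sites.items() if n != name and s["dates"]]
    nearest = min(sampled_others,
                  key=lambda n: math.dist(me["xy"], sites[n]["xy"]),
                  default=None)
    return {
        "sampled": bool(me["dates"]),
        "dist_nearest_sampled_km": (math.dist(me["xy"], sites[nearest]["xy"])
                                    if nearest else None),
        "median_interval_days": median_interval(me["dates"]),
        "neighbor_median_interval_days": (median_interval(sites[nearest]["dates"])
                                          if nearest else None),
        "days_since_last": (today - max(me["dates"])).days if me["dates"] else None,
    }


# Invented example data (assumptions, not project data).
sites = {
    "park": {"xy": (0.0, 0.0),
             "dates": [date(2019, 1, 1), date(2019, 1, 11), date(2019, 1, 31)]},
    "creek": {"xy": (3.0, 4.0), "dates": [date(2019, 1, 5), date(2019, 1, 7)]},
    "ridge": {"xy": (30.0, 40.0), "dates": []},  # never sampled
}
today = date(2019, 2, 10)
park = site_predictors("park", sites, today)
ridge = site_predictors("ridge", sites, today)
print(park)
print(ridge)
```

Rows of this kind, one per past BSE or site, would form the predictor matrix against which past leverage values are regressed; the fitted relationship then yields the expected leverage of a future BSE.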
In S1 Text, we present a stepwise approach to calculating these variables and, thus, the marginal value of a given site in space and time. The actual parametrization of these variables will depend on the statistical model of interest. For example, for a species distribution model, the highest-valued BSE is likely to be the furthest away from other BSEs, whereas for phenological questions, the highest-valued BSEs are likely to be related to the time since the last sample at a site. In theory, expected high-leverage sites shortly after sampling would become expected low-leverage sites; then, if they are not sampled through time, the expected leverage creeps upward again. This process repeats throughout the landscape, providing a dynamic map of expected BSE values into the future (S1 Text).

A dynamic system of incentives
Our framework would be able to update on a monthly, weekly, daily, or even real-time basis, dependent on the taxa in question and the participation rates of the citizen science project. This dynamic aspect of citizen science projects is not novel. Many citizen science projects dynamically provide feedback to participants [27,63-65], often in the form of leaderboards, creating either friendly competition or a sense of self-competition by providing participants with performance feedback [27,66]. These tools may help sustain engagement over time [67], and similar approaches of incentivizing more sampling have proven successful with eBird participants [68].
In order to maximize observation value, we propose that leaderboards should not dwell only on "total number of records" or "total number of species" but should also incorporate "overall value of participants' observations." One key concern about incentivizing science is to maintain data quality while directing effort more productively. To this end, we argue that incentives should be placed on sampling particular places at particular times, not on actually finding specific species. To date, this is not common practice: other citizen science projects reward finding particular species (e.g., Questagame; https://questagame.com/). We argue that incentivizing looking rather than finding would ultimately decrease the ability of participants to "game the system" [69]. Biases resulting from individuals preferentially chasing rare species would also be less likely. We envision an approach that would incentivize the concept of submitting more "valuable" BSEs, encouraging participants to travel to sites that are prioritized based on the marginal value of a BSE from a specific site. Such an approach could see the following workflow (Fig 1):
Step 1: Citizen science participants could opt into the "challenge," addressing privacy concerns. This ensures that only participants who are interested in participating would potentially get push notifications and enter potentially sensitive information about where they are willing to sample (see Step 2).
Step 2: Participants could provide a point (and an associated radius) on a map, indicating their preferred sampling area from which to submit BSEs. This could be flexible (e.g., intraweek variation), and when a participant is in a new area (i.e., on a vacation), they could provide updated areas.
Step 3: All potential sampling locations within the user-specified sampling domain could be selected, normalized, and ranked based on the citizen science project's preferred weighting of the formula. This could then be presented as a map showing the highest-valued sites. An important first step would be to use spatial datasets to delineate the boundaries of private land tenure and sensitive areas, ensuring that participants are not encouraged to trespass on private property or disturb sensitive habitat. We also envision an approach in which users could opt in to receive push notifications that provide a ranked list of sites. This would be dynamically updated based on observations submitted on a given timescale.
Step 4: A given BSE could then be assigned "points," and a leaderboard displaying the participant's value of their submitted BSEs could be developed, encouraging further participation. Points could be assigned based on the prioritized site list, with the highest prioritized site receiving the most points, through to the least prioritized site (e.g., S1 Text). All submitted BSEs could be scored, so that any opportunistic BSE still receives a value, expressed as a proportion of the value available at the site with the highest marginal value. The leaderboard would need to be normalized to the density of participants in a given area to keep participants' scores on a comparable scale.
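The selection, normalization, ranking, and points assignment in Steps 3 and 4 could be sketched as follows. Everything specific here is an assumption made for illustration: the planar coordinates in km, the field names (`dist_nearest_km`, `days_since_last`, `restricted`), the min-max normalization, the equal weights, and the 100-point scale. The weighting a project actually adopts would come from its own statistical model.

```python
import math


def sites_in_domain(sites, center, radius_km):
    """Keep sites inside the participant's radius, dropping restricted land."""
    return {name: s for name, s in sites.items()
            if math.dist(s["xy"], center) <= radius_km
            and not s.get("restricted", False)}


def normalize(values):
    """Min-max scale a {name: value} mapping to the range 0 to 1."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {name: 1.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


def rank_sites(sites, center, radius_km, w_space=0.5, w_time=0.5):
    """Return (name, score) pairs, highest marginal value first."""
    domain = sites_in_domain(sites, center, radius_km)
    if not domain:
        return []
    space = normalize({n: s["dist_nearest_km"] for n, s in domain.items()})
    time = normalize({n: s["days_since_last"] for n, s in domain.items()})
    scores = {n: w_space * space[n] + w_time * time[n] for n in domain}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def assign_points(ranked, max_points=100):
    """Points proportional to score; the top-ranked site earns the most."""
    return {name: round(max_points * score) for name, score in ranked}


# Invented example data (assumptions, not project data).
sites = {
    "wetland": {"xy": (1.0, 0.0), "dist_nearest_km": 10.0, "days_since_last": 30},
    "car park": {"xy": (0.0, 2.0), "dist_nearest_km": 2.0, "days_since_last": 5},
    "woodlot": {"xy": (2.0, 2.0), "dist_nearest_km": 6.0, "days_since_last": 10},
    "farm": {"xy": (3.0, 0.0), "dist_nearest_km": 6.0, "days_since_last": 60,
             "restricted": True},  # private land, never suggested
    "island": {"xy": (50.0, 0.0), "dist_nearest_km": 100.0, "days_since_last": 100},
}
ranked = rank_sites(sites, center=(0.0, 0.0), radius_km=5.0)
points = assign_points(ranked)
print(ranked)
print(points)
```

The sketch leaves out the final normalization by local participant density, which would be applied to leaderboard totals rather than to individual site scores.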

Working within real-world constraints
Real-world constraints will inevitably limit the move toward optimal sampling of biodiversity in space and time. First, people are unevenly distributed across the globe, and wealth and literacy of the global population likely influence participation rates in citizen science projects. Second, groups of organisms (e.g., birds, fish, invertebrates) vary in their popularity with the general public both among [21] and within [22] taxa, influencing the level of participation in citizen science projects. Observations of some species (i.e., those that are less "popular" with the general public) may, therefore, be more valuable than others, but we do not include a species-specific resolution in our framework for assigning value to BSEs, because the species being detected cannot be predicted. Rather, the probability of a species being detected is a function of spatial and temporal sampling [70,71]. Third, citizen scientists are also more likely to sample in convenient locations. Indeed, one reason for the vast success of semistructured citizen science projects is the relative ease of data collection by the participants (i.e., few protocols to follow). Finally, not all citizen scientists make equally valuable contributions. This results from a difference in skills among observers (which can be accounted for during analyses [20]) and a difference in their dedication to a particular project. Moreover, motivations of participants vary among projects [72,73]. For at least a subset of citizen scientists, a primary motivation of participation is to contribute to science [72,73], suggesting that these participants are likely willing to improve their sampling, knowing it would benefit science. And there are potential education opportunities that can improve participants' knowledge of how data are used by citizen science projects [5]. Other participants could be incentivized by providing "leaderboards" of participants with the most valuable BSEs.
Although we cannot account for all of the above constraints, we can attempt to maximize the collective citizen science effort by optimizing when and where people sample biodiversity.

Conclusions
Citizen science is mainstream, and research will increasingly use citizen science data at least in part to increase the spatial and temporal context of our research efforts. But are we maximizing the absolute power of the vast number of citizen scientists contributing to our collective knowledge? We think not. Although we provide an intuitive and simple conceptualization, we acknowledge that this is only one way in which to "value" a given citizen scientist's effort, which will ultimately be dependent on the project's design and intended outcomes. Our goal in writing this piece is simple: we urge those developing and overseeing citizen science projects to think critically about refining data collection techniques, realizing the full potential for citizen science.

Supporting information
S1 Text. An example of a dynamic system to calculate marginal value. A guide to how we envision a dynamic system that can be used to calculate the value of BSEs in the future, broken down by steps, with figures and tables. This is intended as an example of how our framework could be implemented and is not intended to be prescriptive. We show it for a given, particular date, but this would be calculated on an updated, dynamic basis. BSE, biodiversity sampling event. (PDF)