Comparing the Quality of Crowdsourced Data Contributed by Expert and Non-Experts

There is currently a lack of in-situ environmental data for the calibration and validation of remotely sensed products and for the development and verification of models. Crowdsourcing is increasingly being seen as one potentially powerful way of increasing the supply of in-situ data but there are a number of concerns over the subsequent use of the data, in particular over data quality. This paper examined crowdsourced data from the Geo-Wiki crowdsourcing tool for land cover validation to determine whether there were significant differences in quality between the answers provided by experts and non-experts in the domain of remote sensing and therefore the extent to which crowdsourced data describing human impact and land cover can be used in further scientific research. The results showed that there was little difference between experts and non-experts in identifying human impact although results varied by land cover while experts were better than non-experts in identifying the land cover type. This suggests the need to create training materials with more examples in those areas where difficulties in identification were encountered, and to offer some method for contributors to reflect on the information they contribute, perhaps by feeding back the evaluations of their contributed data or by making additional training materials available. Accuracies were also found to be higher when the volunteers were more consistent in their responses at a given location and when they indicated higher confidence, which suggests that these additional pieces of information could be used in the development of robust measures of quality in the future.


Introduction
The proliferation of Web2.0 technology over the last decade has resulted in changes in the way that data are created. Individual citizens now provide vast amounts of information to websites and online databases, much of which is spatially referenced. The analysis and exploitation of this georeferenced subset of crowdsourced data, or what is more commonly referred to as volunteered geographic information (VGI) [1,2], has the potential to fundamentally change the nature of scientific investigation. Citizens have a long history of being involved in scientific research or the more recently coined 'citizen science' [3]. There are many successful examples of citizen science that have led to new scientific discoveries, including unravelling protein structures [4] and discovering new galaxies [5], as well as websites for public reporting of illegal logging/deforestation [6] and waste dumping [7], which have demonstrated how citizens can have a visible impact upon the environment and local governance. Analysis of more passive sources of geo-tagged data from the crowd from search engines such as Google has also revealed interesting scientific trends, e.g. the relationship between GDP and searches about the future [8], trends in influenza [9] and the ability to characterize crop planting dates [10]. One of the critical advantages of VGI is the potential increase in the volumes of data about all kinds of spatially referenced phenomena. Such data can be collated and used for many different scientific activities: from the calibration of scientific models (e.g. economic prediction models that require information about land use) to the validation of existent data (e.g. maps derived through Earth Observation).
With improved connectivity via mobile phones and the use of low cost, ubiquitous sensors (e.g. those which directly and instantaneously capture data about their immediate environment), the opportunities to exploit such rich veins of VGI are many and varied. However, whilst one of the pressing challenges concerns how to manage large data volumes in terms of processing and storage, a number of as yet unaddressed issues persist. These include how to handle data privacy, how to ensure adequate security, and critically, how to assess VGI data quality. Data quality is an area that has attracted increasing attention in the literature [1,11-13]: quantifying VGI data quality underpins its usefulness (that is, its reliability and credibility) and potential for incorporation into scientific analyses. The critical issue is whether ordinary citizens can provide information that is of high enough quality to be used in formal scientific investigations.
With open access to high resolution satellite imagery through providers such as Google Earth and Bing Maps, it is possible to collect vast amounts of volunteered information about the Earth's surface such as land cover and land use. The collection of crowdsourced land cover data is the main aim of the Geo-Wiki project [14,15] in what is currently a contributory approach to citizen science [16]. Geo-Wiki is a web-based geospatial portal (http://www.geo-wiki.org) with an interface linked to Google Earth. It can be used to visualize and validate global land cover datasets such as GLC-2000, MODIS and GlobCover [12] which frequently disagree over the land cover they record at any given location [17][18][19]. Since its inception, a number of Geo-Wiki branches have been initiated, each one specifically devoted to gathering different types of information such as agriculture (agriculture.geo-wiki.org), urban areas (cities.geo-wiki.org), biomass (biomass.geo-wiki.org) and more recently human impact (humanimpact.geo-wiki.org).
The general aim of this paper is to determine whether there are significant differences in quality in the information contributed by experts and non-experts. This is explored through a land cover case study with obvious implications for the domains of remote sensing and landscape analyses and investigation of the extent to which VGI can be trusted as a source of training and validation data in remote sensing. However, by investigating generic research questions related to the quality and reliability of information contributed by citizens with different levels of domain expertise, this research should also be of interest to the broader field of citizen science. The next section describes data collection via the human impact Geo-Wiki campaign and the analysis of volunteer and volunteered data quality. Following the results, some discussion is provided regarding the implications of incorporating VGI in scientific research including recommendations for further research before conclusions are drawn in the final section.

Data from the Human Impact Competition
Crowdsourced data on land cover were collected using a branch of Geo-Wiki called Human Impact (http://humanimpact.geowiki.org) and the data were subsequently used to validate a map of land availability for biofuel production [20]. The volunteers were presented with pixel outlines of 1 km resolution (at the equator) projected onto Google Earth (where pixels in this context refer to the smallest area for which information is collected) and were then asked to determine the percentage of human impact and the land cover type at each location from the following list: (1) Tree cover, (2) Shrub cover, (3) Herbaceous vegetation/Grassland, (4) Cultivated and managed, (5) Mosaic of cultivated and managed/ natural vegetation, (6) Flooded/wetland, (7) Urban, (8) Snow and ice, (9) Barren and (10) Open Water. The concept of 'human impact' was defined as the amount of evidence of human activity visible in the Google Earth images. A spectrum of these intensities is shown in Table 1, which is loosely based on the ideas of Theobald [21]. Volunteers were also asked to indicate their confidence in the class type and the impact score, whether they had used high resolution imagery and the date of the image. Volunteers were recruited by emails sent to registered Geo-Wiki volunteers, relevant mailing lists and contacts, in particular those with students, and through social media. Background information on the competitors was collected through the registration procedure. The competition ran for just under 2 months in the autumn of 2011 [22]. The top ten volunteers were offered coauthorship on a paper resulting from the competition [20] as well as Amazon vouchers as an incentive. Other incentives included inviting friends, which resulted in extra points, a leader board so that competitors could gauge the competition, and appealing to the environmental motivation of individuals through the biofuel theme.
A set of 299 'control' points was used to determine quality where three experts with backgrounds in physical geography, geospatial sciences, remote sensing and image classification agreed upon the land cover at each location. The first 99 control points were provided to the volunteers at the start of the competition, the next 100 were provided three-quarters of the way through and the final 100 were provided at the end, where the latter were drawn from higher resolution imagery. The volunteers were then ranked by an index that combined quality and quantity through equal weighting, and the top ten were declared the winners. Interestingly, there were some minor changes in the top ten once quality was considered.
A total of ~53,000 locations were validated by more than 60 individuals and Figure 1 shows the rapid increase in contributions in the last 20 days of the competition, with a particularly large spike at the end. Figure 2 illustrates the spatial distribution of the ~53,000 points collected expressed as measures of human impact and land cover. Note that the crowdsourced data can be freely downloaded from http://www.geo-wiki.org.
Of these ~53,000 validations, 7657 were at the control locations, which were then used to assess quality. The data were then filtered for 'unknown' expertise resulting in 4020 control data points scored by 29 Expert volunteers and 3548 control data points scored by 33 Non-expert volunteers. Experts were considered to be individuals with a background in remote sensing/spatial sciences versus non-experts who were new to this discipline or had some self-declared limited background. The control data, whose analysis forms the basis of the paper, have the following characteristics. Experts evaluated an average of 64.8 control data points each (s.d. 108.1) and non-experts 57.2 (s.d. 95.1). Although there is the potential for a few individuals to have a disproportionately large impact on data quality and composition, in this case, of the 29 experts, 18 contributed more than 50 evaluations, and of the 33 non-experts, 19 evaluated more than 50 data points. The volunteers' demographics (age, gender, socioeconomic status etc.) were not captured as part of the contributor registration. This is unfortunate, because although a proxy for previous experience is evaluated in this paper, it is well recognised that such factors can influence contributor responses. Such data will be collected in future campaigns.

Analysis of Human Impact
To determine how well the answers provided by the volunteers matched the control data in terms of the degree of human impact, a linear regression was fit as follows:

$$Y_i = a + b X_i + e_i \qquad (1)$$

where $Y_i$ is the degree of human impact from the control data, $X_i$ is the degree of human impact from the volunteers, $a$ and $b$ are coefficients of the linear regression equation and $e_i$ is a normally distributed random error term for each observation $i$.
Each volunteer provided information on expertise during registration. Equation 1 was extended to include an indicator of respondent expertise in the regression model:

$$Y_i = a + b_X X_i + b_E E_i + e_i \qquad (2)$$

where, in addition to the previously defined variables, $b_X$ is the regression coefficient for volunteer human impact, $E_i$ is the expertise indicator variable for observation $i$ (0 for Non-Expert, 1 for Expert), and $b_E$ is the regression coefficient for this variable. Thus, this coefficient is a measure of the difference in human impact (on aggregate) between the Non-Expert and Expert contributions. This model implicitly assumes human impact is equally predicted by experts and non-experts (i.e. the slope is uniform), and assumes a uniformity of the intercept term within each expert group, if the intercept is considered to be $a$ for the non-expert group, and $a + b_E$ for the expert one. The data provided by the volunteers were then analysed for consistency, which is a known issue in ground truthing [23]. After every 50 points, the volunteers were provided with a point they had previously validated. The average, median and standard deviation of the maximum difference between the volunteers and the controls were calculated for all control points, by expertise, by volunteer consistency in the land cover they recorded, and by confidence.
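As a rough illustration of the regression with an expertise indicator (Equation 2), the sketch below fits the three coefficients by ordinary least squares using the normal equations. The helper name `fit_ols` and the toy numbers are assumptions made purely for illustration; they are not Geo-Wiki data, and the paper does not state which software was used for the fits.

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y.

    rows: design matrix as a list of rows, e.g. [1, X_i, E_i]
    y:    list of responses (control human impact, Y_i)
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented system.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical toy data: volunteer scores X_i and expertise flags E_i
# (0 = non-expert, 1 = expert). The control scores are generated without
# noise from made-up coefficients so the fit can be checked by eye.
volunteer = [0, 10, 20, 40, 60, 80, 100, 30]
expert = [0, 1, 0, 1, 0, 1, 0, 1]
control = [5 + 0.7 * x + 2 * e for x, e in zip(volunteer, expert)]

design = [[1.0, x, e] for x, e in zip(volunteer, expert)]
a, b_x, b_e = fit_ols(design, control)
print(f"a = {a:.3f}, b_X = {b_x:.3f}, b_E = {b_e:.3f}")
```

Here `b_e` is the aggregate offset between expert and non-expert contributions, while the shared `b_x` reflects the assumption of a common slope for both groups.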
Finally, the response times of the volunteers were calculated between each successive data point they scored. The median response time was 55 secs with a first and third quartile of 32 and 100 secs respectively. The average response time was 5,226 secs, indicating a highly skewed distribution, which reflects large pauses in contributions, e.g. at the end of a validation session. Figure 3 shows the median response time per day over the course of the competition. There is a general trend towards shorter response times as the competition unfolded with the shortest response times between successive validations occurring at the end of the competition. Thus, we were interested in understanding the relationship between response time and quality of the human impact responses overall and whether there was any difference in quality towards the end of the competition.
The response time data were first pre-processed in two ways. First, all response times greater than 5 minutes were removed as these were deemed unrepresentative of typical behavior. This cutoff was based on visual inspection of the distribution; it also corresponds to the 92.5th percentile, so the majority of the data were retained. Second, response times were log transformed due to the skewness of the distribution. A linear regression equation of the form given in (1) was fit to the entire dataset where the dependent variable, $Y_i$, was the absolute difference in the answers for human impact between the control data and the volunteers' scores, and the independent variable, $X_i$, was the log of the response times, with $a$ and $b$ representing coefficients of the linear regression, and $e_i$ the error term for each observation $i$.
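The two pre-processing steps and the subsequent regression can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the base-10 logarithm is an assumption (the text only says "log"), and the numbers are invented rather than taken from the competition.

```python
import math


def preprocess(times_s, abs_deviation, cutoff_s=300.0):
    """Drop response times above the cutoff (5 minutes) and log-transform the rest.

    Returns (log10(time), absolute deviation) pairs. Base 10 is an assumption.
    """
    return [
        (math.log10(t), d)
        for t, d in zip(times_s, abs_deviation)
        if 0 < t <= cutoff_s
    ]


def simple_regression(pairs):
    """Least-squares fit of y = a + b*x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Invented response times (seconds); two of them exceed the 5-minute cutoff.
times = [10, 50, 100, 300, 1000, 5000]
# Invented absolute deviations, generated from made-up coefficients.
deviation = [10 + 2 * math.log10(t) for t in times]

kept = preprocess(times, deviation)
a, b = simple_regression(kept)
print(len(kept), round(a, 3), round(b, 3))
```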
The last 100 control points provided to the volunteers at the end of the competition were locations of cropland or agricultural land covers (the classes of Cultivated and managed and Mosaic of cultivated and managed/natural vegetation) and where high resolution images existed. In order to evaluate how volunteer performance changed with experience, only control points with agricultural land cover and where high resolution images were available were selected from the first 199 control points. The average accuracy in human impact across the first two control sets was then compared to the average accuracy of the third set using a t-test to determine whether there were any significant differences.
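A comparison of two mean deviations of this kind can be sketched with an unequal-variance (Welch) t-test. The source does not name the t-test variant; Welch's form is assumed here because the results report a non-integer number of degrees of freedom. The function name and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean A
    vb = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented absolute deviations in human impact for two groups of control points.
early_sets = [1, 2, 3, 4, 5]
final_set = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(early_sets, final_set)
print(round(t_stat, 4), round(dof, 4))
```

The p-value would then be read from a Student t distribution with `dof` degrees of freedom, which is left out here to keep the sketch within the standard library.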

Analysis of Land Cover
As in the analysis of human impact scores above, control points were used to evaluate volunteer accuracy in terms of the land cover they indicated. An error or confusion matrix was populated for all contributors (Table 2) and the overall accuracy was calculated as follows:

$$\text{Accuracy} = \frac{\sum_{i=1}^{n} x_{ii}}{\sum_{i,j=1}^{n} x_{ij}}$$

where $x_{ij}$ is the entry of the confusion matrix for volunteer class $i$ and control class $j$, and $n$ is the total number of classes. In addition, two other measures of accuracy were calculated, specific to each land cover class: user's and producer's accuracies. User's accuracy describes errors of commission or Type I errors. For example, the user's accuracy for the forest class indicates the likelihood that what was labeled as forest by the volunteers really is forest. Producer's accuracy reflects errors of omission or Type II errors. Using the forest example again, this measure reflects how well the forest cover control pixels were classified by the volunteers. These two measures are calculated as follows:

$$\text{User's Accuracy (by class } i) = \frac{x_{ii}}{\sum_{j=1}^{n} x_{ij}}$$

$$\text{Producer's Accuracy (by class } j) = \frac{x_{jj}}{\sum_{i=1}^{n} x_{ij}}$$

where $i$ is the volunteer class, $j$ is the control class and $n$ is the total number of classes. Separate accuracy measures were calculated for the three sets of control pixels (to determine whether accuracies change over time), for locations where the volunteers were the most confident, and to compare experts and non-experts.
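The three accuracy measures can be computed directly from a confusion matrix. The sketch below assumes rows are volunteer classes and columns are control classes, and that every class occurs at least once in both (otherwise a division by zero would occur). The function name and the 3-class matrix are invented for illustration and do not reproduce Table 2.

```python
def accuracies(matrix):
    """Overall, user's and producer's accuracy from a square confusion matrix.

    matrix[i][j] = number of points labeled class i by volunteers
                   and class j in the control data.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    overall = sum(matrix[i][i] for i in range(n)) / total
    # User's accuracy: diagonal over row total (errors of commission).
    users = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # Producer's accuracy: diagonal over column total (errors of omission).
    producers = [
        matrix[j][j] / sum(matrix[i][j] for i in range(n)) for j in range(n)
    ]
    return overall, users, producers


# Invented counts for three classes.
confusion = [
    [40, 5, 5],
    [10, 30, 10],
    [0, 5, 45],
]
overall, users, producers = accuracies(confusion)
print(round(overall, 3), users, producers)
```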
Contributor consistency in land cover labeling was then analysed by determining the proportion of times when the same land cover type was chosen when presented with the same data point. This was calculated for all points, by expertise, and by various degrees of confidence.
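The consistency measure described above amounts to a proportion of matching labels among repeated points, optionally broken down by a grouping variable such as expertise or confidence. A minimal sketch, with a hypothetical function name and invented records:

```python
from collections import defaultdict


def consistency_by_group(repeats):
    """Proportion of repeated points given the same land cover label, per group.

    repeats: iterable of (group, first_label, second_label) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [consistent, total]
    for group, first, second in repeats:
        counts[group][0] += int(first == second)
        counts[group][1] += 1
    return {group: same / total for group, (same, total) in counts.items()}


# Invented repeat validations; the labels reuse class names from the legend.
repeats = [
    ("expert", "Tree cover", "Tree cover"),
    ("expert", "Shrub cover", "Herbaceous vegetation/Grassland"),
    ("non-expert", "Urban", "Urban"),
    ("non-expert", "Cultivated and managed", "Cultivated and managed"),
]
rates = consistency_by_group(repeats)
print(rates)
```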
Finally, the impact of response time on the quality of land cover validations was analysed using logistic regression of the following form:

$$\ln\left(\frac{P_i}{1 - P_i}\right) = a + b X_i$$

where the probability ($P_i$) that the land cover is correctly identified is expressed as a function of response time, $X_i$, and $a$ and $b$ are the regression coefficients. The effect of response time on accuracy in the final set of controls was compared with the first and second set to determine whether contributors were more interested in scoring a greater number of points and spent less time on each data point towards the end of the competition. A two-tailed binomial test was used to test whether the number of correct classifications at the end of the competition was greater than expected based on the total number of classifications performed and the probability of correct classification in the earlier part of the competition.
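A logistic regression with a single predictor can be fit by Newton-Raphson iteration on the log-likelihood. The sketch below is an illustration only: the function name, the fixed iteration count and the toy data are assumptions, and the source does not describe the fitting procedure used. The toy data have only two distinct predictor values, so the fitted probabilities reproduce the observed proportions (0.8 and 0.5) exactly.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(P) = a + b*x by Newton-Raphson; y holds 0/1 outcomes."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. a
            g1 += (yi - p) * xi     # gradient w.r.t. b
            h00 += w                # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Invented data: log response time 1.0 -> 8 of 10 correct,
#                log response time 2.0 -> 5 of 10 correct.
log_time = [1.0] * 10 + [2.0] * 10
correct = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5

a, b = fit_logistic(log_time, correct)
odds_ratio = math.exp(b)  # multiplicative change in odds per unit of x
print(round(a, 4), round(b, 4), round(odds_ratio, 4))
```

Exponentiating `b` gives the change in the odds of a correct classification per unit increase in the predictor, which is how a coefficient of this kind is usually interpreted.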

Human Impact
The result of the regression described in Equation 1 to determine how well the degree of human impact can be predicted by the contributors based on the control points is provided in Table 3. This shows that b differs significantly from zero and is positive but less than 1 suggesting that there is evidence that the users underestimated the degree of human impact by roughly 30 percent.
The results of including an indicator variable describing respondent expertise (Equation 2) are shown in Table 4. The slopes are still positive and suggest that allowing for expertise even in a simple way changes the results relating to the slope term. To investigate this further, Equation 1 was extended to include variables describing expertise. Although computed together, this effectively splits the regression into two models (one for each of the expertise groups) and the results are shown in Table 5. These results indicate that there is little variation in the degree to which the expert and non-expert groups underestimated the degree of human impact. Figure 4 shows the distribution of human impact scores for the control pixels and the contributor data by land cover class. It shows a general trend for contributors to underestimate the degree of human impact across the different land cover types with the exception of (5) Mosaic of cultivated and managed/natural vegetation.
A further analysis explored how human impact scores varied with land cover class. The standard regression described in Equation 1 was extended to include indicators for the land cover classes. Since there was only a small number of data points classified as Open water, Barren or Urban, these classes were excluded from the regression analysis. The results for the remaining five land cover types are shown in Table 6 and Figure 5 plots the contributed against the control human impact scores with the regression coefficients for different land cover classes.
The results show that the prediction of the degree of human impact varies with land cover classes. The coefficients for the Herbaceous vegetation/Grassland class most strongly predict human impact, the coefficients for the Shrub cover class are the weakest predictors and all classes underestimate human impact. This indicates that the conceptualizations of these classes may need to be more clearly defined and perhaps more training examples used to illustrate the different degrees of human impact by land cover type. Table 7 shows the results of the consistency analysis. Overall the contributors were consistent in their answers regarding the degree  of human impact, with an average deviation of less than 10% (i.e. 9.6%) although the spread of answers was higher at 17.4%. When expertise was considered, non-experts had a lower average deviation than the experts by just under 3%. When the consistency was extended to land cover, those pixels which showed consistent choices in land cover had a lower average deviation in human impact by 8.3% compared to those which showed inconsistency in land cover choice. This reflects pixels that were clearly more difficult to identify. Finally, when contributors were the most confident in their choice of human impact, they were also more consistent (average deviation of 7.9%), with consistency decreasing as confidence decreased resulting in an average deviation of as much as 25.9% for the least confident category. This analysis of consistency serves to highlight the need to examine those pixels which were not consistently labeled and which are probably more difficult to judge in terms of both human impact and land cover, which can then be used to help train the volunteers. 
The results of the regression analyzing the effect of response times are shown in Table 8 and indicate that the agreement between the volunteers and the control pixels increased significantly with a faster response time for human impact, although the effects were small. For each increase in magnitude in response time, the agreement between the crowd and the control pixels increased in accuracy by 1.4%. The average deviation in human impact for pixels of (4) Cultivated and managed and (5) Mosaic of cultivated and managed/natural vegetation and high resolution imagery from the first two control sets was 17.1%. This was compared to the third set of control data points (consisting of only these pixel types) and the average deviation in human impact was lower, decreasing to 14.7%. A t-test confirmed that the means are significantly different from one another (p<0.0001; t = 24.8533; degrees of freedom = 3326.222) and showed that accuracy in human impact actually increased at the end of the competition. Thus, these analyses indicate that there are no particular concerns over quality in relation to response time.

Land Cover
The overall accuracies for the three sets of control points labeled C1, C2 and C3 are presented in Table 9 for the full dataset, considering only those contributions where confidence was high (i.e. 'sure' on the slider bar) and then disaggregated by expertise (i.e. experts or non-experts).
Considering all three sets of control data, accuracy varies between 66 and 76%. There is little difference between the first and second set of controls but there is a marked increase in accuracy for the final set (C3) with 76%. This is unsurprising since the final control sample was drawn from high resolution imagery. When taking only those answers where the volunteers indicated high confidence (or 'sure' on the slider bar), there was around a 3% increase in the accuracy to 69%. Unlike with human impact, experts were more accurate than non-experts, e.g. 62% for non-experts and 69% for experts for C1 with even larger differences observed for C2 and C3. This suggests that extra training should be provided to those individuals with a non-expert background. As training manuals are often unread or rarely consulted, a more interactive approach could be introduced such that the volunteers are made aware of their errors as they progress through a competition. In addition, a forum could be set up to discuss pixels that present difficulties in identification, particularly for non-experts. Table 10 shows the user's and producer's accuracies for the five most common land cover types in the dataset. Overall the results show that there is generally an increase in the accuracy across control sets although C3 should only really be considered for cropland and mosaic classes. The lowest accuracies are in shrub cover, grassland/herbaceous and the mosaic cropland class, which indicates the need to provide more examples of how these classes appear on Google Earth within the training materials as the volunteers are confusing these classes more often than others.

Table 9. Accuracy of land cover (in %) based on comparison of volunteer response with three sets of controls.
When considering points where the volunteer had a high confidence, the patterns are similar and there is generally an increase in accuracy although the mosaic cropland class continues to be more problematic, with a decrease in the user's accuracy across control sets. Finally, the effect of expertise on land cover classification accuracy produced variable results depending upon the land cover type and the control set considered. For the forest class, the non-experts improved in their ability to correctly identify forest by the second set of controls, while the experts actually showed a decrease in the producer's accuracy. Similarly, for the shrub class, the non-experts showed a greater level of improvement in the second set of controls compared to the experts and outperformed them in terms of both user's and producer's accuracy in C2. The experts were better than non-experts at identifying herbaceous, cropland and mosaic but once again there were differences in the user's and producer's accuracies. By building up a picture of where experts and non-experts have differing performance by land cover class, we can tailor the kinds of training materials provided to the volunteers, focusing on areas where greater problems in identification lie.
Similar to human impact, a further analysis was then undertaken on a subset of the data where the volunteers were provided with the same pixels at different times in the competition (Table 11). The results show that the volunteers were consistent in their response 76.1% of the time; this was slightly lower for experts (75.7%) and slightly higher for non-experts (76.7%). A very minor increase to 77.6% was observed when considering only those pixels where the volunteer was sure but when the volunteers were less sure or unsure about their responses, their consistency in response decreased to 66.7%.
The final analysis concerned the relationship between quality in land cover classification and response time. The results showed that the crowd was 40% more likely to disagree with the control for each order of magnitude increase in response time (p<0.0001) as shown in Table 12 and indicated by the value of b.
Considering the issue of whether quality in land cover validation (and therefore accuracy) decreased near the end of the competition, we compared the probability that the volunteers agreed with the control pixels for land cover types (4) Cultivated and managed and (5) Mosaic of cultivated and managed/natural vegetation at the end of the competition (75.9%) with that from the early to middle part of the competition (70.6%). This difference was determined to be highly significant (p<0.0001; number of trials = 1500; number of successes = 1139) using a binomial test and therefore the accuracy in estimating land cover actually increased in the final stages of the competition. Thus for both human impact and land cover, there are no concerns about the quality decreasing near the end of the competition with a faster response time.
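The binomial comparison can be checked approximately with an exact two-sided test built from the reported figures (1139 successes in 1500 trials, against an earlier success probability of 70.6%). This is a sketch under assumptions: the two-sided rule used here (summing all outcomes no more probable than the observed one) is one common convention and may differ from the authors' software, and 0.706 is the rounded value quoted in the text. The function names are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the binomial probability mass, using lgamma to avoid underflow."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def binom_two_sided(k, n, p):
    """Exact two-sided p-value: total mass of outcomes no likelier than k."""
    observed = binom_log_pmf(k, n, p)
    tolerance = 1e-7  # guards against floating-point ties
    total = sum(
        math.exp(lp)
        for lp in (binom_log_pmf(i, n, p) for i in range(n + 1))
        if lp <= observed + tolerance
    )
    return min(1.0, total)


trials, successes = 1500, 1139   # figures reported in the text
p_earlier = 0.706                # rounded earlier-competition accuracy

observed_rate = successes / trials
p_value = binom_two_sided(successes, trials, p_earlier)
print(f"observed rate = {observed_rate:.3f}, p < 0.0001: {p_value < 1e-4}")
```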

Conclusions
This paper assessed the quality of crowdsourced data collected through a Geo-Wiki competition. Volunteers identified the degree of human impact and classified land cover at random locations using Google Earth images. Quality was assessed by comparing volunteer results with results agreed by experts at a number of control points. Control points were provided to volunteers at the beginning, middle and end of the competition. The results showed that there is little difference between experts and non-experts in identifying human impact while experts were better than non-experts in identifying land cover. However, the results for both varied by land cover type and through the competition. For example, experts were better than non-experts at identifying shrub land cover at the start of the competition but non-experts improved more than experts and then outperformed them in shrub cover identification by the middle of the competition, indicating that volunteers were learning over time. The volunteers were shown to be reasonably consistent in their characterizations of human impact and land cover with non-experts outperforming the experts in terms of human impact and vice versa for land cover. Moreover, when contributors were confident in their choice of human impact, they were also more consistent, and unsurprisingly, consistency decreased as confidence decreased. Finally, shorter response times (as observed towards the end of the competition) did not have a negative impact on quality, and volunteers were therefore not sacrificing quality for the desire to complete more locations and thereby win the Geo-Wiki competition. Thus overall, the non-experts were as reliable in what they identified as the experts were for certain, identifiable situations, and the reliability of the information provided by non-experts improved faster and to a greater degree than experts.
Thus, better, targeted training materials and a continual learning process built into the competition might help address these issues. Also, allowing volunteers to reflect on the information they contribute, for example by regularly feeding back evaluations of their data through the use of control points or by making additional material available to them, would also potentially decrease differences between experts and non-experts, particularly in the classification of land cover. The findings of this research relating to the differences between expert and non-expert citizens are also relevant to other areas of research that seek to benefit from the advantages of citizen science. For example, recent activity such as the umbrella Zooniverse project (http://www.zooniverse.org) promotes collaborative projects in many areas of social and physical science research. Currently, registration to its projects captures no information about the contributor, their training or their socio-economic context. Approaches that include information about participant background, control points, reflection, repetition, etc. have broad potential for other citizen science projects that involve classification or identification, e.g. [24,5] where experts can be used to build a database of controls for monitoring and learning purposes.
The next step in this research is to develop robust measures of quality for each location in the crowdsourced database based on rules that take into account the number of times that contributors have provided information at a given location along with the consensus in the answers, their expertise and the confidence in the answers provided. However, the results from this study suggest the need for more nuanced approaches than a simple Linus' Law or mass of evidence approach (which have been previously suggested in this domain) for determining when to believe the crowd and therefore when the information they provide can be used with confidence. Formal methods for combining evidence such as Bayesian probability, Dempster-Shafer theory of evidence, Possibility Theory and Endorsement theory provide different ways for combining or partitioning evidence. They allow measures of certainty and uncertainty to be generated and provide different measures of confidence in aggregated information and for determining when the weight of evidence indicates that crowdsourced data or VGI are 'believable'. Since the relationship between reliability and confidence was found to be strong in this research, this also suggests that future activities seeking to incorporate crowdsourced data should capture measures of contributor confidence in the information they provide. Ongoing research by the authors will investigate these areas in more detail.