Examining the use of Amazon’s Mechanical Turk for edge extraction of the occlusal surface of fossilized bovid teeth

To reconstruct the environments associated with Plio-Pleistocene hominins in southern Africa, researchers frequently rely on the animals found with the hominins, in particular animals in the Family Bovidae. Bovids in southern Africa are typically identified by their teeth. However, identifying the taxon of a bovid tooth is challenging due to various biasing factors, and inaccurate identification of fossil bovids can have significant consequences for the reconstructed paleoenvironment. Recent research on the classification of bovid fossil teeth has used elliptical Fourier analysis to summarize the shape of the outline of the occlusal surface of the tooth, together with the resulting harmonic amplitudes. Currently, an expert in the field must manually place landmarks around the edges of each tooth, which is time consuming. This study tests whether it is possible to crowdsource this task while maintaining the level of quality needed to perform a statistical analysis on each tooth. Amazon Mechanical Turk workers place landmarks on the edge of each tooth, and their performance is compared with that of an expert in the field. The results suggest that crowdsourcing the digitization process is reliable and replicable. With the technical aspects of digitization managed, researchers can concentrate on analyzing and interpreting the data.


Introduction
Reconstructing past environments associated with early hominins is essential for understanding human evolution and is valuable for identifying habitat preferences, diet, and ecological relationships between hominins and other species. In order to reconstruct past environments, paleoanthropologists commonly rely on the animals that are found associated with the hominins. Animals in the Family Bovidae such as antelopes and buffalo are particularly useful for this task due to their strict ecological tendencies [1][2][3]. In addition, bovid remains, in particular isolated teeth, are among the most common fossils found in southern Africa. However, identifying bovid teeth in the fossil record is complicated by biasing factors such as attrition and sex [4]. Overlap exists in the form (i.e. size and shape) of bovid teeth, making it difficult to identify the taxon and, therefore, difficult to reconstruct the past environment [4]. The purpose of this study is to demonstrate a reliable, replicable, uncomplicated method for extracting the form of the occlusal surface of bovid teeth, which can then be used to identify teeth in the fossil record. Several recent studies have demonstrated that morphometrics is particularly useful for documenting biological shape [5][6][7][8][9][10][11]. This new methodology extracts edges by relying on crowdsourcing. The outlines are then used in supervised machine learning techniques in conjunction with elliptical Fourier analysis (EFA) [12].
Ideally, edge extraction of the occlusal surface of these teeth would be performed using automated procedures based on techniques such as those described in [13] or [14]. However, in this specific setting automated methods are difficult to use, as these techniques often identify the bottom of a tooth as the edge rather than the actual occlusal surface.
Previously, [1] performed a study to standardize the identification of bovid teeth using EFA. While successful in identifying bovid taxa, the process used to extract the outlines was tedious and time consuming. In order to extract the outline of a tooth, an image was imported into a digitizer program, MLmetrics [15], where 60 points were manually placed around the tooth according to a template so as to maintain homology. The points were then exported and analyzed in a Fourier analysis program [16]. The study generated occlusal outline information for over 7000 extant and fossil teeth. However, the results could not be easily used to identify fossils from new sites because edge extraction was so time consuming. The present study provides results of an exploratory analysis that employs Amazon's Mechanical Turk platform [17] as a method to crowdsource the edge extraction of bovid teeth.
In this study, the digitized outlines of an expert in the field, the co-author Juliet K Brophy (JKB), are compared with up to three outlines extracted by Amazon Mechanical Turk workers. The results of this preliminary study suggest that crowdsourcing the digitizing process is reliable and replicable. Furthermore, this streamlined process allows more teeth to be processed in a timely manner and relieves researchers of a technical task, freeing them to focus on aspects of the project that require expertise, such as analyzing and interpreting the data.

Related work
Mechanical Turk [17] was introduced by Amazon.com, Inc. in 2005. As such, there is a relatively limited body of scholarly work exploring the uses of the platform. The projects that address task quality assessment, the focus of this study, can be divided into two categories: assessing survey response accuracy and annotating digital images.

Assessing survey response accuracy
Studies in this category investigate how accurate survey responses from Mechanical Turk workers are. These analyses aim to answer questions such as: How closely do Mechanical Turk surveys reflect surveys distributed using more traditional methods? [18]; How honest are Mechanical Turk workers in their responses? [19,20]; and Does Mechanical Turk provide researchers with a more diverse response pool than the mainstay of distributing surveys to college students with the promise of extra credit? [21]. In [20], Mechanical Turk is used to combine the speed and cost-effectiveness of a simulated study with the authenticity of human behavioral studies when analyzing human cooperation. The study claims that prior to Mechanical Turk and the ability to crowdsource data collection, most evolutionary models were based on simulations or mathematical algorithms due to the lack of survey labs and a consistent subject pool. With its use, however, researchers can request a task to be done and collect results entirely online, much in the same way a simulation study is conducted. With that said, [20] mentions that a major concern of using Mechanical Turk is the lack of control researchers have over their subjects. It is possible, for instance, for subjects to answer a question incorrectly because they did not understand it. Additionally, subjects are completely free to leave in the middle of the survey. After conducting a number of experiments, both online and in person, [20] found that these limitations had a very small effect on the results.
In a similar study, [21] conducted an experiment comparing the performance of Mechanical Turk workers with that of subjects in a controlled laboratory setting on an acceptability judgment task. The main concern addressed in [21] is that additional noise, introduced by using Mechanical Turk, might detract from the power of the experiment. To help control for this, they introduced a rejection criterion: Mechanical Turk workers were required to be native English speakers, which resulted in a 15% rejection rate. [21], like [20], states that another major concern in the use of Mechanical Turk is the inability to establish whether or not the Turker understood the task, possibly resulting in inaccurate data. It concluded, however, that using Mechanical Turk is comparable to laboratory research as long as a mechanism exists to reject certain responses.
Additional information on testing best practices when using Mechanical Turk in survey research can be found in [19], which evaluates how various factors affect the reliability of responses, and [18], which compares the demographics of Mechanical Turk respondents to national demographics.

Annotating digital images
Work in this category evaluates the quality of edge extraction performed through Mechanical Turk. Two of the primary works related to this topic are [22] and [23]. [22] explored the use of Mechanical Turk in image classification, focusing on techniques for automatically "cleaning" the data sets. They demonstrate that by using multiple methods for measuring the accuracy of annotations they can outperform other methods that rely on a single measure. They also demonstrate that image classification can be performed with high levels of accuracy when using Mechanical Turk workers to extract the edges of images. Further, classification accuracy can be improved by over 7% by cleaning the data using the techniques considered in that study. [23] evaluates various annotation techniques with the goal of maximizing quality while minimizing cost. This research used landmark-based edge extraction and a gold standard method of grading. Landmark extraction, or annotation, involves having a Turker place a number of points along the border of an image. Once the outline is extracted, it can be tested for quality against an outline annotated by an expert, which is referred to as the "gold standard" grading technique. Although it was not used in that particular study, [23] also mentions grading outlines based on their distance from the mean image produced by multiple Mechanical Turk workers, which may be useful as it eliminates the need for expert tracing.

Methods
This exploratory study includes a sample of 96 teeth of known species from four different tribes: Alcelaphini, Bovini, Hippotragini, and Neotragini. These teeth were obtained from the Ditsong Museum (TM) (formerly Transvaal Museum) and the National Museum of Bloemfontein (NMB), South Africa. The complete repository information is in Table 1. Permission to use and photograph these specimens was received by JKB from both institutions. Permits are not required to look at extant bovid specimens in South Africa; therefore, no permits were required for the described study, which complied with all relevant regulations.
We investigated three mandibular molars (LM1, LM2, LM3) and two maxillary molars (UM2, UM3). Details of the data are shown in Table 2. An example of the raw image of a tooth prior to extraction can be seen in the left side of Fig 1. Prior to being digitized by a Turk worker, all of the teeth were scaled to each other.

HIT protocol
Amazon states: "A Human Intelligence Task, or HIT, is a question that needs an answer. A HIT represents a single, self-contained task that a Worker can work on, submit an answer, and collect a reward for completing" [24]. Specifically in this setting, the Mechanical Turk worker downloads the image of a bovid tooth and opens it in the freeware GIMP (the GNU Image Manipulation Program) [25]. Several programs were tested for obtaining the polygon, and GIMP produced the best results. Next, the Mechanical Turk worker selects the lasso tool, which allows a polygonal selection to be made around the tooth. Once the bounding polygon has been created, the worker cuts and pastes the extracted selection onto a blank canvas. This shape is then filled in with black using the bucket fill tool in GIMP, creating a black and white image of each tooth in which the interior of the tooth is black and the background is white. The resulting file is then saved to the worker's computer and uploaded to the link provided in the HIT.
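Because each upload is a filled black silhouette on a white background, the outline of the tooth can be recovered from it mechanically. The following is a minimal sketch of that idea in plain Python, assuming the image has already been thresholded into a grid of 0/1 values; the function name `boundary_pixels` and the toy grid are hypothetical, and this is not the routine used in the study (which relied on the Momocs package in R).

```python
def boundary_pixels(mask):
    """Return (row, col) of foreground pixels that touch background.

    `mask` is a list of rows; 1 marks tooth (black), 0 marks background (white).
    A 4-neighbourhood is used, and pixels outside the grid count as background.
    """
    rows, cols = len(mask), len(mask[0])
    edge = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not mask[rr][cc]:
                    edge.append((r, c))
                    break
    return edge


# Toy 5x5 silhouette standing in for a filled tooth image (made-up data).
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
toy_edge = boundary_pixels(toy_mask)
```

Note that the pixels come back in scan order rather than in order around the outline, so a contour-following step would still be needed before any outline analysis.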

Processing the Mechanical Turk output
For every raw image of a tooth considered in this study, Mechanical Turk workers were asked to extract the outline of the occlusal surface in GIMP [25]. This process was repeated 3 times for each tooth. (Mechanical Turk workers were used only to trace images of bovid teeth. No personal information relating to any Mechanical Turk worker was collected.) The output from each worker, a black and white image of the tooth, was converted to an outline and summarized using EFA [12], which represents the x- and y-coordinates of a closed outline as truncated Fourier series in a parameter $t$ that runs from 0 to $2\pi$ around the outline:

$$x(t) = A_0 + \sum_{h=1}^{H} \left[ a_h \cos(ht) + b_h \sin(ht) \right], \qquad y(t) = C_0 + \sum_{h=1}^{H} \left[ c_h \cos(ht) + d_h \sin(ht) \right],$$

where $H$ is the number of harmonics used, $A_0$ and $C_0$ are constants, and $a_h$, $b_h$, $c_h$, and $d_h$ are the amplitudes associated with the $h$-th harmonic, $h = 1, 2, \ldots, H$. Since EFA is not a landmark-based procedure, the initial ordering of the points does not hinder the estimation of the harmonics. Next, so that landmark-based analysis can be performed, we used the estimated harmonics to output a specific number of points around the edge of each tooth, all beginning at the same location. These points act as landmarks, which were used to calculate the Riemannian distance between the shapes traced by Mechanical Turk workers and the shape traced by the expert. Additionally, the amplitudes (i.e. $a_h$, $b_h$, $c_h$, and $d_h$) estimated in EFA can be used as input features in machine learning algorithms to classify the teeth to tribe and species. Since classifying these teeth is the ultimate aim, the performance of classifiers based on the work of Mechanical Turk workers was compared to the classification accuracy when the model was trained using the outlines traced by the expert. The classification algorithm considered here was random forests [28]. The tracings from the Mechanical Turk workers and the expert were compared to assess how similar they are and to assess differences in predictive accuracy.
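To make the role of the harmonics concrete, the sketch below estimates elliptical Fourier coefficients for a closed polygon and then regenerates a fixed number of evenly parameterized points from them. It is a plain-Python illustration using the standard Kuhl and Giardina formulas, not the Momocs implementation used in the study; the function names `efa_coefficients` and `efa_points` are hypothetical, and the treatment of the constant terms as arc-length-weighted means is an assumption of this sketch.

```python
import math


def efa_coefficients(points, n_harmonics):
    """Elliptical Fourier coefficients of a closed polygon.

    Returns (A0, C0, [(a_h, b_h, c_h, d_h), ...]) for h = 1..n_harmonics.
    """
    k = len(points)
    segments = []  # (dx, dy, dt, t_start, t_end, x0, y0), closing edge included
    t = 0.0
    for i in range(k):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % k]
        dx, dy = x1 - x0, y1 - y0
        dt = math.hypot(dx, dy)
        if dt == 0.0:  # skip repeated points
            continue
        segments.append((dx, dy, dt, t, t + dt, x0, y0))
        t += dt
    total = t  # perimeter of the polygon

    # Constant terms: arc-length-weighted mean of x and y along the polygon.
    a0 = sum((x0 + dx / 2) * dt for dx, dy, dt, ts, te, x0, y0 in segments) / total
    c0 = sum((y0 + dy / 2) * dt for dx, dy, dt, ts, te, x0, y0 in segments) / total

    harmonics = []
    for h in range(1, n_harmonics + 1):
        w = 2 * math.pi * h / total
        scale = total / (2 * h * h * math.pi ** 2)
        a = b = c = d = 0.0
        for dx, dy, dt, ts, te, x0, y0 in segments:
            dcos = math.cos(w * te) - math.cos(w * ts)
            dsin = math.sin(w * te) - math.sin(w * ts)
            a += dx / dt * dcos
            b += dx / dt * dsin
            c += dy / dt * dcos
            d += dy / dt * dsin
        harmonics.append((scale * a, scale * b, scale * c, scale * d))
    return a0, c0, harmonics


def efa_points(a0, c0, harmonics, n_points):
    """Evaluate the truncated series at n_points equally spaced parameter values."""
    out = []
    for j in range(n_points):
        theta = 2 * math.pi * j / n_points
        x, y = a0, c0
        for h, (a, b, c, d) in enumerate(harmonics, start=1):
            x += a * math.cos(h * theta) + b * math.sin(h * theta)
            y += c * math.cos(h * theta) + d * math.sin(h * theta)
        out.append((x, y))
    return out
```

In this sketch the regenerated points always start at the parameter value that corresponds to the first input vertex, so making every tracing "begin in the same location" would still require a normalization step that is not shown here.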

Evaluation of Mechanical Turk work
In order to measure the tracing error, the Riemannian distance [29] was calculated between each tracing generated by a Turker and the expert tracing of the same tooth. To do this, we first extracted the edges of the black and white images using the "import_jpg" function in the "Momocs" [27] package in R [30]. This creates a given number of (x, y)-coordinates for the outline of each black and white image. However, the ordering of these points may not line up correctly with the ordering of another tracing of the same tooth. To address this, EFA was applied to each outline to estimate its harmonics, a step that does not depend on the initial ordering of the points. These harmonics can then be used as input to the function "efourier_shape" to output 150 (x, y)-coordinates, which act as landmarks around each tooth, so that a direct comparison can be made between the Mechanical Turk tracings and the tracings performed by the expert.
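For two-dimensional landmarks, the Riemannian shape distance has a compact closed form: after each configuration is centred and scaled to unit size, the distance is the arccosine of the modulus of their complex inner product, which removes the effect of rotation. The sketch below is a plain-Python illustration of that formula; the function names are hypothetical, and it assumes both tracings have the same number of landmarks in corresponding order, as the EFA step is meant to provide.

```python
import math


def preshape(landmarks):
    """Centre a 2-D configuration and scale it to unit size (complex form)."""
    z = [complex(x, y) for x, y in landmarks]
    centroid = sum(z) / len(z)
    z = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in z))
    return [p / size for p in z]


def riemannian_distance(shape_a, shape_b):
    """Riemannian shape distance between two matched 2-D landmark sets.

    The result lies between 0 (identical shape) and pi/2.
    """
    if len(shape_a) != len(shape_b):
        raise ValueError("both configurations need the same number of landmarks")
    za, zb = preshape(shape_a), preshape(shape_b)
    inner = sum(p.conjugate() * q for p, q in zip(za, zb))
    # Clamp to guard against rounding slightly above 1.
    return math.acos(min(1.0, abs(inner)))
```

Because translation, scale, and rotation are all removed, a tracing that is merely shifted or resized relative to the expert's outline scores a distance near zero; only genuine differences in shape contribute.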
Ultimately, the goal of tracing these outlines is to accurately classify the tribe and species that these teeth represent. Previous work [31] compared five different machine learning algorithms based on their performance classifying teeth into tribe and species. Here, we consider only random forests for classification of tribe in order to compare the tracings created by Mechanical Turk workers to the tracings created by JKB.

Tracing error
The Riemannian distances between the Turker tracings and the expert tracings ranged from 0.01113 to 1.113, with a median of 0.1154. A histogram of this distribution can be seen in Fig 4. The distribution is heavily right-skewed: most Mechanical Turk workers traced the outline with only a small amount of error, and half of the tracings have a distance below 0.1154. For reference, Figs 5 and 6 show two examples of the work of Mechanical Turk workers, with outlines in red, yellow, and blue, compared to the gold standard, which is shown in black. In Fig 5,

Predictive accuracy
The histogram in Fig 9 depicts the classification accuracy results from the crowdsourced tracings. These results were created by repeatedly sampling one of the at most three tracings per tooth in order to make a data set. Leave-one-out cross validation was then performed using random forests. Accuracy of the model was quantified using a log-loss score, comparing the predicted class to the actual observed class. From the histogram, it can be seen that if only one Turker tracing were used for each image, the classifier would perform consistently and considerably worse than with the expert's tracings. The best sample has a log-loss of roughly 0.85, the mean is closer to 1.3, and the worst case is nearly 1.5.
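As a reminder of what the log-loss score measures, the sketch below computes multi-class log-loss from predicted class probabilities: the average negative log of the probability assigned to the true class, so lower is better. It is a plain-Python illustration that assumes natural logarithms and a small clipping constant; the function name and the dictionary layout of the probabilities are hypothetical, not taken from the study's code.

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    `predicted_probs` holds one dict per specimen mapping class name to
    predicted probability. Probabilities are clipped to [eps, 1] so that a
    confident wrong prediction gives a large but finite penalty.
    """
    if len(true_labels) != len(predicted_probs):
        raise ValueError("need one probability dict per label")
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs.get(label, 0.0), eps), 1.0)
        total -= math.log(p)
    return total / len(true_labels)
```

Under these assumptions, a classifier that always assigns probability 1/4 to each of four classes scores ln 4, about 1.386, which gives a reference point for reading log-loss values in a four-class problem.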
The dotted line labeled "Mean of MTurk" was calculated by classifying the average shape of the Turkers' outlines after eliminating obviously incorrect tracings. This improves on even the best sample of individual workers: using the average image lowered the log-loss to 0.7788 for classifying the tribe.
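One simple way to form an average shape from several tracings of the same tooth is to take the pointwise mean of their corresponding landmarks. The sketch below illustrates that idea in plain Python; it assumes the tracings have already been reduced to the same number of landmarks in the same order and that incorrect tracings were removed beforehand. The text does not spell out the exact averaging procedure, so this is an assumption of the sketch, and the function name `mean_shape` is hypothetical.

```python
def mean_shape(tracings):
    """Pointwise mean of several landmark sets with matching length and order."""
    if not tracings:
        raise ValueError("need at least one tracing")
    n = len(tracings[0])
    if any(len(t) != n for t in tracings):
        raise ValueError("all tracings need the same number of landmarks")
    k = len(tracings)
    return [
        (sum(t[i][0] for t in tracings) / k, sum(t[i][1] for t in tracings) / k)
        for i in range(n)
    ]
```

Averaging tends to cancel independent errors made by individual workers, which is one plausible reading of why the averaged outline classifies better than any single worker's tracing.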
Using the expert's tracings reduces the log-loss further, to 0.6689, which is to be expected. While this is an improvement over the Mechanical Turk workers, we argue that the log-loss of the averaged crowdsourced outlines is close enough to the expert's to be of use: the time saved by crowdsourcing the extraction of the edges is worth a small trade-off in classification accuracy.
We also evaluated the classification performance of the outlines obtained by averaging all of the Mechanical Turk workers (excluding images where the Riemannian distance from the expert was greater than 0.2) together with the expert. This slightly improved the log-loss relative to the average of the Mechanical Turk workers alone, to 0.7524; however, the expert alone still has the lowest log-loss. Finally, we consider results in terms of misclassification rather than log-loss. Table 3 shows the misclassifications for JKB alone. Using only those tracings, the model correctly classified 79% of the specimens in cross validation. A large amount of the error occurred between Alcelaphini and Hippotragini. Namely, of the misclassified observations, 75% were either actually Alcelaphini but classified as Hippotragini, or actually Hippotragini but classified as Alcelaphini. Table 4 shows the misclassification results for the average image from the Turkers. The model correctly classified the Turker results 74% of the time. Once again, the largest source of confusion was between Hippotragini and Alcelaphini. Table 5 shows the results when the outlines of the Mechanical Turk workers were averaged with the gold standard. Somewhat surprisingly, this result was worse in terms of misclassification than the other two specifications considered here, with a classification rate of 68%, in spite of being better in terms of log-loss than using the Turk outlines only.
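The two summaries quoted above, the overall classification rate and the share of errors attributable to one pair of tribes, can both be read off a confusion matrix. The sketch below shows the arithmetic in plain Python. The counts in `toy_confusion` are made up purely for illustration and are not the values from Tables 3 to 5; the function names are hypothetical.

```python
def accuracy(confusion):
    """Fraction of specimens on the diagonal of confusion[true][predicted]."""
    correct = sum(row.get(true, 0) for true, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def pair_share_of_errors(confusion, class_a, class_b):
    """Fraction of all misclassifications that confuse class_a with class_b."""
    errors = sum(
        count
        for true, row in confusion.items()
        for predicted, count in row.items()
        if predicted != true
    )
    pair = confusion[class_a].get(class_b, 0) + confusion[class_b].get(class_a, 0)
    return pair / errors


# Made-up counts for illustration only (rows: true tribe, columns: predicted).
toy_confusion = {
    "Alcelaphini": {"Alcelaphini": 8, "Hippotragini": 2},
    "Hippotragini": {"Alcelaphini": 2, "Hippotragini": 7},
    "Bovini": {"Bovini": 5, "Neotragini": 1},
    "Neotragini": {"Neotragini": 6},
}
toy_accuracy = accuracy(toy_confusion)
toy_pair_share = pair_share_of_errors(toy_confusion, "Alcelaphini", "Hippotragini")
```

Classification rate and log-loss can rank models differently because the former only counts the top predicted class while the latter also penalizes how confident each prediction was.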

Discussion
The results of this study suggest that the proposed method can substantially decrease the amount of subjectivity in bovid tooth identification and advance the field of paleoanthropology/zooarchaeology. The importance of this method should not be understated. As mentioned previously, bovids have different ecological requirements. Therefore, misidentified bovids can lead to incorrect paleoenvironmental reconstructions. For example, three researchers analyzed the bovid fauna from the South African site of Makapansgat and proposed paleoenvironmental reconstructions for Member 3 [32][33][34]. Although each researcher relied upon the same assemblage to form their reconstruction, the papers suggest different paleoenvironments: shrub-like with nearby open grasslands [32]; woodland [33]; and bushland with riparian woodland and nearby limited wetlands [34]. Reconstructions like these are used to discuss hominin behavior as well as speciation and extinction events. In fact, until recently it was commonly thought that one early human ancestor, Australopithecus robustus, went extinct because it was a habitat specialist that could not survive in fluctuating environmental conditions [4]. By more accurately identifying the bovids from sites associated with A. robustus using morphometrics, [4] was able to demonstrate that this hominin lived in a variety of habitats that changed over time; A. robustus was more likely a habitat generalist. Therefore, the hypothesis that A. robustus went extinct because it was a habitat specialist requires rethinking. If even a fraction of these subjectivity problems is solved with this new methodology, the field will advance, and more accurate paleoenvironmental reconstructions and interpretations will be made.
With that said, some preliminary issues exist with this methodology. First, if a large number of teeth need to be traced with replicates of each tooth, this process can become expensive. In the future, ideally, we will be able to leverage modern computer vision algorithms to extract the edges of these teeth with little or possibly no human aid. Second, some teeth are more difficult for a lay person to trace (e.g. LM1), and those teeth may still need to be traced by an expert, or at least by someone who has received more training than the average Mechanical Turk worker. This result is not unexpected, as this method is not designed to completely replace all other forms of tooth identification; rather, it is intended to provide objective, reliable classifications of bovid teeth and to supplement and be supplemented by other forms of tooth identification, as needed. Regardless of these problems, the benefits of employing this method and decreasing the subjectivity involved in bovid tooth identification far outweigh the issues.

Conclusion
This study demonstrates that by taking the average shape of the outlines traced by multiple Mechanical Turk workers, we can quickly obtain an outline of the occlusal surface of a tooth that performs similarly to the expert's in terms of classification. A database was created of 96 different teeth along with the associated ground truth tracings done by an expert. Once outlines traced by non-experts through Amazon's Mechanical Turk were collected, each outline was imported into R and landmarks were lined up for comparison using EFA. The accuracy of the tracings was evaluated by calculating the Riemannian distances between the landmarks on the crowdsourced outlines and those on the outlines generated by the expert. Further, predictive accuracy was assessed using leave-one-out cross validation with random forests on a small subset of the data. We find that in terms of log-loss the tracings performed by the expert, while superior, were not substantially better than the average of the Mechanical Turk workers. In terms of classification accuracy, we measured a 74% classification rate using the average of the tracings of the Mechanical Turk workers, which is close to the classification accuracy of 79% when using the tracings generated by the expert. The results suggest that this process can be useful for researchers in many scientific areas (e.g. anthropology, paleontology, and zooarchaeology) who need quick, objective classifications for teeth recovered in the field. Further, one area of future work we are particularly interested in is the analysis and classification of partially observed teeth (i.e. broken teeth). We believe that the method explored here can be extended to the case when teeth are broken.