The curse of observer experience: Error in noninvasive genetic sampling

Noninvasive genetic sampling (NGS) is commonly used to study elusive or rare species where direct observation or capture is difficult. Little attention has been paid to the potential effects of observer bias while collecting noninvasive genetic samples in the field, however. Over a period of 7 years, we examined whether different observers (n = 58) and observer experience influenced detection, amplification rates, and correct species identification of 4,836 gray wolf (Canis lupus) fecal samples collected in Idaho and Yellowstone National Park, USA and southwestern Alberta, Canada (2008–2014). We compared new observers (n = 33) to experienced observers (n = 25) and hypothesized that experience would increase the overall success of using NGS techniques in the wild. In contrast to our hypothesis, we found that new individuals were better than experienced observers at detecting and collecting wolf scats and at distinguishing wolf scats from those of other sympatric carnivores present in the study areas. While adequate training of new observers is crucial for the successful use of NGS techniques, attention should also be directed to experienced observers. Observer experience could be a curse because of its potential negative effects on NGS data quality arising from fatigue, boredom, or other factors. The ultimate benefit of an observer to a project is a combination of factors (e.g., field savvy, local knowledge), but project investigators should be aware of the potential negative effects of experience on NGS sampling.


Introduction
Noninvasive genetic sampling (NGS) has been implemented with increased frequency in population ecology and conservation biology studies in recent years. NGS allows for the study of a wide range of taxa [1][2][3] without disturbing or influencing the behavior of the study species [3][4][5][6]. Without the use of NGS, invasive or lethal means are relied upon for the collection of genetic samples [7]. Recently, biologists have demonstrated a desire to move away from these invasive methods and are instead choosing to use non-invasive techniques when possible.
NGS has been used to study many species successfully, including the elusive gray wolf (Canis lupus). Collecting genetic information non-invasively using scat and hair has been effective for long-term population monitoring and detecting occupancy over a large spatial area [8][9][10]. Extracting and analyzing DNA from wolf scat and hair has been used as a method to determine population size [9,11,12], pack structure [9,12,13], and dispersal patterns [14,15] without the need for direct observation or capture and radiomarking. Collecting an adequate number of samples can be difficult for species at low densities, however, and may require the collection of hundreds of samples to ensure adequate detection rates [10,12]. Social species, such as the gray wolf, allow for the opportunity to collect many samples of multiple individuals in the same location. In summer, wolves use rendezvous sites, which are areas where members of a pack congregate and provision relatively stationary pups for several weeks until the pups are old enough to travel with the adults in the group [16]. Sampling efforts can be concentrated in and around these rendezvous sites, where presence is easiest to detect and scat samples representing each member of the pack can be found [8]. The successful collection of fecal samples relies on two main factors: 1) the observer's ability to correctly identify wolf sign, and 2) adequate sampling of feces to yield a consensus genotype.
While NGS has proven to be extremely useful, issues such as allelic dropout and observer bias continue to resurface, though the impacts of these recurring issues can be minimized over time with further examination [17]. Our study attempts to shed more light on one of the most common yet difficult issues in studies involving NGS: observer bias. Due to the high frequency of genotypic errors typically associated with NGS [18], minimizing potential sources of error such as observer bias is critical. Some studies have examined this in scat detection for group surveys, but fewer have compared successful data collection between individual observers [19]. Errors associated with fecal genetic sampling in the field include incorrect species identification [20], improper sampling technique, and poor tracking ability yielding low detection rates. Studies have documented that inexperienced participants can be a significant source of bias in the field, although their error rates decrease over time [21]. In a study involving multiple observers, it is important that error is acknowledged and assessed to maximize data accuracy and efficiency.
We evaluated NGS success rates for observers with varying experience levels to test how experience influenced the ability to collect viable DNA from fecal samples of a target species. We hypothesized that having previous experience on a project and familiarity with project protocols would positively affect an observer's ability to find and sample wolf scats effectively. Specifically, we predicted that experienced observers would find more wolf fecal samples, have higher DNA amplification rates of those samples, and be more likely to correctly distinguish wolf scats from those of other sympatric carnivores (i.e., coyotes, Canis latrans; mountain lions, Felis concolor; black bears, Ursus americanus) compared to new observers [22].

Study area
Observers collected scats in Yellowstone National Park (8,983 km²) during 2012-2014. Yellowstone has average temperatures of 5-27 °C and average precipitation between 33-53 mm during the summer months. Yellowstone National Park is dominated by lodgepole pine (Pinus contorta) forests and expansive meadow systems.
Observers collected samples in the Canadian Rocky Mountains of southwest Alberta, Canada from 2012-2014. We surveyed sites within an area of 12,950 km² between the United States border and Highway 1 and west of Highways 6 and 22, including an area known as Porcupine Hills. Some grazing lease lands could not be surveyed because we were denied access by leaseholders. The western portion of the southwestern Alberta study area was mountainous and dominated by Douglas fir (Pseudotsuga menziesii), lodgepole pine, and spruce mixed forests that transitioned through aspen (Populus tremuloides) to agricultural land and fescue (Festuca spp.) dominated grassland in the east [23].

Field methods
We employed paid observers and considered observers who had not been previously employed on the project as "new" and those who had familiarity with the sampling protocols and had worked on the project in previous years as "experienced". All observers were enrolled in, or recent graduates of, a natural resources program at a university. All observers had prior wildlife-related field experience, although not necessarily using NGS. We trained all observers in wolf sign identification and field navigation for three days prior to sampling. One day included presentations showing pictures of various forms of wolf sign (i.e., tracks, scats, trails), measuring exercises of wolf tracks using plaster molds, and observation of various wolf scat conditions in the lab. We spent two additional days in the field showing observers wolf and coyote as well as other sympatric carnivore tracks and scat. Each observer sampled wolf scats under the supervision of project staff during training.
Observers conducted surveys for wolves at predicted rendezvous sites between mid-June and late August. To target survey efforts, we used a habitat model emphasizing wet meadows and areas with standing water as being consistent locations for rendezvous sites [10]. Observers surveyed sites at dawn and dusk, when wolves are active and more likely to respond to human-mimicked howls [16]. Each survey consisted of driving and walking roads and hiking off trail looking for tracks and scat in and around predicted rendezvous sites. Upon reaching a site, observers used howling to detect the immediate presence of wolves [24]. If there was no howl response from wolves, two observers separated for 30-45 minutes attempting to locate a rendezvous site or wolf sign. If there was a howl response, observers estimated a count of individual wolves and all observers on the crew (up to 6) attempted to locate the activity center, where the majority of sign is concentrated [24]. We made no attempt to sample sites according to experience level (i.e., experienced observers were not assigned more difficult areas to collect scats). We defined scats collected by a single observer outside of an active rendezvous site (>500 m from the activity center) as "incidental".
Observers collected canid scats >2.5 cm in diameter and labeled them as "wolf" in the field [25]. Observers only collected scats believed to have been deposited after the most recent winter, to avoid collecting samples left by individuals that may have been harvested during the winter. For each scat, observers removed a small portion of the outer layer of fecal material from the side of the scat with sterilized tweezers [8] and stored it in a 2-ml vial containing DETs buffer [26].
During 2011-2014, wolves in Idaho and Yellowstone National Park were fitted with GPS collars, which allowed crews to hike directly into active rendezvous sites, leaving little room for the collection of scats from non-target species. We did not include any scat samples collected in Idaho and Yellowstone National Park during these years in our analyses of species identification success rates.
Fieldwork was conducted under the University of Montana's Institutional Animal Care and Use Committee (IACUC) protocols IACUC 008-09MMMCWRU-031009 and 001-15MMMCWRU-011315. Four wolves were collared under the IACUC protocol numbers listed above. Additional wolves were collared as part of state and tribal wildlife agency monitoring programs (see [27] for additional details), following recommended American Society of Mammalogists guidelines, and the subsequent locations of radio-collared wolves were shared with the authors. Our survey locations were on public land (Canadian Crown land and United States Forest Service land).

Lab and statistical analyses
We sent collected genetic samples to the Laboratory for Ecological, Evolutionary and Conservation Genetics (LEECG) at the University of Idaho, USA, for DNA extraction and amplification. The LEECG extracted DNA from scat samples using Qiagen kits (Qiagen Inc., Valencia, CA) and included a negative control to test for contamination [8]. They first screened all samples with a species-identification test (SpID) using a 3-primer mitochondrial DNA (mtDNA) control region polymerase chain reaction (PCR) [28][29][30][31] to remove non-target species (e.g. coyote) and low-quality samples [14]. They attempted to genotype all remaining samples identified as wolf/dog using 9 nuclear DNA microsatellite loci as described in Stansbury et al. (2014) and analyzed an additional 10 loci on samples matching at all but one locus. For genotyping, LEECG initially amplified all samples twice, and required successful amplification of alleles at ≥5 loci for the sample to continue for an additional 1-3 replications; samples that amplified at <5 loci were discarded. For each locus they required ≥2 independent PCR amplifications for consensus of a heterozygote and ≥3 independent PCR amplifications for consensus of a homozygote. LEECG included a negative control in all PCRs to test for contamination. Given the potentially large number of first-degree relatives in our dataset, they required ≥7 loci to consider a sample successfully genotyped and ≥8 loci (P(ID)sibs ranging from 0.0004 to 0.0012 across study areas) [32] to confirm multiple detections of the same individual. LEECG compared all consensus genotypes and all unique genotypes of previously identified individuals using GENALEX [33] to match samples and distinguish unique genotypes.
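The consensus rules described above can be illustrated with a minimal Python sketch. It is not the LEECG pipeline: the function names, the data layout, and the reading of the rule (an allele counts toward a heterozygote once it appears in at least 2 independent PCRs, and a homozygote needs at least 3 homozygous PCRs) are assumptions made for illustration only.

```python
from collections import Counter

MIN_HET_REPS = 2  # PCRs in which each allele of a heterozygote must appear
MIN_HOM_REPS = 3  # homozygous PCRs needed to accept a homozygote
MIN_LOCI = 7      # loci with a consensus needed to call a sample genotyped


def consensus_at_locus(replicates):
    """Return a consensus allele pair for one locus, or None.

    replicates: list of (allele_a, allele_b) tuples, one per successful PCR
    (hypothetical layout, assumed for this sketch).
    """
    allele_counts = Counter()
    hom_counts = Counter()
    for a, b in replicates:
        for allele in {a, b}:  # count each allele once per PCR
            allele_counts[allele] += 1
        if a == b:
            hom_counts[a] += 1
    supported = sorted(al for al, n in allele_counts.items() if n >= MIN_HET_REPS)
    if len(supported) == 2:
        return (supported[0], supported[1])
    if len(supported) == 1 and hom_counts[supported[0]] >= MIN_HOM_REPS:
        return (supported[0], supported[0])
    return None  # too little support, or more than two alleles supported


def genotype_sample(reps_by_locus):
    """Return per-locus consensus calls and whether >= MIN_LOCI loci were called."""
    calls = {locus: consensus_at_locus(reps) for locus, reps in reps_by_locus.items()}
    n_called = sum(call is not None for call in calls.values())
    return calls, n_called >= MIN_LOCI
```

Under this reading, two homozygous replicates alone do not yield a call, which reflects the stricter homozygote threshold guarding against allelic dropout.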
They performed a second species confirmation analysis on all unique genotypes in program STRUCTURE v2.3.3 [34] at K = 3 under the general admixture model, with a burn-in of 100,000, and 500,000 Markov Chain Monte Carlo (MCMC) repetitions and 10 iterations to estimate individual ancestry and remove highly probable dogs or coyotes from the dataset. To avoid overestimation and account for undetected genotyping errors, they grouped samples mismatching by allelic dropout at only one locus (e.g. 102, 102 vs. 102, 106) as a single individual [35]. We used RELIOTYPE [36] to test the accuracy of unique genotypes represented by only one noninvasive sample (i.e. single detections) by ensuring the genotype attained a 95% accuracy threshold.
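The grouping rule for single-locus dropout mismatches can be sketched as follows. This is a hypothetical illustration, not the matching procedure in GENALEX: the function names and the dictionary layout are assumptions, and the requirement of at least 8 shared loci borrows the threshold stated above for confirming repeat detections.

```python
MIN_SHARED_LOCI = 8  # loci required to confirm detections of the same individual


def is_dropout_mismatch(call1, call2):
    """True if one call is a homozygote whose allele occurs in the other,
    heterozygous call (e.g. 102, 102 vs. 102, 106)."""
    hom, het = sorted((call1, call2), key=lambda call: len(set(call)))
    return len(set(hom)) == 1 and len(set(het)) == 2 and hom[0] in het


def same_individual(geno1, geno2):
    """Compare two genotypes given as {locus: (allele_a, allele_b)} dicts.

    Genotypes are grouped if they are identical at all shared loci, or differ
    at exactly one locus in a way consistent with allelic dropout.
    """
    shared = geno1.keys() & geno2.keys()
    if len(shared) < MIN_SHARED_LOCI:
        return False
    mismatches = [loc for loc in shared if sorted(geno1[loc]) != sorted(geno2[loc])]
    if not mismatches:
        return True
    return len(mismatches) == 1 and is_dropout_mismatch(
        geno1[mismatches[0]], geno2[mismatches[0]]
    )
```

A mismatch between two different heterozygotes (e.g. 102, 106 vs. 102, 108) cannot be explained by dropout alone, so the sketch keeps such samples apart.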
We calculated successful DNA amplification frequency using the samples collected within active rendezvous sites, where scats of non-target species are unlikely. Scats collected outside of active rendezvous sites (incidental samples), which had the potential to be from non-target species, were used to calculate correct (i.e., wolf) species identification frequency. Samples that were collected but failed to amplify usable DNA, and therefore could not be identified to species, were labeled as "failures".
We used a t-test to assess whether new and experienced observers differed in 1) the number of scat samples collected, and 2) the number of scat samples collected at active rendezvous sites that successfully amplified (i.e., yielded a consensus genotype). For incidental samples, correctly identifying target species scat is a binary outcome. We had adequate sample size to use a mixed effects logistic regression model (GLMM) to test for differences in the probability of correct species identification as a function of experience level. We treated year and observer name as random effects and used the "lme4" package in Program R. We considered differences significant when p<0.05.
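As an illustration of the two-group comparison, the sketch below computes a t statistic and degrees of freedom in Python. The text does not state which form of the t-test was used; Welch's unequal-variance form is assumed here, and the per-observer counts are invented placeholder values, not data from this study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of the mean, group x
    vy = variance(y) / ny  # squared standard error of the mean, group y
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Invented example: scats collected per observer-year in each group.
new_counts = [92, 110, 85, 101, 97]
experienced_counts = [80, 74, 95, 88, 69]
t_stat, dof = welch_t(new_counts, experienced_counts)
```

The p-value would then come from a t distribution with `dof` degrees of freedom, for instance through a statistics library; that step is omitted to keep the sketch dependency-free.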

Results
Thirty-four individuals participated in the collection of samples. Sixteen individuals participated in multiple years, beginning as "new" observers and becoming "experienced" observers in subsequent years, ultimately resulting in 58 observer/years (33 new and 25 experienced; 1 observer was the author and was defined as "experienced").
When sampling scats that had the potential to be of non-target species (i.e., incidentals), new and experienced observers did not differ in the total number of incidental scats collected (15.1, SD = 7.2 vs. 16.0, SD = 5.8). New observers, however, had a higher probability (0.66 vs 0.50) than experienced observers of correctly identifying the target species from incidental scats. This trend was statistically weaker than other differences we documented (β(experience) = -0.68, p = 0.08; Fig 2).
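The reported coefficient can be related to the reported probabilities with a short calculation, sketched here in Python. It assumes the coefficient is on the logit scale with new observers as the reference level, which is the usual convention for a logistic GLMM but is not stated explicitly. Fixed effects in a mixed model are conditional on the random effects, so only rough agreement with the raw probabilities should be expected.

```python
from math import exp, log


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return log(p / (1 - p))


beta_experience = -0.68  # reported coefficient for experienced observers
odds_ratio = exp(beta_experience)  # about 0.51

# Difference in log-odds implied by the reported probabilities (0.50 vs 0.66).
implied_beta = logit(0.50) - logit(0.66)  # about -0.66
```

An odds ratio near 0.51 means the odds of a correct identification for experienced observers were roughly half those for new observers, in line with the drop from 0.66 to 0.50.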

Discussion
Contrary to what we hypothesized, we found that previous experience did not yield better overall results during noninvasive genetic scat sampling. New observers collected more samples and identified the correct species slightly more often than experienced observers. Studies have found that more intensive training protocols decreased error rates of inexperienced participants [37]; however, our findings suggest that while new observers need rigorous training initially, experienced observers may also need additional attention during (re)training. We posit two explanations for why experienced observers performed more poorly than expected. First, when sampling incidental scats, experienced observers may not be as tentative as new observers and thus collect more "questionable" scats, which would result in more non-target samples collected and therefore poorer identification rates. Collecting questionable scats (i.e., those that may not be the target species) may be beneficial when surveying for rare or low-density species, because it could decrease the chance an animal goes undetected. Alternatively, collecting scats of non-target species can waste time and resources, potentially impacting the efficiency of the study. Generally, for our study, sampling non-target species was undesirable. Second, experienced observers may simply have been fatigued or bored and thus performed more poorly than new observers. Boredom and fatigue have been known to influence observer performance in wildlife surveys [38], but were not examined in this particular study.
Incidental samples represented 24% of all samples collected but provided critical information about observer effectiveness. Species identification skills, as well as the ability to detect tracks and scats on a road or trail, are crucial during noninvasive sample collection. Experienced observers did not perform better than new observers in this regard, which suggests that while tracking skills and knowledge of the target species' habits are important, the temperament of human observers also plays an important role in sampling success [19,38]. The majority (76%) of our fecal samples were collected at active wolf rendezvous sites, where accurate species identification of samples is much less critical, as nearly all scats present are wolf. However, such samples are ideal for examining an observer's collection technique, as the samples are usually in good condition for DNA collection (recently deposited) and abundant.
The effects of individual variation in fatigue, boredom, visual acuity, and experience on sample success are difficult to evaluate [39]. These factors have the potential to bias results, especially when the same individual observers are used over time. The relationship between experience level and both amplification rates and species identification of incidental scats should be considered when implementing noninvasive DNA collection techniques.
DNA amplification rates did not differ between the two observer groups, which suggests project managers may want to emphasize other aspects of noninvasive fecal DNA projects (e.g. tracking ability) during hiring and training of new individuals. However, the large range in individuals' amplification rates (0.29 to 0.89) and correct species identification rates (0.20 to 0.90) that we observed in our study suggests that biologists should consider the variability in success when involving multiple observers in fecal DNA collection. Jenkins and Manly (2008) also found a potential for high variation in detection rates in fecal surveys. The fact that such differences are, in part, reflective of an observer's experience level suggests that projects using NGS would benefit by working to limit observer bias.
We used genetics to ensure that our target species was correctly identified. Some projects, however, may not have resources for genetic analyses and thus rely solely on observer field identification of target species scat. If observer experience can negatively affect NGS survey success, studies relying on sign surveys and non-genetic approaches may discover ecological consequences associated with observer experience and error. For example, new and experienced observers collected equal numbers of incidental scats, but experienced observers misidentified the target species more often. If we had not verified species identification of incidental samples, sampling conducted by experienced observers would have yielded false positive detections and biased occupancy estimates high because of species misidentification. For rare or imperiled species, such errors can have grave consequences.
An intensive 2-day training period in the field was implemented in this study and could be replicated by others. We show that more seasons spent collecting fecal DNA did not appreciably improve sampling success; therefore, the benefits of spending resources to increase in-field training beyond 2 days may be marginal. We recommend carefully weighing observer experience against project goals (e.g., need to detect every individual) and assessing the potential for wasting resources identifying non-target species or, in contrast, missing individuals that may be present. Ultimately, an observer's contribution to a project relies on a combination of factors (e.g., field savvy, local knowledge), and project investigators should be aware that previous experience collecting fecal genetic samples does not ensure greater success.