Validating metabarcoding-based biodiversity assessments with multi-species occupancy models: A case study using coastal marine eDNA

Environmental DNA (eDNA) metabarcoding is an increasingly popular method for rapid biodiversity assessment. As with any ecological survey, false negatives can arise during sampling and, if unaccounted for, lead to biased results and potentially misdiagnosed environmental assessments. We developed a multi-scale, multi-species occupancy model for the analysis of community biodiversity data resulting from eDNA metabarcoding; this model accounts for imperfect detection and additional sources of environmental and experimental variation. We present methods for model assessment and model comparison and demonstrate how these tools improve the inferential power of eDNA metabarcoding data using a case study in a coastal, marine environment. Using occupancy models to account for factors often overlooked in the analysis of eDNA metabarcoding data will dramatically improve ecological inference, sampling design, and methodologies, empowering practitioners with an approach to wield the high-resolution biodiversity data of next-generation sequencing platforms.


Introduction
Environmental DNA (eDNA) as a signal for diversity detection is rapidly advancing. In freshwater systems, in particular, eDNA is now used as a bioassessment tool in both single-species qPCR-based studies and in sequencing-based metabarcoding community assessments [1][2][3]. Approaches based on eDNA are also gaining traction in the marine environment [4,5]. Oceans are complex, highly diverse, and difficult to sample; therefore, identifying organisms from all trophic levels and taxonomic groups from a single survey method will greatly facilitate rapid, consistent biodiversity surveys [6]. eDNA metabarcoding provides a streamlined method of biodiversity assessment, generating high-resolution biodiversity data with time and effort savings during sample collection and analysis [7,8].
However, there are several levels of uncertainty associated with eDNA sampling for community assessments. The potential for false negatives during sampling, where a species present in the environment is not detected in surveys, can bias results [9]. False negatives can occur during field sampling and during lab processing. If imperfect detection is not accounted for, this could lead to biased estimates of species richness and individual species occupancy [10,11]. Accounting for false negatives will improve community-wide species occurrence estimates based on eDNA surveys and yield more robust ecological conclusions for making management decisions and informing sampling designs. Optimal sampling designs for eDNA metabarcoding studies are not well-established and differ from traditional ecological sampling methods in the cost and effort required for sample collection [12]. Additionally, there are several added variables that need to be accounted for in metabarcoding studies compared to traditional sampling approaches, such as sequencing depth and marker selection, which vary between studies and can affect metabarcoding results [5,13,14]. Sampling designs should be experimentally informed and optimized specifically for eDNA metabarcoding methods [15], yet this is seldom practiced, and these added sources of variation during sample processing are seldom considered in the same analysis as sampling design. Occupancy modelling is a powerful tool to account for the additional sources of variation associated with next-generation biomonitoring approaches, and it has been used to assess imperfect detection in terrestrial bioassessment [16][17][18]. These models include 2-levels: the probability that a species occurs at a site (occupancy; ψ) and the probability of detecting a species at a site (probability of detection; p). Recently, occupancy models have been adapted for single-species eDNA studies by adding an additional stochastic level, the probability of capture (θ). Occupancy refers to the probability of a species' DNA occurring at a site, the probability of capture (θ) refers to the probability of capturing a species' eDNA in a field sample, and the probability of detection refers to the probability of detecting a species in a PCR replicate [19,20]. The use of occupancy models in single-species eDNA studies is not ubiquitous, but it is increasing [21].
Occupancy modelling can also be applied to whole communities through multi-species occupancy models, which are commonly applied to traditional surveys in terrestrial systems [22,23], yet seldom used in the context of DNA metabarcoding (S1 Table). In the same way that single-species models were adapted for eDNA studies through the inclusion of an additional stochastic level, multi-species models can be adapted for metabarcoding by including this additional level. Modeling communities together in a single multi-species model can improve the accuracy and predictive ability of occupancy models compared to single-species models [24]. The application of multi-species, multi-scale occupancy models to metabarcoding data is rare, focusing on small-scale lab manipulations [25], and no studies have implemented this modelling approach to improve sampling designs in natural systems (but see [26] for a single species example). Incorporating these models routinely in metabarcoding analysis will improve ecological inferences and species richness estimates, as well as facilitate the development of robust sampling designs for a relatively new technique where little thought has been dedicated to developing de novo sampling designs distinct from traditional sampling methods. The inclusion of covariates in occupancy models at each process level extends the application of the model, enabling discrimination between sources of variation in sampling effort and environmental factors. However, making conclusions based on models with covariates requires methods of model assessment and selection for multi-species, multi-scale models.
Here, we demonstrate how multi-species occupancy modelling can be used for the analysis of community biodiversity data resulting from eDNA metabarcoding and highlight the potential of these models for both improving methodologies and sound ecological inference. We present methods for model assessment and model comparison adapted for multi-scale, multispecies occupancy models. Finally, we demonstrate how these tools can improve inferential power from eDNA metabarcoding results using a case study in a coastal, marine environment.

Model formulation
The multi-species, multi-scale occupancy model. We used a Bayesian modeling framework to develop a multi-species, hierarchical occupancy model with three stochastic levels: occupancy (ψ), probability of capture (θ), and probability of detection (p) (Fig 1). The occupancy process describes whether sampling sites are occupied or not by a given species' DNA. For eDNA sampling, there are often two levels of sampling replication within each site (e.g. [20,27]): biological replicates are samples collected from a single site in the field and technical replicates are repeated samples taken from a single biological replicate in the lab. The probability of capture refers to the probability that a species' DNA is collected in a sample, given that the species was present at the site. The probability of detection refers to the probability that a species was detected in a technical replicate, given that the species' DNA was collected in the sample. This model assumes no false positives occur in the data. While false positives may be a possibility in metabarcoding data [15], we used strict bioinformatic filtering to reduce this possibility (see Bioinformatics below). Further comments on false positives can be found in the Discussion.
This model can be fit to a dataset, y ijrk , which is a binary indicator of whether a species k (k = 1,2,. . .K) was detected (1) or not detected (0) in a technical replicate r (r = 1,2,. . .R) from a given sample j (j = 1,2,. . .J) at a given site i (i = 1,2,. . .I). The model consists of three coupled Bernoulli trials to describe a four-dimensional array of data y ijrk .
We model our observations of detection (y ijrk = 1) or non-detection (y ijrk = 0) of species k in replicate r in sample j at site i as a random variable having parameter p ijrk , which describes the probability of detection. The second random variable, w ijk , describes the capture (w ijk = 1) or non-capture (w ijk = 0) of species k in sample j at site i having parameter θ ijk (probability of capture). The third random variable, z ik , describes the occurrence (z ik = 1) or non-occurrence (z ik = 0) of species k at site i having parameter ψ ik (occupancy probability).
Model assessment and comparison. The goodness-of-fit of multi-species, multi-scale occupancy models can be assessed using Bayesian p-values of the deviance residuals. We adapted Bayesian p-value calculations from [28] for a multi-scale model (S1 File) to assess goodness-of-fit, where values close to 0.5 indicate a good fit and values >0.95 or <0.05 indicate a poor fit. We also adapted model selection and cross-validation calculations from [28] for multiscale, multi-species occupancy models to determine the best model. We calculated the Watanabe-Akaike information criterion (WAIC; [29]) and the conditional predictive ordinate criterion (CPO; [30]), and then evaluated the results of k-fold cross validation using the Brier score and the logarithmic score. Complete calculations for all model assessment and comparison methods can be found in S1 File.
Unknown species richness. An additional level can be added to the model described above for communities with unknown species richness [10]. This model uses data augmentation to estimate species richness for the sampling area through the inclusion of another

PLOS ONE
k is present in the metacommunity. Here, the random variable z ik describes the occurrence (z ik = 1) or non-occurrence (z ik = 0) of species k at site i having parameter ψ ik , which describes the occupancy probability for species present in the metacommunity. z ik � Bernoulliðc ik a k Þ Species may be present in the metacommunity but be unobserved during surveys due to imperfect capture and detection. Summing a k provides an estimate of species richness for the metacommunity, including observed and unobserved species.

Case study: Conception Bay, Newfoundland
Sample collection, processing and sequencing. Triplicate 250 mL water samples were collected from coastal surface water at eight sites along two transects in Conception Bay, Newfoundland and Labrador, Canada, on October 13-14, 2017 (see [5] for sampling details). No permits were required to collect samples since there are no regulations on collecting seawater. Water samples were filtered using 0.22 μm PVDF Sterivex filters (MilliporeSigma) and DNA was extracted from filter membranes using the DNeasy PowerWater Kit (Qiagen). Five target amplicons in the cytochrome c oxidase I (COI) region were amplified by PCR from each sample. Table 1 details the primer sets used to target these amplicons. Three PCR replicates were performed for each amplicon from each sample and then PCR replicates were pooled into a single PCR cleanup for each of the five amplicons with the QIAquick 96 PCR purification kit (Qiagen). Amplicons were then indexed using unique dual Nextera indexes (IDT). All amplicons were pooled into one library to normalize DNA concentration and the library was sequenced with a 300-cycle S4 kit on the NovaSeq 6000 following the NovaSeq XP workflow. Raw sequence reads are available in NCBI's sequence read archive under accession number PRJNA574050. Primers were trimmed from sequences and then DADA2 v1.8.015 [32] was used for quality filtering, joining paired end reads and denoising to produce exact sequence variants (ESVs). Taxonomy was assigned using NCBI's blastn tool v2.6.026 [33] to compare ESV sequences against the nt database. See [5] for detailed sampling, sequencing, and bioinformatic methodology.
Occupancy model implementation. Under the occupancy modelling framework described above, each collection site along each transect in Conception Bay was considered a different site in the occupancy model. Replicate bottles collected at a site were considered samples. Each of the five amplicons sequenced from each bottle was considered a technical replicate. While we conducted three replicate PCRs for each amplicon, PCR products were pooled for each amplicon prior to sequencing so we did not include PCR replicates for each amplicon separately in our models. However, PCR replicates can easily be accommodated in a multiscale, multi-species occupancy modelling framework, such as the model described here. https://doi.org/10.1371/journal.pone.0224119.t001

PLOS ONE
We included sequencing depth (number of reads per sample per amplicon) as a continuous covariate (α1) at the level of probability of detection. Additionally, we included amplicon identity as a categorical covariate (α2) at the level of probability of detection. We included water depth (m) as a continuous covariate (α3) at the level of occupancy. Covariates were included in the model as follows: Continuous covariates were z-score standardized to have a mean of zero and a standard deviation of one to help with model convergence. We compared a null model (Model 1) with no covariates with seven models with different combinations of covariates ( Table 2) using WAIC and CPO and using Brier and logarithmic scores for cross-validation. We assessed model fit using Bayesian p-values based on deviance residuals and by looking at diagnostic plots to examine model fit. We plotted deviance residuals for each species, site, and covariate. Species coefficients arise from additional community-level parameters: Community-level parameters were described by weakly informative hyperpriors [31]. All mean values for the above prior distributions were selected from a normal distribution and all standard deviations were selected from a uniform distribution.
Prior sensitivity was assessed by running the models with various prior parameterizations. Posterior distributions were similar across all priors.
All statistical analyses were conducted in R v3.5.1 [38]. MCMC sampling was achieved with JAGS [39], implemented using 'jagsUI' v1.5.0 [40]. The model was written for JAGS in the BUGS language (see S2 File for BUGS model structure). We fit models using known species richness to conduct our model comparisons, and assessed models and model fit to determine the best model. MCMC sampling was run in three chains, each with 50,000 iterations, a burn in of 10,000, and a thinning rate of 10. Convergence was verified using the Gelman-Rubin diagnostic [41] and by evaluating trace plots. For all models, we report parameter estimates as the mean of the posterior distribution with the 95% highest posterior density interval (HDI; [42]) calculated using 'HDInterval' v0.2.0 [43]. We conducted a data augmented model with unknown species richness for the best model at varying levels of augmentation to determine the minimal level of augmentation required, as described above in the Unknown Species Richness section. Significance of continuous covariates was assessed by determining if the 95% confidence intervals of parameter estimates overlapped with zero [31].
To investigate the effects of phylum on the probability of detection of each amplicon, we ran one additional model (Model 9), which included amplicon as a categorical covariate (α2) and group-level effects at the species-level following [31]. In this model, we only included metazoan phyla where at least 2 species were detected. Here, species coefficients arise from community-level parameters that vary by phylum:

Results
We ran eight multi-species, multi-scale occupancy models with different combinations of covariates (i.e., water depth at the level of occupancy, sequencing depth and amplicon at the level of detection probability) and assessed these models using model comparison and crossvalidation methods adapted for this multi-scale approach ( Table 2). Three of the model comparison methods (WAIC, CPO and one cross-validation score) were in agreement that Model 7 (ψ(water depth) θ(.) p(sequencing depth)) was the best model, while the Brier score from cross-validation suggested Model 1 (the null model) and Model 3 (ψ(.) θ(.) p(sequencing depth)) were the best models. We considered Model 7 our best model moving forward, given that most selection methods indicated this was the best model. We assessed model fit using Bayesian p-values and diagnostic plots for all models but present the results for the best model only. We obtained a Bayesian p-value of 0.51, suggesting that Model 7 (ψ(water depth) θ(.) p(sequencing depth)) provided a good fit to our data overall; diagnostic plots, however, revealed higher deviance at sites with lower water depth, suggesting a poorer model fit at shallower sites (see S3 File). The community-wide estimate for occupancy was 0.29 (HDI: 0.22-0.36). Water depth had a significant effect on the community mean occupancy (Fig 2A and 2B), and we detected considerably more species at the shallowest sites compared to the other sites (274 species at two shallow water sites combined compared to 109 species across all six deep water sites). The community-wide probability of capture was 0.96 (HDI: 0.92-0.99) and the community-wide probability of detection was 0.14 (HDI: 0.12-0.17). Sequencing depth did not have a significant effect on the probability of detection for most species in this case study (Fig 2C and 2D). Species-specific estimates of occupancy, capture probability, and detection probability were also obtained from the model (S2 Table).
We estimated species richness for the survey area by running the best model (Model 7) with data augmentation. This model used the probabilities of capture and detection to estimate the number of species missed in sampling efforts. We detected 231 species overall, and the estimated species richness for the survey area was 286 (HDI: 264-306), indicating that 55 (HDI: 33-75) species were undetected during our surveys. In other words, our survey detected~81% of the estimated species in our study area.
While it was not selected as our best model, we also ran Model 9 (ψ(.) θ(.) p(amplicon)) with a group-level effect (phylum) to investigate how the probability of each amplicon varied by phylum. Amplicons displayed relatively similar probabilities of detection at the community-level (Fig 3), however the probabilities of detection for each amplicon varied considerably when comparing between phyla, where some amplicons clearly failed to detect certain taxonomic groups (Fig 4).

Discussion
We applied a multi-species, multi-scale occupancy model to a DNA metabarcoding dataset generated from marine water samples and explored how the inclusion of categorial and continuous covariates at different levels improved model performance. The best model included sequencing depth as a covariate at the level of detection and water depth as a covariate at the level of occupancy, where we observed a higher species richness at shallower sites. One of the shallow water collection sites was within 1 km of a sewage outflow, which may have contributed to this result, although a high species richness was also observed at the second, shallow water site located >10 km from the sewage outflow. While sequencing depth was included in the best model, we did not observe a strong effect of sequencing depth. However, the samples were all sequenced on a NovaSeq instrument, which generates an unprecedented number of reads, yielding very high sequencing depths (mean number of filtered sequences per sample ± standard deviation: 8,519,055 ± 2,514,998) compared to many other barcoding studies (e.g. [44,45]). In studies where the mean sequencing depth is lower, differences in sequencing depth are likely to have greater effects [5,46]. In such cases, analyzing data using occupancy models that include sequencing depth as a covariate will allow the variation in sequencing effort, which cannot always be controlled, to be accounted for when making ecological conclusions about biodiversity and occupancy.
The mean probability of capture estimate of 0.98 suggests a high probability of collecting a species' DNA in a given sample. However, the mean detection probability was relatively low at 0.15, likely because many species were not detected consistently by multiple amplicons, and a

PLOS ONE
low probability of detection can lead to overestimates for higher level parameters, including probability of capture [47]. Additionally, many species were only detected in the shallow water sites and were detected consistently across biological replicates at these sites, further contributing to a high estimate for the community mean probability of capture.
Species-specific probabilities of detection varied by amplicon and by phylum. Since we did not have replication within each amplicon for each sample, the effects of amplicon are confounded with the effects of occasion (i.e. PCR stochasticity). However, the effects of amplicon on species detectability were consistent across samples and it is very unlikely that this pattern would be observed due to PCR stochasticity alone, which is a random process. Therefore, we assume the effects we observed at the level of detectability were driven by the different amplicons used. Since the performance of each amplicon varies by taxonomic group (this study; [13]), including a variety of amplicons is important to detect species across the tree of life, and increasing the number of technical replicates using a single amplicon will not necessarily improve the community-wide probability of detection. Information such as this can be used to guide primer selection for future metabarcoding studies to maximize the probability of detection for target taxa or to ensure broad taxonomic coverage for holistic biodiversity surveys.
We used the occupancy modeling framework to estimate the species richness for the survey area and determined that 53 species or approximately 19% of the estimated number of species present were undetected during our surveys. Similar to many ecological studies, the case study presented here included a relatively low spatial coverage (n = 8 sites), but our occupancy modelling approach allowed us to assess false absences at two different sampling levels in our study and thereby understand what portion of biodiversity was missed, which is a significant improvement from most metabarcoding surveys [11]. Understanding how biological replication and technical replication affect biodiversity estimates can inform future sampling designs to maximize biodiversity detection while minimizing cost and effort. In our case study, the sampling effort was limited, and thus there are several ways the proportion of species detected could be improved: (1) increasing sampling effort in the field by sampling more sites, (2) collecting more replicate biological samples at each site, and (3) including additional amplicons during laboratory processing. Given the limited extent and breadth of our sampling effort, the conclusions regarding the effect of covariates and the estimates of occupancy, capture, and detection probabilities for individual species should not be extrapolated to other systems. Further research should investigate the impacts of variation in sequencing depth and amplicons targeted on detection probability in metabarcoding studies, particularly in other ecosystems and across greater spatial scales.
Through the inclusion of environmental and experimental covariates, the multi-species occupancy framework can be applied for direct ecological assessment and to improve the methodology for next-generation biodiversity assessment. Metabarcoding results are often presented as taxonomic lists of species presence with alpha and beta-diversity estimates, and

PLOS ONE
sampling effort is often assessed using accumulation curves (e.g. [48]). However, these methods do not account for imperfect detection and cannot accommodate the many variables in the field and the lab that can impact these results. From an ecological perspective, environmental variables (e.g. temperature, salinity, turbidity) can be included at the level of occupancy to determine their effects on community diversity and the presence of individual species. From a methodological perspective, environmental and experimental variables (e.g. sample volume, sequencing depth) can be included at the level of field sampling and technical replication to understand how these factors affect metabarcoding results. Understanding the effects of these covariates facilitates the development of more robust experimental and survey designs. Furthermore, simulations using occupancy models can be used to optimize sampling effort, enabling practitioners to fine-tune the trade-off between field sampling and lab work [21]. The number of sites, biological samples, and technical replicates can all be optimized to maximize the species richness recovered from eDNA samples while minimizing effort. PCR level stochasticity, which is known to affect sequencing results [46,49], was not considered in our case study (i.e., PCR replicates were pooled before sequencing) but PCR replicates can easily be included as technical replicates in the model described here. PCR replicates are commonly included separately in single-species occupancy models for eDNA data [19,20,27]. By including PCR replicates as technical replicates, additional stochasticity in the sampling process can be accounted for, further improving inferences.
A key advantage of the occupancy modeling framework demonstrated here is its flexibility. Modifications to the model can allow several additional factors to be included, and a priori information can be used to guide model development. For example, multiple sampling periods have been included in dynamic, multi-season occupancy models to quantify temporal changes in community structure (e.g. [22]). Repeated eDNA sampling for metabarcoding could be modelled similarly to account for local extinction and colonization events between sampling periods. In addition to accounting for false negatives, several studies have developed methods for including false positives in occupancy models [50][51][52]. False positives may potentially arise from metabarcoding data through sequencing errors, PCR errors, and poor reference database coverage or quality [15,53,54]. Strict bioinformatic filtering helps to minimize the inclusion of these errors in resulting data sets; however, the possibility of false positives cannot be eliminated. Our model did not consider false positives, and, to our knowledge, these have yet to be incorporated into multi-species occupancy models. Following current protocols, abundance estimates from metabarcoding data are not reliable [55,56], but occupancy models can provide a means to estimate trends in abundance from the presence-absence data generated by metabarcoding based on documented relationships between occupancy and abundance [57][58][59]. The hierarchical modeling framework used in occupancy modeling can also be adapted to include or estimate taxa abundances [31]. As more research is done to understand the relationship between sequence read counts and species biomass, read counts could potentially be used as taxon abundances in a hierarchical modelling framework to estimate species biomass.
We demonstrate for the first time how a multi-scale, multi-species occupancy modelling framework can be used in a natural system to account for imperfect detection and allow for critical assessment of experimental and environmental factors influencing biodiversity data from eDNA metabarcoding. Despite the utility of these models for improving detection and targeting areas of variation in the pipeline from sample collection to sample processing, this approach has been underutilized in DNA metabarcoding studies (S1 Table; but see [25]). This multi-species occupancy modelling framework will be particularly useful for bioassessment studies using DNA metabarcoding because it will improve estimates of occupancy and species richness, aid in optimizing sampling efforts in the field and lab, and, using the model assessment methods described here, identify ecological and environmental factors affecting occupancy, capture, and detection probabilities. Given the high stakes for documenting and understanding biodiversity that is under increasing anthropogenic threat [60] and declining [61] globally, new tools are imperative for rapid bioassessment [7,62,63]; yet, like any emergent technology, there is the potential to misuse these tools [64], which can have unforeseen consequences (e.g. [65]). In the case of DNA metabarcoding, neglecting to assess imperfect detection at key points along the sample collection and processing pipeline could lead to failure to detect species of interest, biased estimates of species richness, and miscalculations of species distributions, all of which have consequences for conservation and management [24, 66,67]. We recommend incorporating multi-scale, multi-species occupancy modeling into the design and analysis of future metabarcoding studies.
Supporting information S1 Table. Literature review of occupancy modeling for metabarcoding data. List of studies (n = 5) that have examined occupancy models in the context of DNA metabarcoding. To obtain this list of publications, we performed the following systematic, Boolean search using the Web of Science: " � DNA" AND "metabarcoding" AND "occupancy model � ". (DOCX) S2 Table. Species-specific parameter estimates from multi-species, multi-scale occupancy model. Species-specific estimates of occupancy, capture and detection using mean covariates values (water depth (m) and sequencing depth) from a multi-scale, multi-species occupancy model for eDNA metabarcoding data from Conception Bay, Newfoundland.