Predicting provenance of forensic soil samples: soil DNA predicts habitat and environmental properties

Environmental DNA is increasingly applied in ecological studies, including forensic ecology where eDNA from soil can be used to pair samples or reveal sample provenance. We collected soil eDNA samples as part of a large national biodiversity research project across 130 sites in Denmark. We investigated the potential for soil eDNA in predicting provenance in terms of environmental conditions, habitat characteristics and geographic regions. We used linear regression for predicting environmental gradients of light, moisture, soil pH and nutrients (represented by Ellenberg Indicator Values, EIVs) and quadratic discriminant analysis (QDA) to predict habitat class and geographic region. We found high predictive power for environmental gradients (R2 > 0.73). The discriminatory power of QDA in predicting habitat characteristics varied from high accuracy in predicting certain forest types, less accurate prediction of heathland and poor accuracy for geographic region. We demonstrate the application of provenance prediction in forensic science by evaluating and discussing two mock crime scenes. Here, we supplement with plant species lists from annotated sequences. Where predictions of environmental gradients and habitat classes give an overall accurate description of a crime scene, care should be taken when interpreting annotated sequences, e.g. due to erroneous assignments in GenBank. The outlined approach clearly demonstrates that basic ecological information that can be extracted from soil eDNA, contributing to the range of potential applications of eDNA in forensic ecology.

Introduction stratified according to gradients in soil fertility, soil moisture and successional stage. We deliberately 88 excluded saline and fully aquatic habitats, but included temporarily inundated depressions as well as 89 wet mires and fens. The final set of 24 environmental strata consisted of the following six cultivated Soil eDNA metabarcoding 105 We collected soil from all sites and subjected it to metabarcoding through DNA extraction, PCR 106 amplification of genetic marker regions (DNA barcoding regions) and massive parallel sequencing 107 on the Illumina platform as described in Brunbjerg,Bruun (22). The soil sampling scheme included 108 the mixing of 81 soil cores from each site in an attempt to get a representative sample. For this study, 109 we used sequencing data from genes amplified with primers targeting eukaryotes, fungi, plants and 110 insects. For eukaryotes we amplified part of the 18S region with primers 18S_allshorts (23, 24) with a slight modification of the forward primer (TTTGTCTGGTTAATTCCG). For fungi we amplified 112 the ITS2 region with primers gITS7 (25) and ITS4 (26). For plants we amplified the ITS2 region with 113 S2F (27) and ITS4. For insects, we amplified part of the 16S region with primers Ins_F and Ins_R 114 (28).

115
The bioinformatic processing of the sequence data followed the strategy outlined in (22) based on 116 the identification of exact sequence variants (29) by using the DADA2 pipeline (30) including post-117 clustering curation with the LULU algorithm (31) for the fungi, plants and insects in order to obtain 118 reliable alpha diversity estimates for OTU data. OTUs that could not be assigned to the target group 119 were removed from the datasets. Taxonomic affiliation was assessed by using the most widely applied 120 name among the top hits (within 0.5%) from blastn against GenBank.  To obtain meaningful predictors from OTU diversity, we used the first four ordination axes 126 obtained from NMS-ordination (function metaMDS in package vegan, 32, default settings) of the 127 four OTU community datasets, using abundance data (number of sequences per OTU) with square fertility/nutrients, soil moisture, pH and light conditions (EIV N, M, R and L), which have been shown 135 to be very useful for describing the variation in this dataset (see Fig. 3 in 22). 136 We used linear models to evaluate the relationship between OTU composition and environmental 137 gradients (see Appendix S1). Model selection was based on AIC using the function stepAIC and 138 backwards selection in package MASS (37, see Table S1). Individual models were constructed for 139 each set of NMS axes based on the different primers and normality and heterogeneity were assessed 140 by visual inspection of qq-plots, residual plots and histograms of residuals and indicated no problems 141 with the final model. The best set of NMS axes was selected to predict EIV and 95% CI using leave-142 one-out cross validation. All analyses were performed in R-3.4.2 (38).

143
While most trained botanists and ecologists may be familiar with EIV, to most people they are 144 hard to interpret in terms of meaningful vegetation types of environmental conditions. We have 145 therefore aided interpretation by graphing the location of common vegetation types along the four 146 Ellenberg gradients used in this study (Fig 1).  Predicting binary habitat classes 152 Another aim was to describe habitat classes typical of Denmark from the OTU composition in soil 153 samples. We selected classes that are familiar to most people, easy to identify by non-ecologists and 154 possible to recognize from a distance or from publicly available maps and orthophotos with some 155 training ( Table 1). The habitat classes were recorded as binary variables for each site.  correctly predicted tends to be high solely because of a high percentage true negatives. Therefore, we 168 also evaluated model performance on the percentage of true positives.
We also investigated the possibility to predict geographic origin defined as a binary response 170 variable, i.e., mainland/islands and Atlantic/Continental biogeographic regions of Denmark, using 171 the above mentioned predictors.

172
The best model was then selected based on the highest percentage correctly predicted and when 173 that was very similar, we also evaluated the percentage true positives.

174
For the habitat classes that were characterized by one or a few species, we tested if adding the   presented to investigators in a meaningful way (Fig. 2). For mock crime scene 1, the predicted EIVs The geographic distribution of M. uniflora in Denmark is limited to the eastern parts (i.e., the 244 continental region, Table S1). The mock crime scene in this case was an old growth beech forest on 245 moder soil in Zealand, Eastern Denmark. Denmark extends c. 300 km from east to west. Recently, spectral analysis, another discipline of 281 forensic geoscience, was shown to successfully predict geographic origin at local scales in a cultural 282 landscape (11). It is possible that eDNA also will perform better at predicting local scale provenance 283 within a specific urban landscape, which may be characterized by novel communities and introduced 284 species (43).

285
Variation in eDNA predicted environmental gradients, i.e., EIVs, in linear models with R 2 >0.73. 286 We used EIVs for light conditions, soil moisture, nutrient conditions and soil pH (33). Averaging mosaics. Performance of our classification models tended to be best for habitats that are relatively 297 well defined and delimited along these gradients, e.g., high forest and forest. On the other hand, we 298 defined heathland broadly to include both wet and dry heathland. Similar communities can be found in mires, grasslands and plantations (see also the overlap in the ellipses in Figure 1)  should be used in the ecological inference to ensure validity, accuracy and reproducibility.  underrepresented. We know little about the seasonal variation in eDNA (but see e.g., 47) and variation 346 in eDNA with soil depth (see e.g., 48). Moreover, forensic soil samples can be minute and contaminated, dried, old and degraded, and we need to explore the model sensitivity and provenancing accuracy of such samples (see also 10).

349
Here we demonstrate a new application of eDNA and show that the generated results can be used