Metabarcoding lake sediments: taphonomy and representation of contemporary vegetation in environmental DNA (eDNA) records

Metabarcoding of lake sediments may reveal current and past biodiversity, but little is known about the degree to which taxa growing in the vegetation are represented in environmental DNA (eDNA) records. We analysed composition of lake and catchment vegetation and vascular plant eDNA at 11 lakes in northern Norway. Out of 489 records of taxa growing within 2 m from the lake shore, 17-49% (mean 31%) of the identifiable taxa recorded were detected with eDNA. Of the 217 eDNA records, 73% and 12% matched taxa recorded in vegetation surveys within 2 m and up to about 50 m away from the lakeshore, respectively, whereas 16% were not recorded in the vegetation surveys of the same lake. The latter include taxa likely overlooked in the vegetation surveys or growing outside the survey area. The percentages detected were 61, 47, 25, and 15 for dominant, common, scattered, and rare taxa, respectively. Similar numbers for aquatic plants were 88, 88, 33 and 62%, respectively. Detection rate and taxonomic resolution varied among plant families and functional groups with good detection of e.g. Ericaceae, Roseaceae, deciduous trees, ferns, club mosses and aquatics. The representation of terrestrial taxa in eDNA depends on both their distance from the sampling site and their abundance and is sufficient for recording vegetation types. For aquatic vegetation, eDNA may be comparable with, or even superior to, in-lake vegetation surveys and therefore be used as an tool for biomonitoring. For reconstruction of terrestrial vegetation, technical improvements and more intensive sampling is needed to detect a higher proportion of rare taxa although DNA of some taxa may never reach the lake sediments due to taphonomical constrains. Nevertheless, eDNA performs similar to conventional methods of pollen and macrofossil analyses and may therefore be an important tool for reconstruction of past vegetation.


Introduction
All lakes are in northern Norway. Water depth given for sampling site in the centre of the lake; deepest point in brackets if different. "Summer" is May-September, "Alt." is altitude, "prec." is precipitation, and "N lat." and "E. lat" are northern and eastern latitude, respectively. Mixed forest is forest dominated by birch but with some Pine.

DNA extraction and amplification
For each lake, we analysed the top 0-2 cm of sediment separately from two of the three core 150 tubes (n=22). Twenty extra samples from lower in the cores were also analysed. The main down-core 151 results will be presented in a separate paper in which we compare eDNA records with the pollen 152 analyses by [49]. Taxa that were only identified from lower levels in the cores are noted in S1 Table. 153 Samples were thawed in a refrigerator over 24-48 hours, and 4-10 g were subsampled for DNA. The  Table). We then used ecotag program [57] to assign the sequences to taxa by comparing them 170 against a local taxonomic reference library containing 2445 sequences of 815 arctic [34] and 835 171 boreal [36] vascular plant taxa; the library also contained 455 bryophytes [44]. We also made 172 comparisons with a second reference library generated after running ecopcr on the global EMBL were kept. We excluded sequences matching bryophytes as we did not include them in the vegetation 175 surveys. We used BLAST (Basic Local Alignment Search Tool) (http://www.ncbi.nlm.nih.gov/blast/) 176 to check for potential wrong assignments of sequences. 177 178 When filtering next-generation sequencing data, there is a trade-off between losing true 179 positives (TP, sequences present in the samples and correctly identified) and retaining false positives 180 (FP, sequences that originate from contamination, PCR or sequencing artefacts, or wrong match to 181 database) [17,20,58]. We therefore assessed the number of TP and FP when applying different last 182 step filtering criteria. We initially used two spatial levels of comparison with the DNA results: i) data 183 from our vegetation surveys and ii) the regional flora (i.e., species in the county of Nordland and 184 Troms as listed by the Norwegian Bioinformation Centre (http://www.biodiversity.no/). For any lake, 185 both datasets are likely incomplete, as inconspicuous species may be lacking in the regional records 186 [59] and our vegetation surveys did not include the entire catchment area. Nevertheless, the exercise is 187 useful for evaluating how many FPs and TPs are lost by applying different filtering criteria. We 188 defined true positives as sequences that matched a species recorded in the vegetation surveys at the 189 same lake, being aware that this is an under-representation, as the vegetation surveys likely missed 190 species. We defined false positives as species recorded neither in the vegetation surveys nor the 191 regional flora. We tested the effect of different rules of sequence removal: 1) found as ≤1,≤5 or ≤10 192 reads in a PCR repeat, 2) found as ≤1,≤2 or ≤3 PCR repeats for a lake sample, 3) occurring in more 193 than one of 72 negative control PCR replicates, 4) on average, higher number of PCR repeats in 194 negative controls than in sample, and 5) on average a higher number of reads in negative controls than 195 in samples (S2 Table). The filtering criteria resulting in overall highest number of true positives kept 196 compared to false positives lost were applied to all lakes. These were removing sequences with less 197 than 10 reads, less than 2 PCR repeats in lake samples, and on average a lower number of reads in lake 198 samples than in negative controls. 199

Data analyses and statistics 201
After data filtering, we compared taxon assemblages from DNA amplifications with the taxa 202 recorded in the vegetation surveys. To make this comparison, taxa in the vegetation surveys were 203 lumped according to the taxonomic resolution of the P6 loop (S1 Table), and the comparison was done 204 at the lowest resultant taxonomic level. The majority of results explore only present/absence (taxa 205 richness); quantitative data are given in tables (including Supporting Information). positive as pDNA_0, the detectability by DNA as pDNA_1, the detectability in the vegetation survey as 216 pVEG_1, and the probability that a species is present as pOCC, we can state that the four probabilities of 217 12 228 3. Prob(DNA=1,Vegetation=0) = (1-pOCC) pDNA_0 + pOCC pDNA_1  In this case, the species is either absent and is a false DNA positive, or is present, detected by DNA 230 but not in the vegetation survey. 231 In this case, the species is present and is detected both in the DNA and the vegetation survey. 234

235
We assumed the four probabilities varied only among lakes, not among species. We also 236 restricted the analyses to species that were detected at least once using DNA, because for species that 237 were never detected using eDNA, different processes might be important. For pDNA_1 , we also 238 considered a model assuming a logistic relationship between pDNA_1 and lake characteristics, such as 239 lake depth or catchment area, that is: logit(pDNA_1) = b0 + b1 Lake Covariate. We fitted these models Eight taxa of Equisetum were filtered out due to short sequence length. This left 171 taxa that could 254 potentially be recognized by the technique we used (S1 Table). For the 11 sites, between 31 and 58 255 taxa were potentially identifiable (Table 2), and this value was positively correlated with vegetation 256 species richness (y=0.67x+10.3, r 2 =0.93, p<0.0001, n=11). Taxonomic resolution at species level was 257 77-93% (mean 88) and 65-79 (mean 74%) for the <2 m and extended (i.e., combined) vegetation 258 surveys, respectively. 259 Taxa in the vegetation surveys (Veg.), number of taxa that could potentially be identified with the applied molecular marker used and available reference 261 database, and taxa actually identified in the eDNA. The results are given for vegetation surveys <2 m from lakeshore (including aquatics) and for additional 262 taxa recorded in extended surveys. Raw reads refer to all reads assigned to samples (S1 Table). The ratio between the highest and lowest value on each 263 category is given as a indicator of variation among lakes 264 Of 489 records <2 m from the lakeshore, the majority were rare (148) or scattered (146)  The numbers of sequences matching entries in the regional arctic-boreal and EMBL-r117 270 databases were 227 and 573 at 98% identity, respectively. For sequences matching both databases, we 271 retained the arctic-boreal identification; this resulted in 11,236,288 reads of 301 sequences matched at 272 100% similarity with at least 10 reads in total (S2 Table). There were 244 and 181 records of taxa that 273 with certainty could be defined as true or false positive, respectively (see methods). We found no 274 combination of filtering criteria that only filtered out the false positives without any loss of true 275 positives (S3 Table). The best ratio was obtained when retaining sequences that were on average more 276 common in samples than in negative controls, plus with at least two replicates in one sample and at 277 least 10 reads per replicate. Applying these criteria filtered out 163 false positives leaving only 18 278 records of three false positive taxa (Annonaceae, Meliaceae and Solanaceae), which were then 279 removed as obvious contamination. However, it also removed 61 (25%) true positives, e.g., Pinus, 280 which had high read numbers at lakes in pine forest and low ones at lakes where it is probably brought 281 in as firewood, but which also occurred with high read numbers in two of the negative controls (S4 282   Table). After matching against the local vegetation, 2,500,138 reads of 56 sequences remained. 283 Sequences representing the same taxa were merged, resulting in 47 final taxa (Table 3). Taking into 284 account some taxa shared sequences, e.g. Carex and Salix, these may potentially represent 81 taxa (S1 285   Table). 286 are likely to represent false positives. For taxa only recorded in vegetation and/or filtered out of the 292 eDNA records, see S1 Table. The lakes names 300   Table). 301

302
The gain in number of taxa when analysing two cores instead of one was 2.5±1.2 per lake. All 303 data presented below are based on the upper 0-2 cm of sediment of two cores combined (but not from 304 deeper levels as these were not sampled at all sites). This gave an average of 19.7±6.9 taxa (10-30) per 305 lake (Table 2). Samples from below 2-cm depth provide an additional 14 records of 42 taxa, some not 306 recorded in 0-2 cm samples (S1 Table). were also correctly identified at most lakes although often at genus level. In contrast, Poaceae and 334 Cyperaceae, which were common to dominant around most lakes, were underrepresented in the DNA 335 records. Juncaceae and Asteraceae, which were present at all lakes, although mainly scattered or rare, 336 were mainly filtered out due to presence in only one PCR repeat or only in samples from 2-8 cm depth 337 (S1-S4 Tables). 338

339
The numbers of taxa recorded in vegetation, in eDNA, and as match between them varied two-340 to six-fold among lakes (Table 2, Fig 2d). Jula Jávri had the lowest match between eDNA and 341 vegetation with only four taxa in common. Lakselvhøgda and Lauvås had extremely low read numbers 342 after filtering. For Lauvås, Finnvatnet and Lakselvhøgda, 84%, 30% and 20%, respectively, of raw 343 reads were allocated to algae. If we assume that a big unidentified sequence cluster also represents algae, this increases to 69% for Lakselvhøgda, where a 15-20 cm algal layer was observed across most 345 of the lake bottom. A lake-bottom algal layer was also observed at Jula Jávri, and in this we suspect 346 that an unidentified cluster of 170,772 reads was algae. In most other lakes, algal reads were 3-15% 347 (0.2% in Brennskogtjern, the lake with highest numbers of reads after filtering; algal data not shown). 348 349 Thirty-three records of 17 DNA taxa did not match vegetation taxa at a given lake (Table 3). 350 These include taxa that are easily overlooked in vegetation surveys due to minute size (e.g., Sagina 351 sp.), or only growing in deeper parts of the lake (e.g., Potamogeton praelongus). Other taxa are 352 probably confined to ridge-tops of larger catchments, which lay outside the survey areas (e.g., 353 Cassiope tetragona and Dryas octopetala). Two tree species that occur as shrubs or dwarf shrubs at 354 their altitudinal limits, Alnus incana and Populus tremula, were found in the DNA at high-elevation 355 sites. Also, ferns were detected at several sites where they were not observed in the vegetation surveys. 356 On balance, most mismatches probably relate to plants being overlooked in the vegetation surveys or 357 growing outside the survey area, whereas Chamaedaphne calyculata likely represents a false positive 358 (Table 3, S1 Appendix). 359

360
The multivariate ordinations gave similar results for the vegetation and eDNA records with the 361 only lake from Pine forest, Brennskogtjønna, and one of the two alpine lakes, Paulan Jávri, clearly 362 distinguished on the first axis, whereas the lakes with varying cover of birch forest were in one cluster 363 (Fig 3a-b). The other alpine lake, Jula Jávri, was only distinguished on the vegetation, probably due to 364 the low number of taxa identified in the eDNA of this lake ( Table 2) Table). Thus, on average, about three species may have been overlooked at each 379 lake. The posterior probability that taxa recorded in the vegetation surveys and detected at least once 380 by eDNA were also recorded in the DNA in a given lake (true positives) was 0.33-0.90, whereas the 381 posterior probability of any DNA records representing a false positive varied from 0.06-0.33 per lake 382 (S6 Table). There was evidence that the probability of detecting a species using eDNA (pDNA_1) was 383  Taking into account the limitation of taxonomic resolution due to sequence sharing or taxa 395 missing in the reference library, we were able to detect about one third of the taxa growing in the 396 immediate vicinity of the lake using only two small sediment samples from the lake centre. The large number of true positives lost (S1 Appendix) suggests that this proportion may be further improved. 398 Nevertheless, the current approach was sufficient to distinguish the main vegetation types. 399 400

401
The high proportion of taxa in the <2 m survey detected with eDNA than in the extended 402 surveys indicates that eDNA is mainly locally deposited. The observation of taxa not recorded in the 403 vegetation surveys but common in the region (Fig 3, S1 Table) indicates that some DNA does 404 originate from some hundreds of meters or even a few km distant. Indeed, a higher correlation 405 between catchment relief and total eDNA (R 2 =0.42) than eDNA matching records in the vegetation 406 (R 2 =0.34), may suggest that runoff water from snow melt or material blown in also contributes. Thus, 407 the taphonomy of eDNA may be similar to that of macrofossils [66, 67], except that eDNA may also 408 be transported via non-biological particles (e.g. fine mineral grains). From other studies, pollen does 409 not appear to contribute much to local eDNA records [15,35,37,42,47]. This is probably due to its 410 generally low biomass compared with stems, roots and leaves, and to the resilience of the 411 sporopollenin coat, which requires a separate lysis step in extraction of DNA [68]. 412

413
The higher proportion of eDNA taxa that matched common or dominant taxa in the 414 vegetation, compared with taxa that were rare or scattered, was as expected, as higher biomass should 415 be related to a greater chance for deposition and preservation in the lake sediments [9]. Yoccoz et al. 416 [6] found the same in their comparison of soil eDNA with standing vegetation. While some dominant 417 taxa were filtered out in our study, their DNA was mainly present (S1 Appendix, S1-4 Tables), and 418 most dominant taxa were recorded in all PCR replicates (not shown). Thus, for studies where the focus 419 is on detecting dominant taxa, running costs may be reduced by performing fewer PCR replicates.

Variation among lakes 422
The variation among lakes seen in DNA-based detection of taxa shows that even when 423 identical laboratory procedures are followed, the ability to detect taxa can vary. Our sample size of 11 424 lakes does not allow a full evaluation of the reasons for this variation. Factors such as low pH or 425 higher temperature may increase DNA degradation [16], but the two lakes with lowest numbers of 426 reads after filtering in our study, Lakselvhøgda and Lauvås, had pH values close to optimal for DNA 427 preservation (7.2 and 6.8, respectively, I.G. Alsos and A.G. Brown, pers. obs. 2016), and variation in 428 temperature was low among our sites. The lack of an inflowing stream at Lakselvhøgda may reduce 429 the supply of eDNA, but Lauvås has two inflows. For these two lakes, and to a lesser extent 430 Finnvatnet, we suspect high algal abundance might have caused PCR competition [69]. PCR 431 competition may also occurred in samples from Jula Jávri, but in this case we were not able to identify 432 the most dominant cluster of sequences. These lakes are also small and shallow. Variation among 433 eDNA qualities has also been observed in a study of 31 lakes on Taymyr Peninsula in Siberia [70]. We 434 suspect that high algae production may be a limiting factor as we also have seen poor aDNA results in 435 samples with high Loss on Ignition values, but this should be studied further. A potential solution to 436 avoid solution to avoid PCR competition may be to design a primer to block amplification of algae as 437 has been done for human DNA in studies of mammals eDNA [71]. which has no primer mismatch, is regularly detected in ancient DNA studies [15,[36][37][38]41], and was 449 present in nine lakes, although most records were filtered out due to occurrence in negative controls. 450 To avoid any bias due to primer match and potentially increase the overall detection of taxa, one 451 solution would be to use family-specific primers, such as ITS primers developed for Cyperaceae, 452 Poaceae, and Asteraceae [36]. Alternatively, shotgun sequencing could be tested as this minimizes 453 PCR biases [76,77]. herbivore diet [44]. 461

462
The general over-representation of spore plants in eDNA among taxa only found >2 m from 463 the lake and those not recorded in the catchment vegetation raises the question as to whether eDNA 464 can originate from spores. Spore-plant DNA is well represented in some studies [42,78], is lacking in 465 other studies [15, 37] and has been found as an exotic in one study [41]. As with pollen, the protective 466 coat and low biomass of spores suggest that they are an unlikely source of the eDNA. This inference is 467 supported by clear stratigraphic patterns shown by fern DNA in two lake records from Scotland. 468 Records are ecologically consistent with other changes in vegetation, whereas spores at the same sites 469 show no clear stratigraphy [42]. Preferential amplification could be an alternative explanation, but this 470 is not likely as the amplification of fern DNA from herbarium specimens is poor [34]. It is possible 471 that in some cases, including this study, we are detecting the minute but numerous gametophytes 472 present in soil, which would not be visible in vegetation surveys. 473 Aquatic taxa were detected in all lakes, and they have been regularly identified in eDNA 475 analyses of recent [42] and late-Quaternary lake sediments [15,37,38]. eDNA may be superior to 476 vegetation surveys in some cases, e.g., Potamogeton praelongus, which is characteristic of deeper 477 water https://www.brc.ac.uk/plantatlas/) and was likely overlooked in surveys due to poor visibility. 478 Callitriche hermaphroditica was observed in two lakes (Einleten and Jula javri), whereas C. palustris 479 was observed at Einleten. We cross-checked the herbarium voucher and the DNA sequence and both 480 seems correct, so potentially both were present but detected only in either eDNA or vegetation 481 surveys. Overall, eDNA appears to detect aquatic plants more efficiently than terrestrial plants, which 482 is not unexpected as the path from plant to sediment is short. The potential taxonomic resolution (i.e., for eDNA taxa to be identified to species level) was 504 similar or higher than that for macrofossils [80]  five previous studies of late-Quaternary sediments that compared aDNA with macrofossils and seven 520 that did so with pollen; these showed rather poor richness in aDNA compared to other approaches 521 (reviewed in [10]). We think a major explanation may be the quality and size of available reference Our study supports previous conclusions that eDNA mainly detects vegetation from within a 531 lake catchment area. Local biomass is important, as dominant and common taxa showed the highest 532 probability of detection. For aquatic vegetation, eDNA may be comparable with, or even superior to, 533 in-lake vegetation surveys. Lake-based eDNA detection is currently not good enough to monitor 534 modern terrestrial plant biodiversity because too many rare species are overlooked. The method can, 535 however, detect a similar percentage of the local flora as is possible with macrofossil or pollen 536 analyses. As many true positives are lost in the filtering process, and as even higher taxonomic 537 resolution could be obtained by adding genetic markers or doing full genome analysis, there is the 538 potential to increase detection rates. Similarly, results will improve as we learn more about how 539 physical conditions influence detection success among lakes, and how sampling strategies can be 540