Taxonomic Reference Libraries for Environmental Barcoding: A Best Practice Example from Diatom Research

DNA barcoding uses a short fragment of a DNA sequence to identify a taxon. Once the target sequence has been obtained, it is compared to reference sequences stored in a database in order to assign an organism name to it. The quality of the data in the reference database is the key to the success of the analysis. In the study presented here, multiple types of data were combined and critically examined in order to create best practice guidelines for taxonomic reference libraries for environmental barcoding. 70 unialgal diatom strains from Berlin waters were established and cultured to obtain morphological and molecular data. The strains were sequenced for 18S V4 rDNA (the pre-barcode for protists) as well as rbcL, and identified by microscopy. LM pictures and, for some strains, SEM pictures were taken, and physical vouchers were deposited at the BGBM. 37 freshwater taxa from 15 naviculoid diatom genera were identified. Four taxa from the genera Amphora, Mayamaea, Planothidium and Stauroneis are described here as new. Names, molecular, morphological and habitat data as well as additional images of living cells are also available electronically in the AlgaTerra Information System. All reference sequences (or reference barcodes) presented here are linked to voucher specimens in order to provide a complete chain of evidence back to the formal taxonomic literature.


Introduction
Diatoms are unicellular, usually photoautotrophic microalgae which are responsible for about 25% of global CO2 fixation [1][2][3] and contribute approximately 20% of global net primary production [4].
Diatoms are important bioindicators for monitoring water quality because they are sensitive to changes in pollution, nutrient availability, acidity and salinity, e.g. [5,6]. They are the most ubiquitous group within the microscopic algae as they occur in all types of water bodies and play an important part in benthic and planktonic biocoenoses [7]. They are routinely used as bioindicators within the EU Water Framework Directive (WFD) as well as in water quality monitoring worldwide [8][9][10][11][12][13].
Each diatom cell is encased in two siliceous shells (frustules) that are connected by girdle bands [1][2][3]. Current identification of diatoms is based on a morphological and mostly descriptive species concept (Zimmermann et al. subm.) and relies exclusively on micro-characters of the frustule such as size, symmetry, shape, and sculpture which can be seen by light microscopy [14]; more detailed analyses of the siliceous structures lead to more and more refined differentiation of species, which is possible through the development of higher resolution techniques, e.g. electron microscopy.
Identification via microscopy is challenging and time consuming, especially for routine use [15], and relies on individual taxonomic expertise. Different taxonomists may therefore arrive at different conclusions, depending among other things on the taxonomic concept applied, species with limited diagnostic morphological features, cryptic species, the reference floras available and the quality of the microscopes used by each individual researcher [15], as well as on the unavailability of adequate descriptions.
The application of molecular markers for taxon identification (DNA barcoding) is an emerging method which has the potential to be faster and universally applicable, and to generate reliable identifications. Furthermore, as it uses DNA sequences for identification, it is independent of pre-existing morphological species concepts and can be linked to any taxonomic concept [16]. However, correct identification relies fundamentally on the quality of the reference library the DNA barcodes are checked against. DNA barcoding is based on the assumption that sequences of a certain marker locus exhibit enough variation between species to be discriminative for unambiguous species discovery [17,18]. DNA barcoding is also a useful tool to access concealed diversity, e.g. [19][20][21][22][23][24][25]. In combination with next generation sequencing techniques, DNA barcoding also allows community compositions to be described through the large numbers of sequences generated by this approach, e.g. [26,27]. A schematic overview of environmental DNA barcoding of diatoms and the establishment of a reference library is given in Fig. 1.
The requirement for reliable taxon identification by DNA barcode(s) is an unambiguous link between the genotype and the phenotype (or morphotype) to which the name of the species is attached. This requires a reference library in which, for every single species, a taxon name based on specimens identified by experts is provided together with descriptions and with barcode sequences derived from well documented strains (e.g. voucher deposition, sampling localities and collectors, basic environmental data, high-resolution LM pictures, morphometrics, taxonomy and nomenclature, maps, literature and references to databases where this data is deposited). For unicellular diatoms, clone cultures (strains) need to be established which offer enough material for sequencing as well as for identification by light and electron microscopy. Once established and linked to a taxonomic reference library, the DNA barcoding method could offer a time and cost efficient alternative or extension to microscopic identification for routine applications, by limiting morphological taxonomy to critical groups which show a distinct genetic divergence from the known and identified organisms in the library.
Recently, the CBOL Protist Working Group [28] has designated the 18S V4 rDNA marker region as first or pre-barcode for Protist organisms. In this paper, we follow the 18S V4 protocols designed for diatoms by Zimmermann et al. [19], and present 70 strains for which this pre-barcode (18S V4) as well as a second widely used barcode, rbcL [20,21,29], has been generated. The reference library includes these two DNA barcodes, the respective taxon name, images, morphometric and geographic data as well as vouchers for further reference. Further data and additional images also of living cells are available electronically through the AlgaTerra Information System [30]. We demonstrate the benefits of a well documented reference library for DNA barcoding for identification, taxonomy, phylogeny, and further scientific analyses on an exemplary group. This paper focuses on naviculoid diatom strains from Berlin waters since its diatom flora has been well studied for almost two centuries by light microscopy [31] and a recent diatom flora is available for water quality assessments [32].

Sampling
Benthic samples from which the 70 strains were established were collected at 11 sites in the catchment area of Berlin (Fig. 2); one additional sample was from the River Elbe, downstream of the Berlin rivers Spree and Havel. Conductivity of Berlin water ranges mostly between 400 and 900 µS cm⁻¹, and pH is frequently 6.5 to 9 (80% and 88%, respectively, of about 300 measurements of Berlin water samples, Kusber unpubl. data). For samples, sites, dates, collectors of the samples and isolators of the strains see Table 1. No specific permissions were required for the sampled locations/activities. The field studies did not involve endangered or protected species.

Cultivation
The diatom cells were isolated from environmental water samples observed under a stereo light microscope using capillary glass pipettes. The respective cell was then transferred to a 5 cm diameter plastic petri dish containing autoclaved habitat water and/or culture medium (WC [33], Chu [34], AlgaGrow, Plagron, Weert, Netherlands) of adequate salinity and pH. In order to remove unwanted particles, this treatment was repeated several times until microscopic inspection confirmed that a culture derived from a single cell (unialgal, but not axenic) had been established. The cultures were grown at a temperature of 18-22 °C with a 12 h day/night cycle.

Preparation of frustules
When the cultures were harvested, one fraction was used for obtaining DNA (see below) and the other part was cleaned with H2O2 at 80 °C and rinsed several times with H2O. A few drops of the resulting suspension of diatom frustules were dried on a cover slip and embedded as slides in Naphrax for study in LM, or mounted on stubs for SEM. Vouchers of each strain were deposited in the Herbarium Berolinense (B) (see Table 2).

Light and electron microscopy
The LM pictures were acquired with a Zeiss Axio Imager.M2 with an implemented AxioCam HRc (Zeiss, Oberkochen, Germany). SEM pictures were produced with a Philips SEM 515 operating at 30 kV (Philips, Eindhoven, The Netherlands) and a Hitachi 8010 Field Emission Electron Microscope (Hitachi, Tokyo, Japan).

DNA isolation
The harvested cultures were transferred to 1.5 ml tubes. DNA was isolated using Dynal DynaBeads (Invitrogen Corporation; Carlsbad, CA, USA), NucleoSpin Plant II Mini Kit (Macherey-Nagel, Düren, Germany) or Qiagen DNeasy Plant Mini Kit (Qiagen Inc.; Valencia, CA) following the respective product instructions. DNA concentrations were checked using gel electrophoresis (1.5% agarose gel) and Nanodrop (PeqLab Biotechnology LLC; Erlangen, Germany). DNA samples were stored at −20 °C until further use. DNA material was deposited in the Berlin collection of the DNA bank network [39].

PCR amplification
The V4 region of the 18S locus was amplified in all strains with the primer pair M13F-D512 for 18S/M13F-D978rev 18S [19]. The rbcL locus was amplified in two overlapping parts for all strains, using two different primer pairs: Diat-rbcL-F and Diat-rbcL-iR as well as Diat-rbcL-iF and Diat-rbcL-R [40]. The polymerase chain reaction (PCR) for the V4 region was conducted after Zimmermann et al. (2011) [19] and for rbcL after Abarca et al. (2014) [40]. PCR products were visualised in a 1.5% agarose gel and cleaned with MSB Spin PCRapace (Invitek LLC; Berlin, Germany) following standard procedure. DNA content was measured using Nanodrop (PeqLab Biotechnology). The samples were normalised to a total DNA content >100 ng/µl using Nanodrop (PeqLab Biotechnology) for further sequencing.

Molecular analysis
The aligned sequences were compared to each other calculating uncorrected p distances in PAUP [45]. Then they were blasted against existing INSDC entries for the respective taxa (accessed July 2013). All INSDC accessions with references are given in Appendix S1. Base pair differences were counted in overlapping parts of the sequences in Mega 5 [46]. Results are summarised in Table 3.
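As an illustration of the quantities used here, the following minimal Python sketch computes the number of base pair differences and the uncorrected p distance for two aligned sequences, counting only overlapping positions. It is not the PAUP or Mega implementation; the function name `p_distance`, the set of skipped characters and the toy sequences are assumptions made for this example.

```python
def p_distance(seq_a: str, seq_b: str, skip_chars: str = "-?N"):
    """Return (differences, compared_sites, p) for two aligned sequences.

    Positions where either sequence has a gap or an undetermined base are
    skipped, so only the overlapping part of the two sequences is compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue  # not part of the overlap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no overlapping sites to compare")
    return differences, compared, differences / compared


# Toy alignment (hypothetical data): one gap column, one mismatch.
result = p_distance("ACGT-ACGTA", "ACGTTACGTT")
```

The uncorrected p distance is simply the proportion of compared sites that differ, without any substitution model correction.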

Tree building
To identify molecular relationships between the strains presented here, trees were calculated with Mega 5 using the Neighbour Joining algorithm with gamma distributed rates among sites, followed by a statistical test of the tree topologies with 10 000 bootstrap replications. Trees were calculated for the individual alignments of the 18S V4 and rbcL sets as well as for a concatenated dataset.
Furthermore, we created 18S V4 as well as rbcL datasets including INSDC sequences for the genera Amphora, Mayamaea, Planothidium and Stauroneis to exemplarily test the taxonomic consistency of available sequences as well as the placement of our new taxa. Each of these eight datasets was analysed under the aforementioned conditions.
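The bootstrap test mentioned above rests on resampling alignment columns with replacement and rebuilding the tree for each pseudo-replicate. The sketch below shows only that resampling step and the conversion of clade counts into a support value on the 0 to 1 scale used in this paper; it does not build Neighbour Joining trees and is not how Mega 5 is implemented internally. The names `bootstrap_alignment` and `support` and the toy alignment are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Return one pseudo-replicate: columns drawn with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # The same column indices are applied to every sequence, so columns stay intact.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def support(clade_hits: int, replicates: int) -> float:
    """Bootstrap support as the proportion of replicates recovering a clade."""
    return clade_hits / replicates


# Hypothetical two-sequence alignment, fixed seed for reproducibility.
toy = {"strain_a": "ACGTAC", "strain_b": "ACGAAC"}
replicate = bootstrap_alignment(toy, random.Random(1))
```

A clade recovered in, say, 980 of 1000 pseudo-replicates would receive a support of 0.98.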

Nomenclature
The electronic version of this article in Portable Document Format (PDF) in a work with an ISSN or ISBN will represent a published work according to the International Code of Nomenclature for algae, fungi, and plants, and hence the new names contained in the electronic publication of a PLOS ONE article are effectively published under that Code from the electronic edition alone, so there is no longer any need to provide printed copies. The online version of this work is archived and available from the following digital repositories: PubMed Central, LOCKSS, http://edocs.fu-berlin.de/docs/content/below/index.xml.

Morphological analyses
The morphological identification of the 70 strains resulted in 37 taxa (see Table 2 and Figs. 3 and 4). 21 taxa were identified by only one strain but 10 taxa were represented by two strains, three taxa by three strains, one taxon by four strains, one taxon by five strains and one taxon by 11 strains.
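The distribution of strains over taxa given above can be checked with simple arithmetic. The short Python sketch below tallies the counts quoted in the text; the variable names are chosen for this example only.

```python
# Number of strains per taxon -> number of taxa represented by that many strains
# (values taken from the paragraph above).
taxa_by_strain_count = {1: 21, 2: 10, 3: 3, 4: 1, 5: 1, 11: 1}

total_taxa = sum(taxa_by_strain_count.values())
total_strains = sum(n_strains * n_taxa
                    for n_strains, n_taxa in taxa_by_strain_count.items())
```

Both totals agree with the figures reported: 37 taxa and 70 strains.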

DNA sequence analyses
PCR and sequencing success for 18S V4 and rbcL was 100% for all strains, resulting in 140 reference sequences for 70 strains. We established 129 novel sequences (INSDC accession numbers KM084866-KM084994) and an additional 11 sequences that had been previously published in Abarca et al. [40] and Zimmermann et al. [19].
There was little molecular variation within the sequence data generated here between the different strains representing one taxon: only up to 0.5% in 18S V4 (representing 2 bp) and 0.3% in rbcL (corresponding to 3 bp) (Appendix S2). The highest intra-taxon variation was found, for example, in Mayamaea terrestris with 0.53% (18S V4) and in Navicula cryptocephala with 0.33% (rbcL). The uncorrected p distances for all genera and sequences are given in Appendix S2.
The results from sequence comparison with sequences published in the databases of the International Nucleotide Sequence Database Collaboration (INSDC, includes GenBank, EMBL and DDBJ) are shown in Table 3 and summarised in Fig. 5a, 5b. In the case of 18S V4, 22% of our taxa had entries with identical sequences in the INSDC whereas for rbcL this number was 21% (Fig. 5b). This was the case e.g. for Caloneis silicula and Navicula cryptotenella (Table 3). 22% (18S V4, Fig. 5a) and 25% (rbcL, Fig. 5b) of our taxa had no entry in the INSDC databases, e.g. Amphora ovalis and Luticola sparsipunctata (Table 3). For 15% of our taxa an identical 18S V4 sequence (Fig. 5a) with a different taxon name was found in the INSDC databases (e.g. Gomphonema parvulum); the number was considerably lower in rbcL with only 4% (Fig. 5b). The remaining taxa, many of which showed sequence dissimilarities of over 15 bp, accounted for 41% in 18S V4 (Fig. 5a) and 50% in rbcL (Fig. 5b). The highest difference was found for Pinnularia viridiformis with 97 bp in 18S V4 (Table 3).
The tree derived from the concatenated data set and calculated by the Neighbour Joining (NJ) algorithm, including only the strains presented here, is shown in Fig. 6; the trees of the individual analysis of both markers are given in the Appendix S2. The molecular clades are congruent between 18S V4 and rbcL, but the tree topology differs partly between the two markers (Appendix S3, S4); however, the conflicting nodes have bootstrap values below 0.85 and are therefore neglected.
In the tree derived from the combined dataset, the sampled genera are monophyletic and well supported (>0.98 bootstrap support BS, Fig. 6), except for Caloneis, Craticula and Sellaphora.
Craticula buderi falls into a clade with the genera Stauroneis and Karayevia (0.48 BS; Fig. 6). Sellaphora falls into one group with Eolimna (0.98 BS; Fig. 6). The genus Caloneis is found in two distinct clades: Caloneis silicula clusters with Pinnularia (0.61 BS; Fig. 6), whereas Caloneis amphisbaena forms an independent clade on its own (1.00 BS; Fig. 6). The deeper bifurcations representing the relationship between the genera are generally not well supported by bootstrap values. All 37 subgeneric taxa included in this study are monophyletic (Fig. 6).
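The outcome categories used in this comparison can be expressed as a small classification rule. The following Python sketch is one possible reading, not the procedure used to produce Fig. 5: the function name `classify_comparison`, the class labels and in particular the 1-5 bp bin are assumptions for illustration (the text itself only names identical sequences, missing entries, differences between 6 and 15 bp, and differences above 15 bp). The dictionary repeats the shares quoted above so that their totals can be checked.

```python
from typing import Optional


def classify_comparison(bp_diff: Optional[int], same_name: bool = True) -> str:
    """Sort one strain-versus-INSDC comparison into a difference class."""
    if bp_diff is None:
        return "no INSDC entry"
    if bp_diff < 0:
        raise ValueError("bp_diff must be non-negative")
    if bp_diff == 0:
        return "identical, same name" if same_name else "identical, different name"
    if bp_diff <= 5:
        return "1-5 bp"  # assumed bin, not spelled out in the text
    if bp_diff <= 15:
        return "6-15 bp"
    return ">15 bp"


# Shares of taxa per outcome as quoted in the text (percent).
shares = {
    "18S V4": {"identical, same name": 22, "no INSDC entry": 22,
               "identical, different name": 15, "remaining": 41},
    "rbcL": {"identical, same name": 21, "no INSDC entry": 25,
             "identical, different name": 4, "remaining": 50},
}
totals = {marker: sum(parts.values()) for marker, parts in shares.items()}
```

For each marker the four shares add up to 100%, and the 97 bp difference reported for Pinnularia viridiformis falls into the highest class.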
The trees for the genus Amphora including all available data from INSDC databases (this also includes accessions from the genus Halamphora) are shown in Fig. 7a (18S V4) and Fig. 7b (rbcL). The Amphora ovalis strains (Amph1, Amph4, Amph5, D45_003 and TeAm01) form a monophyletic clade that is well supported in both 18S (0.99 BS) and rbcL (0.97 BS). The strain HSB02, identified as Amphora berolinensis, appears to be rather isolated within the Amphora tree, except for an affiliation with the unidentified strain C10 (INSDC accession number FJ002132) in the rbcL tree (0.89 BS; Fig. 7b). All strains identified as Amphora pediculus cluster in one clade in 18S V4 (0.90 BS; Fig. 7a) and rbcL (Fig. 7b). This also includes the strain D54_002 named Amphora sp. aff. atomoides. The tree derived from rbcL sequences also includes the strain AT-21.206 (INSDC accession number AN502022) identified as Amphora cf. fogediana (Fig. 7b), which forms a branch with strain s0992 named Amphora copulata (INSDC accession number AB754831) in 18S V4 adjacent to the Amphora pediculus clade (Fig. 7a). With respect to the other strains available from the INSDC databases, the trees show no topology consistent with the taxonomic identifications (Fig. 7a, 7b). Several taxa, including the species Amphora coffeaeformis, Amphora normannii and Amphora montana, were recently transferred to the genus Halamphora [37]; these taxa, and also the two INSDC accessions listed as Halamphora (numbers AB754832, AB754833), form a loose cluster in the upper part of the 18S V4 tree (Fig. 7a). The rbcL data set supports an independent clade for the taxa of the genus Halamphora (Amphora coffeaeformis, Amphora normannii, Am...; Fig. 7b). However, within the Halamphora clade the strains identified as Amphora coffeaeformis are not monophyletic (Fig. 7b).

Nomenclatural and taxonomical consequences
Two new taxa, Amphora berolinensis and Stauroneis schmidiae, were first discovered by morphological means. The analysis of molecular data suggested the existence of two more previously undetected taxa that could later also be confirmed morphologically (Mayamaea terrestris, Planothidium caputium). For another two taxa the morphological data are incomplete (teratological outline, micro-morphological data missing), but the molecular data show that both differ from the identified taxa in the genus; these strains are named sp. (Amphora sp. aff. atomoides), and in one case we used the term cf. (Amphora cf. pediculus) to show that it is closely related to a known taxon.

Amphora cf. pediculus
The strains D03_063 & D03_082 are morphologically very similar to our Amphora pediculus D03_074 but have double areolae in each ventral stria and not only a single elongated areola like A. pediculus. The specimens of these strains have a similar valve outline as A. indistincta, but in SEM the differences are more distinct because in A. indistincta the width of the central and dorsal side is almost equal and the striae are composed of elongated areolae.

Amphora sp. aff. atomoides Levkov
The strain D54_002 has a semi-elliptical valve with arched dorsal margin, concave ventral margin and narrowly rounded valve ends. Valve length is 10-12.4 µm, breadth 4.6-5 µm. The central area on the dorsal side is a rectangular fascia almost extending to the dorsal margin; on the ventral side the much broader fascia expands towards the valve margin. Raphe branches linear, filiform. Proximal raphe endings straight, distal raphe endings ventrally deflected. Dorsal striae radiate throughout, 16 in 10 µm.
This species closely resembles A. atomoides but differences can be observed in the shape of the central area and valve breadth (7-11 µm in A. atomoides). In A. atomoides the central area on the dorsal side is small or absent and does not extend to the valve margin, contrary to our Amphora sp. aff. atomoides, where the central area presents a rectangular fascia almost extending to the dorsal margin. D54_002 also resembles A. pediculus with respect to its valve shape and size. However, D54_002 can be differentiated by the valve width (A. pediculus is narrower with 2.5-4 µm), the central area (A. pediculus has a dorsally deflected distal raphe and a central area with a rectangular fascia extending to the dorsal valve margin) and the stria density (A. pediculus has more striae, 18-24 in 10 µm). D54_002 can also be differentiated from A. minutissima by the shape of the valve apices (ventrally bent in A. minutissima). Additional observations of more specimens by SEM would be necessary to establish the proper identity of this population from Heiligensee, Berlin.
Amphora berolinensis differs from A. copulata (Kützing) Schoeman & Archibald because the latter has bigger valves (19-42 µm length, 5-7.5 µm breadth). In SEM the differences are more distinct. Differences can be observed in the shape of the central area (bordered by striae close to the valve margin in A. copulata), the raphe (biarcuate in A. copulata) and the morphology of the dorsal striae (crossed by longitudinal bars in A. copulata). A. berolinensis also differs from A. neglectiformis Levkov & Edlund by the larger valves of the latter (18-53 µm length, 5-7 µm breadth) and the ventral striae, which are composed of two areolae near the valve ends in A. neglectiformis.
The valves of Amphora berolinensis are semi-lanceolate to semi-elliptical, with smoothly arched dorsal margin and straight to slightly concave ventral margin, valve ends rounded. Valve length is 9.5-18.9 µm, breadth 4.7-5.2 µm. Axial area is narrow, slightly arched. The central area on the dorsal side has a rectangular fascia extending to the dorsal margin; on the ventral side the fascia is wider, expanding towards the valve margin. Raphe is filiform and more or less straight; in some valves the proximal raphe endings are straight, in others they are dorsally bent, and the distal raphe endings are straight or, in some valves, ventrally bent. Dorsal striae are coarsely punctate and radiate throughout, 12-14 in 10 µm. Ventral striae are radiate, composed of one areola.
The valves of Mayamaea terrestris are narrow linear-elliptical, ends obtusely rounded. Valve length is 7-8.7 µm, breadth 3-4.5 µm. Striae are radiate throughout, 22-24 (-26) in 10 µm with c. 50 areolae in 10 µm. Raphe is filiform, the two branches are gently arcuate with distinct central pores. Axial area is slightly broad, widening lanceolately towards the middle of the valve. Central raphe ends expanded by depressions around the central pores and deflected, while the ends of the terminal raphe fissures are deflected to the opposite side.
This new species lives in soil; this is signified by the epithet name.
10 further strains have only low sequence differences for 18S V4 and rbcL (Appendix S2) and form a clade clearly different from all the other available Mayamaea strains (Fig. 7c, 7d).
Morphologically, Planothidium caputium has a similar outline to Planothidium lanceolatum but differs from it by a hood over the depression on the rapheless valve, as in P. frequentissimum. The difference to P. frequentissimum lies in the form and size of the hood, which is bigger, longer and wider in P. caputium than in P. frequentissimum and has a wider opening; this results in a line-like instead of a horseshoe appearance when focusing through the hood. The uncorrected p-distances show that Planothidium caputium sequences differ by at least 2.4% (18S V4) and 2% (rbcL) from Planothidium frequentissimum, and by 6% (18S V4) and 4% (rbcL) from Planothidium lanceolatum (Appendix S2); this is also represented in the trees including all available Planothidium strains (Fig. 8a, 8b).
Valves are elliptical to elliptic-lanceolate, with rounded apices. Valve length is 20-22.9 µm, breadth 5.5-6.4 µm. The striae are radiate on both valves, becoming more radiate towards the apices, with 13-14 in 10 µm. Striae are multiseriate with three to five rows of areolae per stria. The axial area is narrow and linear to lanceolate in both valves. The raphe valve has a weak central area; the rapheless valve has a horseshoe-shaped collar on one side, in which another, less arched line can be recognized by focusing in LM (see also Straub 1990 [49]).
Strain D06_113 also belongs to this species.
Morphologically, Stauroneis schmidiae differs from Stauroneis borrichii (Petersen) Lund, which has a similar valve outline but with protracted ends, because the latter is shorter and more slender and has more striae. Valves are linear-lanceolate with very slightly rounded nonprotracted ends. Valve length is 27-28.2 µm, breadth 5.5-6 µm.
Striae are radiate throughout the entire valve, 15-18 in 10 µm. Puncta of the striae are discernible in LM and are 24-28 in 10 µm. Pseudosepta present.
Strain D28_002 also belongs to this species. Compared to the other available Stauroneis strains, Stauroneis schmidiae clusters independently for both markers (Fig. 8c, 8d).
Figure 5. Chart giving classes of base pair (bp) differences for both markers (18S V4, rbcL) between the molecular data presented here and corresponding data from INSDC databases. Inferred from data in Table 3.
This species is named in honor of Prof. Dr. AnnaMaria Schmid who was an inspiring diatom teacher to Regine Jahn.

Discussion
The 37 naviculoid diatom taxa for which reference barcodes are published here represent only about 7% of the total diatom flora recorded for Berlin waters (539 taxa, see [31]) and 14% of its naviculoid taxa. Nevertheless, this is a first milestone in characterising diatoms not only by morphological but also by molecular means, and it represents the start of a taxonomic reference library for diatoms.
Identification via DNA sequences is an important tool, especially in microorganisms. Many of the large scale environmental DNA barcoding studies in protists so far rely on higher taxonomic levels of families and above; only rarely do they reach a resolution at genus level. In diatoms, assignment to genus level is unproblematic [51,52]. Even identification to the species level is possible, but strongly depends on the quality of the reference database [52][53][54][55][56][57][58]. We here tested the taxonomic consistency of naviculoid diatom taxa at the species level by comparing our identified sequences with the published sequences in the repositories of the INSDC. We found that the taxonomic assignment in INSDC is currently unsatisfactory, because it is often erroneous. In the data we analysed for the two commonly used DNA barcoding markers for diatoms, 18S V4 and rbcL, 26% of the rbcL sequences listed under the same name as our strains differed by more than 15 bp (Fig. 5b); for 18S V4 this was 12% (Fig. 5a). For the 800 bp long rbcL fragment, a 15 bp difference amounts to roughly 2% sequence difference; in the shorter (400 bp) 18S V4 fragment, a 15 bp difference corresponds to as much as 4%. The relatively high percentage of differences in these short DNA fragments suggests that the sequences belong to a different taxon. This implies morphology-related misidentification, mislabelling or cross-contamination. In an additional 16% (rbcL, Fig. 5b) and 5% (18S V4, Fig. 5a) of cases, sequences with the same taxon name showed differences between 6 and 15 bp; here it is unclear whether these strains belong to a different taxon (a closely related cryptic species) or whether they reflect natural intraspecific variation. Furthermore, we found that in 4% (rbcL, Fig. 5b) and 15% (18S V4, Fig. 5a) of the cases, identical sequences in the repositories of the INSDC were annotated with a different taxon name than the strains of this study. These sequences therefore provide an erroneous identification. In summary, the unevaluated use of information deposited in the INSDC leads to wrong identifications in at least 30% of the cases; in only about 20% of our cases did the identifications coincide unambiguously.
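The conversion between base pair differences and percentage sequence difference used in this paragraph is a simple ratio. The following Python sketch reproduces it for the approximate fragment lengths quoted in the text (800 bp for rbcL, 400 bp for 18S V4); the function name `percent_difference` is chosen for this example.

```python
def percent_difference(bp_diff: int, fragment_length: int) -> float:
    """Express a number of differing base pairs as a percentage of the fragment."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return 100.0 * bp_diff / fragment_length


rbcl_percent = percent_difference(15, 800)  # 1.875, i.e. roughly 2%
v4_percent = percent_difference(15, 400)    # 3.75, i.e. roughly 4%
```

The same absolute threshold of 15 bp therefore corresponds to about twice the relative divergence in the shorter 18S V4 fragment.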
Unfortunately, in most cases it is not possible to trace the DNA sequence to the specimen from which it originated and, because of lacking voucher specimens, taxonomic evaluation is not possible; hence there are no means to verify whether a faulty taxon assignment had occurred or an interesting biological phenomenon. Therefore such sequences are of no future use and valuable information is lost to science. Assessment of diatom community composition through environmental DNA barcoding could greatly benefit from better documented reference libraries, especially because biodiversity in general should be evaluated at least on the species level [59].
Furthermore, the linkage between historically and morphologically described taxa and molecular sequences is not very strong. A possible threat is that two independent data clouds might develop [60]: one including large amounts of molecular data from environmental sequencing, the other species specific data (e.g. paleontological and recent distribution, ecology, phylogeny) linked to morphological descriptions. For organism groups where next to no morphology based data exist (e.g. many groups of bacteria), there is little harm if the information in the two clouds cannot be correlated. However, in groups like diatoms, where two centuries of data collection linked to morphologically described species exists, it would be a waste of painfully acquired data not to link these two groups of data. At the moment, this link would be a reference sequence that is connected to a morphological voucher (and DNA sample) deposited in a natural history collection and therefore available for multiple testing and verification of results as well as for long-term studies.
We here define a taxonomic reference library as an entity combining molecular data (in our case DNA sequence data of two markers) with morphological documentation of important features as well as a valid name. Environmental information on the collecting site should also be provided in a standardised format.
Documentation should also include the deposition of DNA in a curated repository. To ensure traceability of a name/sequence back to the specimen it originated from, morphological details important for identification should be provided in an online photographic documentation, this includes high-resolution photographs giving an overview of the cell as well as details produced by electron microscopy or comparable techniques. Another special aspect for diatoms (and some other microorganism groups) is that many sequences derive from cultured clonal strains, especially if they are linked to morphological entities. Therefore, the strain number and other strain specifications are valuable information that should be presented along with the sequence.
Ideally, all the necessary information for traceable taxonomic classification should be available in a single data portal; however, at the moment there are several technological limitations to deposit and/or respectively retrieve all the information in and from one location. The Consortium for the Barcode of Life (CBOL) aims at compiling DNA barcode records in a public library (Barcoding of Life Database BOLD) [53] and even designed a Barcode Submission Tool for submitting sequences to the INSDC databases. However, this tool is limited to one marker, namely the mitochondrial cytochrome oxidase subunit I (COI) e.g. [17,[61][62][63][64][65]. For many groups, e.g. plants [66] but also diatoms, this barcoding marker is not routinely applicable [19,[21][22][23][24][25], albeit there are BOLD supported activities to implement alternative solutions for some organism groups e.g. [28]. On the other hand, the Barcode Submission Tool provides possibilities to at least upload a pherogram (output of sanger sequencing), but no pictures of the organisms can be stored. Therefore, this tool does not require a link to a morphological voucher (digital and physical), which would allow for subsequent taxonomic validation. Also a link to a herbarium specimen is only indirectly possible if the accession number of the specimen collection is given and the respective collection has their specimen picture online available. Although, it seems generally possible to deposit pictures and other data along with the DNA sequence in BOLD [53], unfortunately, the data deposited within BOLD is often not open access, depending on the rights given by the administrator. Also, we heard reports that data is not released to the public even if requested by the author. In conclusion it would be preferable if INSDC would extend their service, as they are the most commonly used platform to deposit sequence data [58].
Here we present our strategy for documenting a comprehensive reference database for diatoms even with the currently limited IT infrastructure. The materials and data presented here have been documented as follows. The physical vouchers (microscopic slides and SEM stubs) have been deposited in the Berlin Herbarium (B), and the DNA in the DNA bank network of the Botanic Garden and Botanical Museum Berlin-Dahlem [39]. The data for both items are made available through the Global Genome Biodiversity Network (GGBN [67]) and the Global Biodiversity Information Facility (GBIF [68]). The sequences have been submitted to an INSDC database (EMBL) along with strain numbers, voucher numbers from the Berlin Herbarium (B) and DNA bank numbers. Primer details and geo-references have also been deposited there. Photographic documentation is available online from the AlgaTerra Information System [30], linked through the INSDC accession number and the accession number from the Berlin Herbarium. Morphological characters, cultivation details and sampling data of the collecting sites beyond the geo-references (e.g. ecological specifications) have also been deposited in the AlgaTerra Information System [30].
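The chain of links described above can be pictured as one record per reference sequence. The following Python sketch is only an illustration under stated assumptions: the class name, the field names and the helper missing_links are hypothetical and do not reflect the actual schema of INSDC, GGBN, GBIF or the AlgaTerra Information System, and all identifier values in the usage lines are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class ReferenceBarcodeRecord:
    """Hypothetical record bundling the links listed in the text above."""
    taxon_name: str
    marker: str                                   # e.g. "18S V4" or "rbcL"
    strain_number: Optional[str] = None           # clonal culture identifier
    insdc_accession: Optional[str] = None         # sequence accession (EMBL)
    herbarium_voucher: Optional[str] = None       # slide or SEM stub number
    dna_bank_number: Optional[str] = None         # DNA bank identifier
    primer_details: Optional[str] = None
    georeference: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    image_record: Optional[str] = None            # link to online photographs


def missing_links(record: ReferenceBarcodeRecord) -> list:
    """Return the names of all fields that are still empty."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]


# Usage with placeholder identifiers (not real accession numbers).
record = ReferenceBarcodeRecord(
    taxon_name="Mayamaea atomus var. atomus",
    marker="18S V4",
    strain_number="AT-115Gel07",
    insdc_accession="PLACEHOLDER-ACCESSION",
    herbarium_voucher="PLACEHOLDER-VOUCHER",
)
print(missing_links(record))
```

A check of this kind makes gaps in the chain of evidence visible before a sequence is treated as a reference barcode; which fields are mandatory would be a decision for the curating institution.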
A carefully documented reference sequence could be considered as something similar to a molecular type of the name of a species. Biological taxon types should be documented with a maximum amount of data, so that every researcher can determine whether a specific specimen belongs to the concept of the designated type. In the botanical [69] and zoological [70] codes of nomenclature, the basis for species description is the deposition of a physical specimen. A reference sequence or reference barcode should be similarly well documented.
Biological identification systems are in constant development; therefore, a continuous process of confirmation, validation and updating in relation to alpha taxonomy is required to build a comprehensive and accurate reference library. Protocols for data curation and revision are indispensable for new species discovery as well as for taxonomic revisions. Entries in a taxonomic reference library (e.g. in an extended INSDC-like system) therefore need to be curated and updated in order to be in line with current taxonomy. However, a huge impediment to curation of the data by the respective author, once they have been submitted, is that there is no reward system for researchers who curate their data [71]. It has been shown that incentives for publishing thoroughly documented datasets, similar to those for publishing the conclusions drawn from them, could greatly increase researchers' motivation to publish datasets [71]. Another approach would be for data curation to be carried out by professional personnel employed for this purpose, or by a combination of both approaches.
Not only DNA barcoding approaches would benefit from well documented and referenced molecular data, but also taxonomic and phylogenetic studies of diatoms, which could integrate published data more efficiently if better documentation linked to physical objects were available [72]. For example, the clusters found for the genus Mayamaea, based on available 18S V4 and rbcL sequences, show low taxonomic consistency (Fig. 7c, 7d). The INSDC data suggest that there are different groups of Mayamaea (atomus var.) permitis, and among the Mayamaea atomus (var. atomus) sequences there is one sequence named Mayamaea fossalis var. fossalis (Fig. 7c, 7d, black and red). For two of the AT strains included in the Mayamaea analysis, additional data are available from the AlgaTerra Information System [30] (Fig. 7c, 7d, green): (a) more taxonomic detail is given than is deposited alongside the sequence in INSDC (strain AT-115Gel07 is identified as Mayamaea atomus var. atomus and AT-101Gel04 as Mayamaea atomus var. permitis), and (b) photographs with morphological details are provided. The identification of both strains could therefore be checked and verified. Even though additional data are available from the AlgaTerra Information System [30] for only two strains, this already aids the interpretation of the trees given in Fig. 7c and 7d, especially the tree based on 18S V4. There is a cluster of Mayamaea permitis (Syn. Mayamaea atomus var. permitis), incl. strain AT-101Gel04, and one strain (AT-115Gel07) belonging to Mayamaea atomus var. atomus (Fig. 7c, green). As Mayamaea permitis (Syn. Mayamaea atomus var. permitis) has been raised to species rank for morphological reasons (see above), this allows the interpretation that Mayamaea fossalis could be an independent taxon (Fig. 7c, green). For the tree based on rbcL, however, only an informed guess can be made: for two strains, namely (Wes2)f and AT-199Gel01, no additional data are available to check the identification (Fig. 7d).
If (Wes2)f was misidentified and AT-199Gel01 belongs to Mayamaea permitis, four independent taxa could again be assumed: Mayamaea atomus, Mayamaea fossalis, Mayamaea permitis and Mayamaea terrestris. This example, particularly the differing interpretations possible for the 18S V4 and rbcL trees, clearly shows how valuable additional data can be for the interpretation of sequence-based analyses.
Because species descriptions in diatoms are based on morphology derived from microscopic pictures (of variable quality) of a single valve, or a limited number of valves, from a presumed population in mixed samples, it is often difficult to identify a strain unambiguously. Even within a single clonal culture, morphological variation sometimes partly fits different species circumscriptions [45]. In addition, clonal cultures are often at the lower end of the size range given in a taxon description; if cultured for too long without auxosporulation taking place, diatom valves tend to lose their typical morphological features because they get smaller with each cell division. This raises the problem of how to link sequences derived from cultures to a type specimen, or at least to a current species concept. If a type specimen is designated, this can be achieved e.g. through epitypification, as has been done for Cocconeis pediculus and C. placentula [73,74]. In most cases, however, this will be done in the context of a taxonomic revision of a species group, as e.g. for Gomphonema saprophilum [45], and it needs to be done for the two unidentified Pinnularia species of this study. For the purpose of a reference library, if no unambiguous identification seems possible, there are three options: the sequence can be designated as belonging to a certain "formenkreis" (taxon group), marked as affine (e.g. Amphora sp. aff. atomoides); it can be designated as not exactly fitting the original description, marked as confer (e.g. Amphora cf. pediculus) [http://bionomenclature-glossary.gbif.org/]; or a new taxon has to be described formally, along with providing the reference sequence (e.g. Amphora berolinense). The first two options are a practical way to make re-users of the data aware of an "uncertainty level" concerning the taxonomic identification; this is better than providing no guidance to the species group by giving just the genus name, such as Amphora sp.
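The naming options above can be expressed as a small formatting rule. The Python sketch below is a hypothetical helper, not part of any existing database software; the function name qualified_name and its parameters are assumptions made for illustration, and the output patterns simply follow the examples given in the text.

```python
from typing import Optional


def qualified_name(genus: str,
                   epithet: Optional[str] = None,
                   qualifier: Optional[str] = None) -> str:
    """Format a taxon name with an optional uncertainty qualifier.

    qualifier may be None (accepted identification), "cf." (confer)
    or "aff." (affine), following the examples in the text.
    """
    if qualifier not in (None, "cf.", "aff."):
        raise ValueError(f"unsupported qualifier: {qualifier!r}")
    if epithet is None:
        if qualifier is not None:
            raise ValueError("a qualifier needs a species epithet to refer to")
        return f"{genus} sp."          # genus-level name only
    if qualifier == "aff.":
        return f"{genus} sp. aff. {epithet}"
    if qualifier == "cf.":
        return f"{genus} cf. {epithet}"
    return f"{genus} {epithet}"


print(qualified_name("Amphora", "atomoides", "aff."))  # Amphora sp. aff. atomoides
print(qualified_name("Amphora", "pediculus", "cf."))   # Amphora cf. pediculus
print(qualified_name("Amphora"))                       # Amphora sp.
```

Storing the qualifier as a separate field, rather than only inside the name string, would let re-users filter a reference library by the level of taxonomic uncertainty.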
As we documented in this study, the marine or halophilic species of Amphora sensu lato have been recently moved into the genus Halamphora; for a freshwater reference library, this is important ecological data. In addition, this information might become valuable for the interpretation of taxonomic discrepancies

Conclusions
As shown here by way of example for some naviculoid diatoms, taxonomic reference libraries could serve as an online accessible and algorithmically searchable equivalent to commonly used printed identification literature. They are needed to link molecular-based identification technologies with correct organism references. However, searchable databases have so far often included large percentages of wrongly annotated sequences and provide no possibility of tracing the identification back to the respective specimen, often leaving molecular-based techniques with identifications only to family or genus level. While this level of taxonomic depth seems to suffice for some studies (e.g. large-scale biodiversity assessments), there are many studies that could profit from well documented molecular data (e.g. species inventories, monitoring, taxonomy, phylogeny). Therefore, it would be worth