Development of microsatellite markers and assembly of the plastid genome in Cistanthe longiscapa (Montiaceae) based on low-coverage whole genome sequencing

Cistanthe longiscapa is an endemic annual herb and characteristic element of the Chilean Atacama Desert. Principal threats are the destruction of its seed deposits by human activities and reduced germination rates due to the decreasing occurrence of precipitation events. To enable population genetic and phylogeographic analyses in this species we performed paired-end shotgun sequencing (2x100 bp) of genomic DNA on the Illumina HiSeq platform and identified microsatellite (SSR) loci in the resulting sequences. From 29 million quality-filtered read pairs we obtained 549,174 contigs (average length 614 bp; N50 = 904). Searching for SSRs revealed 10,336 loci with microsatellite motifs. Initially, we designed primers for 96 loci, which were tested for PCR amplification on three C. longiscapa individuals. Successfully amplifying loci were further tested on eight individuals to screen for length variation in the resulting amplicons, and the alleles were exemplarily sequenced to infer the basis for the observed length variation. Finally we arrived at 26 validated SSR loci for population studies in C. longiscapa, which resulted in 146 bi-allelic SSR markers in our test sample of eight individuals. The genomic sequences were also used to assemble the plastid genome of C. longiscapa, which provides an additional set of maternally inherited genetic markers.


Introduction
Microsatellites, a special category of simple sequence repeats (SSRs), are tandemly repeated motifs of one to eight bases that occur in the nuclear genomes of all eukaryotes tested up to now and are, at least for species with larger genomes, quite abundant and evenly dispersed throughout the genome [1,2]. SSR loci are usually characterized by a high degree of length polymorphisms, and are ideal co-dominant markers for population studies. For the analysis of SSRs, specific loci are normally PCR amplified and screened for length differences among individuals. This PCR step constitutes a major drawback of SSR markers, as it involves primers a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 specific for the loci of interest in the species under study. Thus, de novo development of SSR markers is necessary for all newly studied taxa. For the Montiaceae species (former Portulacaceae, see [3][4][5][6]) Cistanthe longiscapa or close relatives up to now no SSR markers are available.
Cistanthe longiscapa or "Pata de Guanaco" is an endemic annual of the Atacama Desert, distributed between 25˚and 31˚S from coastal habitats up to 3800 m elevation in the Andes [7]. It prefers well-drained, sandy soils [8]. After rare winter rainfall C. longiscapa contributes with its purple-pink flowers and massive bloom in isolated patches to the flowering desert phenomenon. Field observations indicate on-going habitat destruction by off-road activities and tourism especially during flowering times. However, the species seems currently not threatened [9,10].
To enable population genetic analyses in C. longiscapa we developed SSR markers for this species. By taking advantage of second-generation sequencing technology that easily provides genome-wide sequence information of a species of interest we could avoid the tedious enrichment and cloning steps of traditional SSR development [11]. Therefore, we performed low-coverage shotgun sequencing of the C. longiscapa whole genome on the Illumina HiSeq platform and identified SSR loci by searching in scaffolds for SSR motifs. For candidate loci PCR primers were designed and afterwards tested by PCR amplification and screening for fragment length polymorphisms among different Cistanthe individuals. This resulted in a set of 26 variable SSR loci. Moreover, the sequences provided enough reads derived from the chloroplast to assemble also the entire plastid genome that provides additional maternally inherited marker loci.

Plant material
Plant tissue (whole individuals, fresh 5-20 g each) of Cistanthe longiscapa (BARNÉOUD) CAROLIN EX HERSHKOVITZ was collected in natural stands of the species in the Atacama Desert (Table 1). As the species is not endangered or protected, no permission was required for its collection at any of the locations sampled in this study; in Chile, no statutory framework governing the collection of unlisted wild plants is currently in force. Plant material was frozen, and stored at -20˚C till DNA extraction.

DNA isolation
Cistanthe longiscapa is characterized by high amounts of mucilage within its succulent leaves. To reduce the concentration of these polysaccharides, total genomic DNA was extracted with

DNA library preparation and sequencing
As the obtained DNA amounts per individual were rather low, we combined DNAs of two C. longiscapa individuals ( Table 1) for construction of the sequencing library. In total 0.166 μg genomic DNA was used as input material for the following steps. Library preparation was carried out as described by Meyer and Kirchner [13]. Briefly, DNA was covarized to generate fragments of on average 300-400 base pair (bp) length followed by adaptor and barcode ligation. The library was size-selected with a SYBR Gold stained electrophoresis gel. Fragment size distribution and DNA concentration were evaluated on an Agilent BioAnalyzer High Sensitivity DNA Chip and using the Qubit DNA Assay Kit in a Qubit 2.0 Flurometer (Life Technologies). Finally the DNA concentration of the library was checked by a quantitative PCR run. Cluster generation on Illumina cBot and paired-end sequencing (2x100 bp) on the Illumina HiSeq 2000 platform followed Illumina's recommendation and included 1% Illumina PhiX library as internal control.

Data filtering and de novo assembly
As only about one fifth of a HiSeq lane was used to generate the C. longiscapa sequences, sequence reads were initially sorted according to their barcodes to separate them from the other materials sequenced in parallel. Before genome assembly the 92 million obtained raw sequencing reads (46 million pairs) were quality checked and over-represented, i.e. clonal reads were detected with FASTQC [14]. Quality trimming (minimum length of 75 bp and Phred score of at least 15) and adaptor sequence removing was done in CUTADAPT V0.11.1 [15].
De novo genome assembly of the 29 million quality-filtered read pairs was performed by CLC v4.3.0 (CLC bio) followed by scaffolding with SSPACE v3.0 [16] to improve the assembly. NCBI BLAST searches were used to check for bacterial contaminations in the sequence reads. Plastid scaffolds were identified using BLAST searches. Scaffolds were mapped against the plastid chromosome of Beta vulgaris (GenBank accession number KR230391) and Haloxylon persicum (KF534479). Gaps were filled using GAPFILLER v1.10 [17]. Proper pairing of reads was checked by mapping the original reads against the obtained C. longiscapa plastid chromosome using BOWTIE2 v2.2.4 [18] and manual examining by visualization using SAMTOOLS v1.2 [19]. Additional Sanger sequencing was performed for the IR/SSU boundaries (see below). Therefore, universal angiosperm primers were designed using different available angiosperm chloroplast genomes in combination with already established primers (compare Table 2). In addition, minor sequence parts of two genes (ycf2 and trnA) that did not pass the quality checks after mapping where confirmed by Sanger sequencing (compare Table 2). Amplification of the plastid regions were performed in a 25 μL reaction volume containing 0.75 U DNA polymerase, (GoTaq, Promega), 1 x buffer, 0.2 mM of each dNTP, 2.5 mM MgCl 2 , 20 pmol of each amplification primer, and about 10 ng of total DNA. The amplifications were performed in a Mastercycler Pro (Eppendorf) with the following PCR protocol: 2 min initial denaturation at 94˚C and 35 cycles of 30 s at 95˚C, 60 s at 52˚C, 60 s at 72˚C, followed by a final extension for 10 min at 72˚C. Amplified products were gel cleaned using spin filter columns (NucleoSpin Gel, Macherey-Nagel) following the manufacturer's protocol and sequenced by GATC Biotech (www.gatc-biotech.com). Annotation was performed using CPGAVAS [20] and edited manually guided by the Lindenbergia philippensis (NC022859) annotation [21]. The map of the plastid chromosome was generated using GENOMEVX [22] Search for SSR loci and primer design Potential SSR markers were detected in the scaffolds of the assembled nuclear genome using the MISA tool [25]. We searched for SSRs with motifs ranging from di-to hexa-nucleotides. The minimum number of repeat units was set as following: ten for di-, eight for tri-, seven for tetra-, six for penta-and five for hexa-nucleotide motifs. Raw reads for the SSR contigs and proper pairing of reads was cross checked aligning the raw reads to the SSR loci using BOWTIE2 v2.2.4 [18] and manual examining by visualization using SAMTOOLS v1.2 [19]. Primer pairs were designed using PRIMER3 [26] with default parameters.

Assessment of SSR polymorphisms
We selected 96 loci (74 tri-, 19 tetra-, and 3 penta-nucleotide repeat motifs) from the nucleus for which primer pairs were synthesized (SigmaAldrich). Our selection criteria were (i) high motif repeat number per locus to increase the chance to find variation, (ii) avoiding repeat motifs that tend to form strong secondary structures (e.g., GC/CG, TAA/ATT) and are therefore hard to amplify and/or score, (iii) long enough and suitable regions flanking the SSR motifs to place PCR primers in, and (iv) low to medium read coverage in the sequences of the selected loci to avoid targeting SSRs within repetitive regions of the genome. Amplification success was tested on three C. longiscapa individuals, of which two individuals came from the same and one from a geographically widely separated population. Polymerase chain reactions (PCR) were carried out in 25 μL reaction volumes containing 1 U Taq DNA polymerase (GoTaq, Promega), 0.2 mM of each dNTP, 1 x buffer, 2.0 mM MgCl 2 and 10 pmol of each amplification primer. The amplification profile was composed as follows: 2 min at 94˚C, 35 cycles of 30 s at 94˚C, 60 s at 55˚C, 60 s at 72˚C, followed by a final extension step of 10 min at 72˚C.
For loci passing this test, products were subsequently cloned using the pGEM-T Easy Kit (Promega), and clones were sequenced at Macrogen (Korea). Lengths and sequences of 49% of the alleles were confirmed through cloning and sequencing. Among the 73 sequenced loci an allelic diversity of 23% was detected. Quality of the sequence reads was assessed using PHYDE [27], and sequences were aligned with the vector and primer sequences being removed. This Table 2. Primers for the validation of the IR boundaries. 'ycf1' = pseudogene.

Region
Code O Primer Sequence enabled us to detect polymorphic SSR markers and to directly infer the allelic state of them, i.e. to see if the SSR motif causes the size difference or if length mutations in the flanking regions are present. Only for loci with length variability fluorescent-labeled forward primers were ordered and fragment length polymorphisms among eight C. longiscapa genotypes from six populations (Table 1) evaluated using the service provided by Macrogen (Korea). Fragment data were analyzed in GENEIOUS R9 [28], the statistical analyses of allelic diversity were performed using GENALEX [29].

Illumina shotgun sequencing
Paired-end sequencing yielded about 46 million reads from either end of the DNA fragments, resulting in a total amount of about 9 Gb. After quality assessment and data filtering, 29 million clean read pairs were classified as high quality reads (63%) and used for further analysis.

Genome assembly
The high-quality reads of lengths between 80 and 100 bp were used for genome assembly. The assembled plastid genome of C. longiscapa contained contigs, which were generally present with very high sequence coverage (500-2000-fold). The length of the genome is 156,824 bp, consisting of 86,715 bp for the large single-copy region, 18,363 bp for the small single-copy region, and twice 25,873 bp of sequences belonging to the two inverted-repeat (IR) regions (compare Table 3). IR boundaries and questionable positions in ycf2 and trnA could be validated by Sanger sequencing as outlined above. Two polymorphic sites were detected in the non-coding parts of the SSC, one in the rpl23-trnL intergenic spacer (IGS), the other one in the ndhA intron. The chromosomal architecture mirrors the typical structure found in angiosperms [30], with the exception of group II intron loss in rpl2. In total we found 79 protein coding genes, 30 tRNAs and 4 rRNAs. The genome annotation is shown in Fig 1, the sequence is available from GenBank through accession number KX928992.

Identification of SSR loci
After MISA analysis, a total of 10,336 potential SSR loci fitting our search criteria were identified in the nuclear genome. Of these, the most frequently detected SSR sequences consisted of di-nucleotide repeats with 80.3% of abundance, tri-nucleotide repeats (13.2%), followed by hexa-(2.4%), tetra-(1.2%), and penta-nucleotide repeats (0.6%). Compound SSRs, i.e. two SSR stretches in close proximity, were found in 2.3% of all cases. Repeat numbers of scored SSR motifs were in a range from five (lower cut-off) to 31. PCR amplification primers for the nucleotide repeat loci were tested through PCR amplifications in three individuals. Of the 35 reliably amplifying loci, 26 showed length variation in the test sets of eight C. longiscapa individuals. Primer sequences together with range parameters and detected allele numbers for these loci are provided in Table 4. In contrast to the abundant nuclear microsatellites, the plastome offered only homonucleotide regions.

Sequencing of the cloned SSR alleles
Cloning and sequencing of the 35 reliably amplified loci confirmed the contigs retrieved from the initial shotgun sequencing. Although the expected SSR motifs were present in all loci, six loci (17%) were excluded from the marker panel due to length mutations adjacent to the SSR motif. For two loci only a small fraction of the sequenced clones were similar to the initial shotgun sequences. They were therefore excluded, as PCR primers seemed not to be locus specific. Finally, one locus was excluded due to an imperfect SSR pattern, leaving 26 loci for the microsatellite test run with the following proportion of nucleotide repeats: 61.5% tri-, 26.9% tetra-, and 7.7% penta-nucleotide repeats plus 3.8% compound (di-plus tetra-nucleotide repeat) motifs. GenBank sequence accession numbers are given in Table 4. According to the PCR amplicon size range of the selected microsatellite loci we propose a multiplexing strategy based on five pools using 6FAM/VIC/NED/PET as standard dyes in fluorescent detection, as indicated in Table 4.

Discussion
To arrive at a set of SSR markers for population studies in C. longiscapa we used sequences derived from a shotgun sequencing approach of genomic DNA. We did not enrich the sequencing library for SSR motifs (e.g., [31]) prior to sequencing, as the high amount of SSR loci in plant genomes should result, even in untreated genomic DNA, in a sufficient number of useful regions to successfully develop SSR markers. Moreover, this speeds up the development procedure, and the comparatively low costs for next-generation sequencing easily compensates the reduced ratio of SSR loci vs. non-SSR regions within the resulting sequences. We here used mainly loci derived from scaffolds with 3x-10x sequence coverage. We avoided including SSR loci derived from contigs and scaffolds with very high sequencing coverage (>15x), as they likely stem from repetitive parts of the genome. They bear the risk to include SSR loci that occur as paralogs with multiple copies within a genome. Accordingly, the final set of our SSR loci did not produce fragment patterns indicative of multilocus data, i.e. amplicons resulting from the presence of two or more paralogous loci detected by a single PCR primer pair. The DNA sequence of scaffolds with such low sequencing coverage might include some uncertain sequence positions, which could result in a lower number of reliably amplifying SSRs during the test of the 96 initially chosen loci due to primer mismatches. However, the initial PCR step in our development procedure selects against such loci, as unreliably amplifying loci were not included in the next steps of locus evaluation. For us, initial omission of potentially multicopy loci, which might produce analysis problems in a later stage of the project, seems preferable over retaining a higher number of SSR loci throughout the early stages of SSR development.
The interpretation of microsatellite data faces different kinds of errors [32]. These are due to (i) parallel mutations in the SSR motif in different individuals resulting independently in the same fragment size, (ii) length mutations in the flanking region of the SSR motif that influence fragment lengths of some individuals but are thought to stem from repeat number differences of the SSR, and (iii) base substitutions creating new alleles without changing fragment lengths. While it is not possible to detect errors of the first class, sequencing of the allelic diversity found in a species can possibly reduce problems derived from the other two. Thus, we sequenced SSR alleles during the validation of the SSR loci and excluded about one quarter of the reliably amplifying loci (9 out of 35) from our marker set. This can, of course, not prevent that, with additional individuals included, some of these errors might occur. It helped, however, that obviously problematic loci were excluded from further analyses and should provide an overall better set of SSR markers. Although 26 SSR loci (27% of the initially selected 96 loci) were validated here, we explored only a very small fraction of potentially variable SSR sites in the genome of C. longiscapa. Whole genome shotgun sequencing, even when performed with rather low genomic coverage (here using~20% of a HiSeq lane), detected more than 10,000 potentially useful SSR loci. With less stringent selection conditions (looking also for lower repeat numbers within SSR motifs) even more than 95,000 SSR loci were reported [33]. Thus, by using next-generation sequencing for the detection of SSR loci it now is possible to scale the amount of available SSR markers in a wide range, depending on the questions at hand. For population genetic analyses in C. longiscapa, 26 loci seem to provide a good resource to infer differentiation among populations. In case of genomic mapping studies a much higher amount of variable marker would be necessary to densely cover a genome but could easily be derived from the initial set of thousands of SSR motifs found in the genome of the species. The now much faster, easier and cheaper procedure of SSR development in comparison to traditional sequence enrichment and cloning approaches [11] allows to easily design variable genetic markers for nearly every species of interest. De novo development of SSR loci should also be superior to marker transfer from closely related species, as this often result in the use of suboptimal, i.e. less variable loci in comparison to specifically designed SSRs [34]. Genomic shotgun sequencing resulted, in addition to the nuclear SSR loci, in the completely assembled plastid genome of C. longiscapa (Fig 1). This genome contains 86 regions of mononucleotide repeats (10-23 A/T repeats), representing the simplest class of microsatellite motifs. As we used two individuals of C. longiscapa to construct the sequencing library, we found some positions in the plastid sequence, which were inferred to already represent polymorphic characters like in the rpl23-trnL IGS or the ndhA intron. These parts of the genome can be used to evaluate maternally inherited genomic diversity for phylogeographic studies (e.g., [35]) in the species. In our case, both individuals used for sequencing were derived from the same population. Using instead individuals from geographically distant parts of the distribution area might even increase the number of detectable polymorphic sites within the plastid genome.