Ultra-high-throughput DArTseq-based silicoDArT and SNP markers for genomic studies in macadamia

Macadamia (Macadamia integrifolia, M. tetraphylla and hybrids) is an Australian native nut crop and has a significant economic value in the food industries worldwide. Long juvenility along with traditional breeding strategies impede quick genetic improvement of this crop. The existing cultivars constitute only second to fourth generation of the wild germplasm in the rainforest. The utilisation of molecular markers for genomic selection and genome-wide association studies may accelerate genetic gains. Identification of a robust, reproducible, and cost-effective marker system is instrumental in increasing the efficiency of genomic studies. This study is the first to report the potential of two ultra-high-throughput diversity array technology (DArT) markers (silicoDArT and SNP) in macadamia. Both markers were used to identify the genetic diversity and population structure in 80 macadamia cultivars. Parentage analysis of 25 scions in a rootstock trial was conducted to confirm plant identity where recorded identities did not corroborate with phenotypic field observations. A total of 22,280 silicoDArT and 7,332 SNP markers were reported, of which 11,526 silicoDArT and 3,956 SNP markers were used for analyses after screening with quality control parameters including >95% call rate, >95% reproducibility, and >0.05 one ratio. The average polymorphic information content (PIC) values of silicoDArT and SNP markers were 0.29 and 0.21, respectively. Genetic variance among the cultivars ranged from 0.003 to 0.738 in silicoDArT and 0.004 to 0.412 in SNP markers. Four distinct population groups were identified from SNP data analysis. Most of the accessions used in this study were descended from two or more populations. Cluster analysis clearly separated genotypes of distinct origins, such as the Hawaii Agricultural Experiment Station and Hidden Valley Plantation accessions. Two wild accessions of Macadamia jansenii and M. ternifolia were found to be distantly related to the cultivars. Wild germplasm individuals and their hybrids with cv. ‘660’ formed separate clusters, suggesting that crossing between wild and cultivated genepools can extend genetic diversity. DArTseq-based SNP markers were successfully utilized to confirm the genetic identity of 25 scions in a rootstock trial. Our study suggests that DArT platforms are a robust system for the facilitation of genomic studies with regard to macadamia.


Introduction
Macadamia is a cross-pollinated tree and constitutes a nut crop industry with a high value kernel [1,2]. Its unique flavours, multipurpose uses, and long shelf life have increased the demand for macadamia nuts worldwide. The genetic improvement of Australian macadamia cultivars is still at an early stage. Though the crop was first domesticated one and a half centuries ago [1,3,4], existing cultivars are only second to fourth generations from their wild progenitors [4,5]. Until recently, worldwide macadamia improvement programs were mostly dependent on pedigree analysis and phenotypic characterization [6][7][8][9], subjected to the inaccuracy involved in the selection of elite accessions due to the effects of environment and genotype-environment interactions. To reduce inaccuracies in selection procedure and to accelerate breeding efficiency, genomic techniques may be used as a tool of macadamia breeding program [10].
Molecular markers have been utilized in several crop species to ascertain the genetic diversity in the gene pool [11], identify QTLs and candidate genes conferring valuable traits [12,13], authenticate plant identity and the parentage of hybrids [14], generate data for gene expression profiling [15], and predict the genetic potentiality/performance of individual cultivars [16]. Over the last few decades, several marker technologies have been developed for macadamia, primarily for the study of genetic diversity. Some of the important markers used in macadamia are isozyme [17], randomly amplified DNA fingerprinting (RAF) [18,19], random amplified polymorphic DNA (RAPD) [20,21], amplified fragment length polymorphism (AFLP) [22,23], randomly amplified microsatellite fingerprinting (RAMiFi) [24] and microsatellite [25][26][27] markers. Limitations associated with these marker systems include low marker density, poor genome coverage, and less cost-effectiveness. In this perspective, sequence-based single nucleotide polymorphism (SNP) markers developed through automated sequencers can constitute an important choice for molecular studies due to their wide and uniform genome coverage with high-throughput and cost-effectiveness [28]. There is an increasing demand for the development of ultra-high-throughput low cost assays to facilitate the genotyping of individuals with the use of a large number of high-density markers that cover the entire genome. High-density array-based oligonucleotide markers such as single feature polymorphisms (SFPs), restriction site-associated DNA (RAD), and diversity array technology (DArT) provide a way to achieve this goal of development of cost-efficient ultra-highthroughput genotyping systems.
In 2001 [29], Diversity Array Technology Pty Ltd (DArT, Canberra, ACT, Australia) developed cost-effective sequence-independent ultra-high-throughput marker systems. DArT develops markers through a microarray hybridization method and can produce thousands of polymorphic loci in a single assay. This marker platform has been used for whole-genome scanning of a range of crop species [30][31][32][33][34][35][36]. Over the last decade, DArT has generated two types of markers: i) silicoDArT and ii) SNP markers. SilicoDArT markers are microarray markers that are dominant and scored for the presence or absence of a single allele. DArTseqbased SNPs are co-dominant markers. Both types of markers have been successfully applied in several crop species for genetic diversity [37][38][39][40][41], genetic mapping [36,[42][43][44], and population structure [45,46] studies. The present study is the first to utilize the DArT platforms in macadamia; we present the development of DArT marker platforms, and compare and analyse the usefulness of silicoDArT and SNP markers for genomic studies. DArT markers were applied to investigate the genetic diversity in cultivated and wild germplasm. DArT markers were also tested for utilization in plant identification.
Technology Pty Ltd. provided support in the form of salaries for author AK, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of this author is articulated in the 'author contributions' section.
Competing interests: Andrzej Kilian is employed by Diversity Arrays Technology Pty Ltd which provides service using DArTseq technology so he can nominally benefit from the publication, but this fact had no effect on the results and conclusions of this paper. There are no patents, products in development or marketed products to declare. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Plant materials
Macadamia cultivars of multiple origins including from Hawaii Agricultural Experiment Station (HAES), Hidden Valley Plantation (HVP), Israeli and Australian selections, accessions from wild germplasm, and progeny from the Australian industry macadamia breeding program (Australian elite selections, dwarves, breeding progeny and hybrids of wild germplasm) were employed for genetic diversity and population structure analysis (Table 1). Leaf samples from seven cultivars were collected from each of two different plants to use as biological replicates. All 21 cultivars were genotyped twice to use as technical replicates. Twenty-five scions from a rootstock trial at Bundaberg, Queensland, were also tested for plant identity to solve a mismatch with phenotypic characteristics observed in the field.

DNA extraction
Samples obtained from newly flushed young leaves were collected from arboretums and progeny trials of the Australian industry macadamia breeding program [47] in 2013 and 2014 growing seasons. Leaf samples were sealed in a zip-lock bag labelled with the corresponding tree barcode, kept cool in insulated bags with freezer blocks and sent to DArT (next day) for DNA extraction. Total genomic DNA was extracted by adhering to the modified CTAB protocol [48,49] as described by Kilian et al., (2012) [50]. The quality and quantity of the DNA samples were evaluated through a spectrophotometric analysis in DS-11FX series spectrophotometer/fluorometer (Denovix, Wilmington, DE, USA), followed by running agarose gel electrophoresis (1.2% agarose). The DNA concentration was adjusted within the range of 50-100 ηg μl -1 .

Genotyping macadamia accessions using silicoDArT and SNP markers
The high-throughput DArTseq technology was used to genotype macadamia cultivars. In this technology, the PstI-based complexity reduction method [39] was applied for the enrichment of genomic representation with single copy sequences. This method involved the digestion of DNA samples with a rare cutting enzyme PstI, paired with a set of secondary frequently cutting restriction endonucleases (RE), ligation with site-specific adapters, and amplification of adapter-ligated fragments. The secondary frequently cutting RE enzymes employed in this study were AluI, ApoI, BanII, Bsp1286I, BstNI, HaeIII, MseI, RsaI, MspI, HpaII, MseI, TaqI, and HhaI. Post digestion with a PstI-RE pair, a PstI overhang compatible oligocleotide adapter (5 0 -CAC GAT GGA TCC AGT GCA-3 0 annealed with 5 0 -CTG GAT CCA TCG TGC A-3 0 ) was ligated, and the adapter-ligated fragments were amplified in adherence to the prescribed standard procedures [39]. To develop SNP and silicoDArT markers, the DArTseq technology was optimized using two PstI-compatible adapters corresponding to two different RE overhangs. The genomic representations were generated following the procedures described by Kilian et al. [50] and PstI+HhaI was selected as the most appropriate complexity reduction method. Next-generation sequencing technology was implemented using HiSeq2000 (Illumina, USA) to detect SNPs and silicoDArT markers. The sequence data was analyzed using DarTsoft14, an automated genotypic data analysis program and DArTdb, a laboratory management system. Markers were scored '1' for presence, and '0' for absence and '-' for failure to score. Two technical replicates of the DNA samples of each of 21 cultivars were genotyped to calculate the reproducibility of the marker data.

Quality analysis of marker data
The markers were tested for reproducibility (%), call rate (%), polymorphism information content (PIC) and one ratio. Scoring of reproducibility involved the proportion of technical replicate assay pairs for which the marker score exhibited consistency. The call rate determined the success of reading the marker sequence across the samples and was estimated from the percentage of samples for which the score was either '0' or '1'. PIC is the degree of diversity of the marker in the population and showed the usefulness of the marker for linkage analysis. One ratio constitutes the proportion of the samples for which genotype scores equalled '1'.

Genetic dissimilarity analysis
Genetic dissimilarity matrices were constructed in DARwin v. 6.0.13 [51] to identify the genetic relationships among the cultivars. Weighted neighbour-joining dendrograms were constructed in both marker platforms. Clade strength in the dendrograms was tested by 20,000 bootstrap analyses. Biological and technical replicates of 28 accessions were compared to identify the reliability of the markers using the dissimilarity index between the replicates of each accession.

Population structure and genetic diversity analysis
The genetic structure of the germplasm was analyzed using STRUCTURE v.2.3.4 [52] and GenAlEx v.6.5 [53]. The number of hypothetical subpopulations (K) was estimated with the STRUCTURE software through the application of a Bayesian clustering approach for the organisation of genetically similar cultivars into the same subgroups. Ten individual Markov Chain Monte Carlo (MCMC) simulations were conducted for each K-value from 1 to 10 with a burnin length of 50,000, followed by 100,000 iterations. The admixture model was applied without using any prior population information. The log-likelihood of the observed data for each Kvalue was calculated and compared across the range of K values. The best K-value was estimated based on the membership coefficient (Q) for each individual in each cluster. The Q values indicate the level of relatedness of each accession to various subgroups. The STRUCTURE results were subsequently analyzed by the STRUCTURE HARVESTER application [54] (http://taylor0.biology.ucla.edu/structureHarvester/) to identify a distinct peak in the change of likelihood (ΔK) at the true value of K. GenAlEx was used to perform Principal Coordinates Analysis (PCoA), based on the standardized covariance of genetic distances calculated for the markers under evaluation, using 999 permutations. PCoA explains the genetic distances among the accessions.

Determining plant identity
In a rootstock trial at Bundaberg we identified that the phenotypic characteristics of 25 scions did not corroborate with their recorded variety. Using SNP markers we conducted PCoA and parentage analysis of all the mismatched scions and the results were compared with phenotypic variety allocations.
To determine plant identity, initially allelic variations were observed between the possible varieties and the mismatched cultivars using PCoA in GenAlEx. Later, the analysis of plant identity was performed with Cervus v.3.0.7 [55] to detect and confirm the candidate plant ID. In cases where the plant ID remained indeterminate, parentage analysis was conducted, allowing 40 simulated parents and 10,000 simulated offspring. The natural logarithm of the likelihood ratio, logarithm of odds (LOD), scores were provided for each candidate parent and the offspring-parents trio along with the confidence of these scores at strict (95%) and relaxed (80%) levels. Based on the analysis, the parent with the highest LOD is denoted as the putative parent. A positive LOD score indicates that the parent is more likely to be the true parent in comparison to one drawn at random from the population; negative LOD scores indicate that the parent is less likely to be the parent than one drawn at random from the population [55].

Marker quality analysis
Through the application of the complexity reduction method, a total of 22,280 polymorphic silicoDArT (S1 Table) markers were generated, of which 353 were aligned with the marker sequence obtained from bacteria (NCBI) and 154 from expressed sequenced tags (EST) of several plant species. We report these DArT markers for the first time, though the chromosomal location is yet to be described. Most of the markers (22,237) showed !95% reproducibility. All the identified silicoDArT markers had a call rate value !75% (Fig 1) with an average value of 97.44 (S1 Table). However, low frequency markers can affect the statistical analysis [56]. As such, 10,711 markers with extremely low one ratio (<0.05) were not considered in the analysis. In total, 11,526 silicoDArT markers cleared all the quality parameters and were selected for the study. Among the 11,526 informative markers, around 21% were observed in PIC class 0.45 to 0.50 and 13% in 0.05 to 0.10 class (Fig 2). PIC values of the remaining markers were distributed almost equally (8-10%) across the rest of the marker groups. Therefore, the median (0.28) was located extremely close to the average PIC value of 0.29 (S3 Table), and the data exhibited approximately equitable distribution on either side of the median.
A total of 7,332 SNP markers (S2 Table) were identified, and had an average of 99% reproducibility and 76% call rate. Around 98% SNP markers had !95% reproducibility, of which 6,375 were found to be 100% reproducible (Fig 1). The call rate exhibited variance ranging from 38% to 100%. Around 44% of SNP markers displayed a <75% call rate (Fig 1), and were therefore not considered for this study. For the remaining markers, 2,650 showed a >90% call rate. All the identified markers had >0.5 average one ratio. Considering all the quality parameters, 3,956 SNP markers were used for subsequent analysis. These markers were determined to be highly informative with an average PIC value of 0.21, and 0.19 median (S3 Table). Around 14% of markers were in the lowest PIC value range (0-0.05) and 9% in the highest PIC value range (0.45 to 0.50) (Fig 2). The remaining PIC value groups exhibited an approximately similar marker frequency value ranging from 10 to 12% each.

Population structure and genetic diversity analysis
The model-based Bayesian cluster analysis in STRUCTURE visualized the genetic structure of the population under examination (Fig 4). The values of ΔK, which were estimated from SNP markers peaked at K = 4 ( Fig 4A); hence, four distinct groups were found to contribute significant genetic information across cultivars. The sub-populations were denoted as POP1, POP2, POP3 and POP4, and each sub-population contained 41%, 7%, 23%, and 29% of total accessions, respectively ( Table 2).
The genetic diversity within each population was explained through the estimation of the expected heterozygosity, which varied from 0.12 (POP3) to 0.22 (POP4). The expected heterozygosity of POP1 was 0.19 and that of POP2 was 0.14. The genetic divergence among the populations revealed by Nei's net nucleotide distance (D) indicated that POP2 was widely related to POP1 (D = 0.19), POP3 (D = 0.18), and POP4 (D = 0.18), respectively. The genetic distance observed between POP1 and POP4 (D = 0.05) was the least among the pairs of populations examined ( Table 2).
The proportion of membership of individual cultivars in each population is illustrated in the bar plot of the population assignment test in structure analysis (Fig 4B). Conversely, all HVP accessions were constituted partly of POP4. Two HAES cultivars '791' and '762' also shared large amounts of genetic information with POP4. Cultivar '791' was the only individual allocated to all four populations, and possessed 33% genetic information from POP1, 13% from POP2, 9% from POP3, and 46% from POP4.
PCoA illustrated the genetic divergence among the cultivars (Fig 5). In silicoDArT and SNP markers, the first two axes of the PCoA explained 25.64% and 37.61% of the total genetic divergence, respectively. The population distribution determined by both markers is consistent with the output of population structure analysis (Fig 4B). HAES cultivars were located in the top two quadrants, while HVP and Australian selections were mainly located in the bottom quadrants. Australian breeding lines displayed wide diversity, as they were mostly distributed throughout the three quadrants of the PCoA. Only a few accessions were in the bottom right quadrant of PCoA, which was predominantly occupied by wild germplasm.

DArT platforms provide cost-effective ultra-high-throughput reliable markers
Our study highlights the suitability of DArT platforms that can be applied for the genomic dissection of macadamia cultivars. A total of 22,237 silicoDArT markers were developed, of which 11,526 markers provided robust information of the macadamia genome in the absence of sequence information. On the other hand, DArTseq SNPs provided 3,956 informative markers. The development of a large number of these high-throughput markers provided an opportunity to anchor the markers on a recently released macadamia reference genome, which is yet to be completed [57].
The quality parameter of both silicoDArT and SNP markers in macadamia were comparable with that of other species. The average PIC values of both markers in macadamia was similar to values identified in DArT markers of sugar beet (0.28) [58] and Asplenium fern (0.20) [59], but lower than that of sorghum (0.41) [36], cassava (0.42) [60], and wheat (0.44) [61], and higher than SNP markers in watermelon (0.13) [62] and Lesquerella (0.12) [63]. The average PIC values of silicoDArT was greater than that of SNP markers. Around 30% of silicoDArT and 20% of SNP markers showed PIC values within the range of 0.40 to 0.50. The silicoDArT markers were therefore more informative than SNP markers. The abundance of silicoDArT and SNP markers may achieve better genome coverage through the sampling of a greater number of points in the whole genome, as marker density has a high correlation with gene density [50,64]. Previous studies in macadamia using other platforms utilized only a small number of molecular markers, for example, O'Connor et al. (2015) used only 11 microsatellite markers [25]. Therefore, both silicoDArT and SNP markers may better suit for genetic diversity studies, association/linkage mapping and/or sequence based physical mapping in macadamia. Additionally, the co-dominant inheritance pattern of SNP markers may increase the utility of DArT platforms for genetic identity and parentage analysis.
In comparison with the other existing marker technologies like microsatellite markers, DArT markers are pertinent to high-throughput work and have merits in terms of cost effectiveness and time aspect [65]. The cost of marker development of silicoDArT and SNP markers are the same. Due to the higher number of markers produced in silicoDArT, the average cost per data point is less than SNP markers. However, the effectiveness of both platforms varies depending on the type of application. For genetic diversity and linkage mapping large number of silicoDArT markers are suitable. However, for genetic identity and product quality testing, both markers can perform equally. Due to the opportunity to track alleles from parental genotypes, the co-dominant SNP markers are more suitable in plant identity and parentage analysis than silicoDArT.

DArT markers successfully evidenced the historical background and pedigree relationships of the cultivars
The genetic diversity analysis illustrated the historical development of macadamia cultivars. Macadamia cultivars have been developed by few breeding programs. The macadamia breeding program in Hawaii began in approximately 1948 and continued to 1990, producing many HAES cultivars which are still dominant in plantings around the world [66]. Australian varieties including 'Beaumont', 'D4', and 'Daddow' are the product of 1952-onwards seedling surveys in Australia, after observation of the Hawaiian industry success. Hidden Valley Plantation (HVP), which commenced in 1972 in Beerwah, Queensland, Australia released successful 'A' series cultivars from open-pollinated progeny of HAES and Australian varieties. Genetic diversity analysis clearly separated most of the accessions of these three groups. Accessions of Australian industry breeding efforts were selected from the crossing program initiated in 1992 and included HAES, HVP and Australian cultivars in the parentage, hence are scattered across all clusters.
The statistical analysis of DArT data sets showed highly consistent results obtained for genetic diversity, population structure and PCoA. The pedigree relationships obtained through genetic diversity and population structure analysis are in line with the results of previous studies [17,18,23]. Two HAES cultivars '660' and '741', which were selected from same orchard at Glaisyer (at Hawaii) showed strong genetic similarity, as also observed by Peace et al. [18] and Steiger et al. [23]. Cultivar '344' was grouped with '741' and '660', which is similar to previous observations made with RAF [18], isozyme [17], RAMiFi [24], and AFLP [23] markers. Population structure analysis revealed that cultivar '791' constituted three or four distinct populations, which was supported by the observation of Peace et al. [18]. It was identified that '791' is a tri-species cultivar containing M. integrifolia, M. tetraphylla and M. ternifolia in its ancestry. Our results additionally showed that '791' may also have M. jansenii in its genetic background. We identified that '791' shares almost equal amounts of M. integrifolia and M. tetraphylla. In contrast, Peace et al. [18] previously reported 10% M. tetraphylla, 55% M. integrifolia and 35% M. ternifolia in the ancestry of '791'. Interestingly, population structure analysis identified that two wild species M. jansenii and M. ternifolia represented the same population ( Fig 4A). Although no historical relationships among these two wild species were evidenced, there are some clear morphological similarities [1]. For example, both M. jansenii and M. ternifolia are relatively dwarf accessions, produce pink red leaf flushes and flowers, and have bitter nuts. Previous isozyme analysis showed that M. jansenii and M. ternifolia are closely related [67], as observed in RAF [68] and STMS (sequence-tagged microsatellite site) studies [3], whilst Waldron et al. [68] and Peace et al. [3] suggested that M. ternifolia and M. jansenii were of sister species. An investigation on large numbers of wild accessions would be instrumental to explore differentiation among the four macadamia species. Identification of markers showing the signature of domestication could be the focus of future studies.

DArTseq based SNP markers are useful tools for plant identity confirmation
The use of molecular markers for plant identity and parentage analysis has increased over the last two decades [69]. Although microsatellite markers were used previously in parentage analysis [25], this study was the first to use DArT markers in parentage analysis in macadamia. Using SNP markers, our study successfully predicted the plant identity and probable mother identities of plants in a rootstock trial. Most of the predictions validated field observations (S7 Table). DArTseq-based SNP markers are therefore useful to confirm plant identity through parentage analysis of macadamia progeny.

Conclusions
Our study identified that both DArT platforms are of high quality, supported by the quality parameters and close relationships of the replicates of each cultivar in the neighbourhood join dendrograms. Both DArT markers successfully reflected the parental relationships and the extent of diversity in the population. Structure analysis clearly separated the population of various origins. DArTseq-based SNPs successfully demonstrated population structure across cultivars and confirmed plant identity in a mismatched rootstock trial. We therefore suggest that both silicoDArT and SNP markers are robust and an inexpensive option for breeders for genomics studies in macadamia. Both DArT platforms developed a large number of highly polymorphic markers, and hence are useful for linkage mapping in macadamia.
Supporting information S1