Leaf Transcriptome Sequencing for Identifying Genic-SSR Markers and SNP Heterozygosity in Crossbred Mango Variety ‘Amrapali’ (Mangifera indica L.)

Mango (Mangifera indica L.) is called “king of fruits” due to its sweetness, richness of taste, diversity, large production volume and a variety of end usage. Despite its huge economic importance genomic resources in mango are scarce and genetics of useful horticultural traits are poorly understood. Here we generated deep coverage leaf RNA sequence data for mango parental varieties ‘Neelam’, ‘Dashehari’ and their hybrid ‘Amrapali’ using next generation sequencing technologies. De-novo sequence assembly generated 27,528, 20,771 and 35,182 transcripts for the three genotypes, respectively. The transcripts were further assembled into a non-redundant set of 70,057 unigenes that were used for SSR and SNP identification and annotation. Total 5,465 SSR loci were identified in 4,912 unigenes with 288 type I SSR (n ≥ 20 bp). One hundred type I SSR markers were randomly selected of which 43 yielded PCR amplicons of expected size in the first round of validation and were designated as validated genic-SSR markers. Further, 22,306 SNPs were identified by aligning high quality sequence reads of the three mango varieties to the reference unigene set, revealing significantly enhanced SNP heterozygosity in the hybrid Amrapali. The present study on leaf RNA sequencing of mango varieties and their hybrid provides useful genomic resource for genetic improvement of mango.


Introduction
Mango (Mangifera indica L.) is an evergreen dicotyledonous angiosperm. Although several tetraploid Mangifera species are reported, cultivated mango is a diploid tree (2n = 40) with a relatively small genome size of 439 Mbp [1,2]. The genus Mangifera belongs to order Sapindales of the family Anacardiaceae [1]. The major mango producing countries are India, China, Thailand, Indonesia and Pakistan. India is the largest producer of mango in the world,

RNA isolation and library preparation for sequencing
Total RNA was extracted from the leaves using Purelink miRNA isolation kit according to manufacturer's instructions (Invitrogen). Quantification of RNA was done using NanoDrop spectrophotometer and quality check for DNA contamination was done by electrophoresis in 1% denaturing agarose gel. To assess RNA integrity the samples were run on RNA 6000 Pico chip Bioanalyzer (Agilent). The transcriptome library was prepared after quantification and quality check of the Poly(A) RNA using SOLiD total RNA Seq kit for 'Neelam' and 'Dashehari' and Illumina MiSeq for ' Amrapali' , respectively.

De novo transcript assembly
High quality sequence reads of cDNA libraries were used to generate transcript shotgun assembly (TSA) contigs. Transcripts for 'Neelam' and 'Dashehari' were assembled via de novo assembly approach using Velvet-Oasis software, which have been developed for short read assembly of transcriptome data and are based on de-Brujin graph algorithm [33,34] Longer sequence reads of ' Amrapali' were assembled using CLCGenomics Workbench (version 6.5.1) [35]. The assembled TSA contigs of the parents and hybrid were further merged in to a non-redundant set of unigene contigs using CAP3 [36] with default parameters (overlap length cutoff = 40bp and overlap percent identity cutoff = 90%). This unigene set was used for mining of SSRs and for a three way alignment of sequence reads for SNP identification among the three mango genotypes.

SSR detection and primer designing
The final assembled unigene contigs were used for mining of genomic-SSRs using MISA [37] and SSR specific primers were designed using BatchPrimer3 V1.0 [38]. The SSR loci containing repeat units of 2-6 nucleotides only were considered. The criteria for minimum SSR length were defined as 6 reiterations for di-nucleotide SSR and 5 reiterations for tri-, tetra-, penta-and hexanucleotide SSRs, mono-nucleotide repeats and complex SSR were excluded [39]. The parameters for designing of primers from SSR flanking sequence were: primer lengths = 20-25 bp; PCR product size = 100-250 bp; annealing temperature = 65°C; GC content = 40-60% with an optimum of 50%; only single consecutive bases of Gs and Cs at the 3' end of both primers were specified. Remaining parameters were kept at the default setting for BatchPrimer3 V1.0.

SSR marker validation
Genomic DNA was isolated from leaf samples of eight genotypes using CTAB method [40], quantified by UV260 absorbance and adjusted to a final concentration of 30 ng/μl. A set of 100 genic-SSR markers with SSR lengths of 20 bp or above (Type I SSR) were tested for amplification using genomic DNA of ' Amrapali' for optimization of the annealing temperature. The PCR reactions were performed in a BioRad Thermal Cycler. Each PCR reaction consisted of 1.5 μl of 10X reaction buffer, 0.20 μl of 10 m MdNTPs (133 μM), 1.5 μl each of forward and reverse primers (10 pMol), and 2.5 μl of template genomic DNA (75 ng), 0.15 μl of Taq DNA polymerase (0.75 U) in a final reaction volume of 15 μl. The PCR reaction profile was denaturation at 94°C for 5 min followed by 35 cycles of 94°C for 1 min, 55°C for 1 min, 72°C for 1 min and finally, 72°C for a final extension of 7 min. Re-screening of primers that did not amplify at these conditions was done by decreasing the annealing temperature sequentially by 1°C, and for the primers producing multiple bands, by increasing the annealing temperature by 1°C [21]. The optimized SSR primers were then used for PCR amplification of multiple varieties of mango. The PCR products were separated by electrophoresis in 4% Metaphor agarose gels (Lonza, Rockland ME USA) containing 0.1 μg/ml ethidium bromide in 1X TBE buffer at 130 V for 4 h. After electrophoresis, PCR products were visualized and photographed using a gel documentation system Fluorchem™ 5500 (Alfa Innotech Crop., USA). The SSR profiles were scored manually, each allele was scored as present (1) or absent (0) for each of the SSR loci. The SSR markers giving consistent expected size products only (100-250 bp) were used for further analysis of variation.

Annotation and functional classification of the unigene TSA contigs
For the annotation of TSA unigenes, BLAST algorithm [41] was used to search for similarity against a locally configured non-redundant (nr) protein database of NCBI (as on 06 December, 2015) using BLASTX program [42] with cutoff E-value of 1e-6. The BLASTX result was saved in the.xml format and was imported into Blast2GO software [43] to assign Gene Ontology (GO) terms to the annotated unigene. Blast2GO classified unigenes under three GO terms called cellular component, biological process and molecular function. The GO annotated unigenes with GO terms were exported from Blast2GO in WEGO native format, and online tool WEGO (Web Gene Ontology Annotation Plot) [44] was used for the categorization of annotated unigenes in to three GO categories. For categorization of unigenes into 58 transcription factor families, unigenes were searched against the downloaded protein sequences of Plant Transcription Factor database version 3.0 (PlantTFDB 3.0) [45] using BLASTX with E-value cutoff of 1e-6. A COG classification was also performed using the same BLASTX search parameters against NCBI COG databases [46].

Pathway mapping using KAAS
KEGG Automatic Annotation Server (KAAS) [47] was used for Pathway mapping and gene ortholog assignment of the unigenes. The KAAS gives functional annotation of genes by sequence similarity comparison against the manually curated KEGG GENES database [48]. Based on the similarity hits in the KEGG database using BLASTX (default threshold bit-score value of 60), unigenes were assigned with the unique enzyme commission (EC) numbers, and further mapped to the KEGG biochemical pathways.

Multiple sequence alignment and SNP identification
The unigene set was used as reference for mapping of high quality filtered sequence reads from all the three varieties using BWA with default parameters [49]. To track and identify the variety specific reads mapping at a particular location we added the variety name at the end of header of each high quality reads using shell script. SAMtools software was used for conversion of aligned SAM file to BAM file and read sorting [50]. The SNPs were called using software VarScan version 2.7 [51] at highly stringent parameters: 1) minimum 10 reads mapped at each SNP position; 2) average base quality of !25; 3) minimum two reads for any SNP base call in each variety. Total mapped reads information including read name, base call with respect to reference position for all identified SNPs were fetched from the duplicate read removed BAM file using shell scripts and the final results were filtered and tabulated.

Results and Discussion
Functional categories of genes expressed in the mango leaves A total of 60,359,815, 58,212,961 and 4,853,226 raw sequence reads were generated for 'Neelam' (mango_N), 'Dashehari' (mango_D) and ' Amrapali' (mango_A), respectively using two runs of SOLiD sequencing with average read length of 50 bp (mango_N and mango_D) and one run of Illumina Miseq 2x250 (mango_A) with average read length of 250 bp. After quality check, adapter trimming and removal of low quality reads, 53,617,132, 47,818,267 and 4,313,270 high quality reads were retained for 'Neelam' , 'Dashehari' and ' Amrapali' , respectively. We performed three separate de-novo assemblies using Velvet-Oases assembly pipeline and CLC Genomic workbench 6.5.1 for the SOLiD and Miseq data, respectively. The assembly resulted in 27,528, 20,771 and 35,182 transcripts for mango_N, mango_D and mango_A with transcript N50 values of 557 bp, 451 bp and 591 bp and largest transcript size of 3,129 bp, 2,958 bp and 5,891 bp, respectively (Table 1).
TSA contigs of individual genotypes were further assembled into 70,057 non-redundant unigenes set, which was used as reference for the identification of genic-SSR and SNP (Fig 1). The mean size of earlier reported 85,651 unigene contigs is 415 bp for the leaf transcriptome of mango variety 'Langra' with mean length of 238 bp [52], which is much lower than the present result. Further, sequencing of pooled transcriptome from pericarp and pulp of mango variety 'Zill' has resulted in 124,002 transcripts with average size of 838 bp [53]. Transcriptome from mango variety 'Shelly' generated 57,544 transcripts with an average length of 863 bp [54], and mesocarp transcriptome of mango variety 'Kent' is reported with 80,969 transcripts having mean length of 836 bp and N50 of 1,456 bp [55], which is significantly larger than our results as their transcriptome was sequenced using only Illumina platform which produces longer read length as compare to SOLiD sequenced reads.
Raw sequenced data described in this paper can be found in the Sequence Read Archive (SRA) database of the NCBI with SRA Accession number SRR1298995, SRR1297075, SRR1956775 under BioProject Number PRJNA193591, PRJNA193588, PRJNA279829 with TSA Accession number GBVX00000000, GBVW00000000, GEEEC00000000 for 'Dashehari' , 'Neelam' and ' Amrapali' , respectively. For the functional annotation 70,057 unigene contigs were searched in the NCBI-nr protein database using BLASTx. As a result 39,798 (56.80%) of the unigenes showed significant similarity to known proteins and were functionally annotated while 30,259 (43.2%) unigenes showed no significant hits. Mango unigenes showed the highest similarity with Citrus sinensis (28.76%), followed by Citrus clementina (17.43%), Theobroma cacao (6.8%), Jatropha curcas (4.44%) and Vitis vinifera (4.19%) (S1 Fig). This result is consistent with the phylogenetic study of mango chloroplast DNA which reported Citrus to be most closely related to Mangifera indica [53].
Gene ontology (GO) terms were assigned successfully to 26,001 of the BLASTX annotated unigenes using BLAST2GO, which were broadly categorized into three main categories; biological process (BP), cellular component (CC) and molecular function (MF) and were further classified into 47 functional groups (Fig 2). The most abundant unigenes were in the biological process category followed by molecular function and cellular components. In the biological process category the highly represented GO terms were "metabolic process", "cellular process" and "biological regulation" while in molecular function the highest represented GO terms were "catalytic activity" and "binding" whereas in cellular component category the highest represented GO terms were "cell", "cell part" and "organelle". Somewhat similar results have been reported earlier for 'Langra' and 'Kent' leaf transcriptomes with highly represented GO terms "cell" and "cell part" in the cellular component category, "metabolic process" and "cellular process" in biological process and "catalytic activity", "binding" in molecular function categories [52,55].
Total 8,958 unigenes were assigned Enzyme Commission (EC) numbers; the highly represented enzyme classes were hydrolases (3,335) transferases (3,091) and oxidoreductases (1,437). The large number of unigenes under these three major enzyme groups indicates expression of genes related to secondary metabolite biosynthesis pathway in the mango leaves [52,53].
KASS server was used for pathway mapping and orthologous gene assignment for the assembled unigenes, and we mapped 4,977 unigenes to 349 different KEGG pathways. The most represented pathway in terms of total number of hits from the transcript data were "ribosome", "biosynthesis of amino acids", "carbon metabolism", "spliceosome", "purine metabolism" and RNA transport which is quite similar to the results with the transcriptome of mango variety 'Kent' which showed the maximum transcripts representations for "biosynthesis of amino acids", "ribosome" and "RNA transport" [55], but it was different from the results with variety 'Zill' [53], where the maximum representation of transcripts was for "metabolic pathways", "biosynthesis of secondary metabolites", and "plant-pathogen interaction" pathways (S1 Table). Unigenes belonging to different transcription factor families were identified using local similarity search (BLASTx) against plant transcription factor database (PlantTFDB v3.0) and NCBI COG database. Total 12,539 and 9,847 unigenes showed significant similarities with the PlantTFDB and COG database, respectively and were categorized in to 58 PlantTFDB and 24 COG families. There are three important transcription factor families in plants, namely bHLH, NAC and MYB that have been studied in detail, and here out of the 12,539 significant matches the maximum number of unigenes were categorized in bHLH (1,229) followed by NAC (1,059), MYB (801), WRKY (766) and B3 (763) families of transcription factors (Fig 3). The COG classification of unigenes into different functional cluster of orthologous groups (COG) based on BLASTx search classified unigenes into 24 COG categories. The largest category was of general functions (1676), followed by post-translational modifications, protein turnover, chaperones (1106), translation, ribosomal structure and biogenesis (946), energy production and conversion (643), amino acid transport and metabolism (639) and carbohydrate transport and metabolism (596). The least represented categories were for cell motility (33) and nuclear structure (3), while no unigene was categorized into extracellular structures (Fig 4).

Development and validation of genic-SSR markers
A total of 5,465 SSR loci of different categories (mono, di, tri, tetra, penta, hexa and complex) were identified in 4,912 contigs, representing 7.01% of the total 70,057 unigenes. We excluded mononucleotide repeats, complex SSR and those having total lengths of <10 bp because SSR marker based mono-nucleotide repeats are not reproducible due to recombination slippage and PCR amplification problems, whereas complex SSRs show the least polymorphism [39]. This exclusion left only 1,481 Type I SSRs for primer design. Among the SSR containing unigenes, 1,438 (95.52%) unigenes possessed a single SSR locus, while 43 (4.44%) unigenes had two or more SSR loci each. As expected for coding sequences, tri-nucleotides were the most common repeat units representing 774 (52.26%) of the total filtered SSR, followed by di-nucleotide 641 (43.28%), tetra-nucleotide 47 (3.17%), penta-nucleotide 12 (0.82%) and hexa-nucleotide 7 (0.47%) repeats. Maximum percentage of SSR repeats constituted by tri-nucleotide and di-nucleotide 985 (71.80%) while only 387 (28.20%) of SSR constituted of tetra-nucleotide, penta-nucleotide and hexa-nucleotide repeats. The most abundant SSRs were with five reiterations, the frequency of a given SSR structure and the number of repeat units in it showed an inverse relationship (S2 Fig). Hence, SSR loci with less than five repeats are expected to be even more abundant but were not included in the present investigation because they would be less useful in the study of detectable polymorphism [39]. SSR motifs showing more than ten reiterations were rare with a frequency range of 1.28%-0.06%. Total 133 distinct repeat motifs were identified in 1,438 genic-SSRs, the 11 most frequent motifs are shown in (S2 Table). Di-nucleotide repeats AG/CT, AT/AT and AC/GT were the most abundant SSR with frequencies of 19.04%, 17.95% and 6.65%, respectively. Among the tri-nucleotide repeats, AAG/CTT and ATC/GAT were the most abundant with frequencies of 17.75% and 8.33%, respectively.
PCR primers were designed successfully from the unique sequences flanking 1,069 SSR loci for the development of genic-SSR markers and were designated MSSR1 to MSSR1069 (M = Mango). Primers could not be designed for the remaining 4,396 SSR loci because their flanking sequences were either too short or the nature of sequence did not fulfill our criteria for primer design. Of the 1,069 SSR markers, 227 type I SSR loci (n ! 20 bp) (S3 Table) were filtered out and of this primers were synthesized for 100 loci for validation due to their high chance of showing polymorphism in agarose gel electrophoresis [56]. Out of the 100 synthesized primer pairs, 43 yielded PCR amplicon of expected size and were designated as "validated genic-SSR markers" (Table 2, Fig 5a). In addition, 36 primer pairs amplified !3 bands and 21 primer pairs failed to amplify even when the annealing temperature was reduced by 7°C (Fig 5b). All the amplified genic-SSR markers were scored for their amplicon size in eight varieties. Although a large proportion of the SSR loci were monomorphic, some of these will show polymorphism on analysis of a larger set of varieties. These markers have already been  utilized successfully for diversity analysis in among 96 mango cultivars in a separate study [57]. Further, use of more sensitive techniques for DNA fragment size analysis, e.g. polyacrylamide gel electrophoresis or capillary electrophoresis, is also expected to show a higher rate of polymorphism.

SNP heterozygosity in mango varieties Neelam, Dashehari and their hybrid Amrapali
The non-redundant set of 70,057 transcripts was used as a reference for mapping of high quality reads from parents ('Neelam' and 'Dashehari') and hybrid (' Amrapali') and total 42,984 SNP positions were identified with full details saved as text file. In-house shell scripts was written which reads the transcript name and its SNP position from this text file, and extracts all the mapping reads information at each SNP position, including the full read name, base call in the mapped read at SNP position and position of the SNP in the reads as well as in the reference transcript. The results were tabulated in an Excel sheet, which included variety wise information on heterozygosity at each SNP position. This helped identify level of heterozygosity in the parents and hybrid ' Amrapali' using stringent criteria. After quality filtration we identified 22,306 SNPs in 10,571 transcripts common to all the three genotypes with an average of 2.1 SNPs and a range of 1-16 SNPs per contig (Fig 6). We classified the SNPs into two categories: i) homozygous within a variety and ii) heterozygous within a variety, and found that in 'Neelam' the proportion of heterozygous SNPs was 49.3% while in 'Dashehari' it was only 30.19%. Interestingly, in their hybrid Amrapali the heterozygosity level increased to 64.5% (Table 3). Further, the 22,306 SNPs were classified in to eight categories on the basis of polymorphism in the hybrid ' Amrapali' vis-a-vis in its parental lines 'Neelam' and 'Dashehari' at the same position. ' Amrapali' showed 6,831 novel heterozygous SNP loci that were homozygous for contrasting alleles in the two parental varieties, which could be the basis of its superior performance. In addition, ' Amrapali' has maintained heterozygosity at 7,561 SNP loci that are heterozygous either in 'Neelam' (4,158 loci) or 'Dashehari' (2,049 loci) or both the parents (1,354 loci). On the other hand there were 6,760 SNP loci, which were heterozygous in either one or both the parents but became homozygous in ' Amrapali' ( Table 3). The loss of heterozygosity from 'Neelam' was more than twice as compared to 'Dashehari' , which is consistent with the higher level of heterozygosity in 'Neelam' . Surprisingly, there were 2,061 SNP loci that were homozygous for contrasting alleles in the two parents but showed homozygosity in ' Amrapali' , this was clearly due to lack of sufficient number of sequence reads from ' Amrapali' at these positions.

Conclusions
In this study we have presented development and validation of a comprehensive set of genic-SSR and SNP markers in mango by deep sequencing of leaf transcriptome using next generation sequencing technology. Considering the need to generate large number of molecular markers for mango breeding applications, a set of 288 type I genic-SSR markers were developed, and a subset of these was validated successfully for robust amplification in eight mango varieties. These markers can be used for diversity analysis and genetic mapping of useful horticultural traits in mango. In addition, an analysis of 22,306 SNP loci in hybrid mango clone ' Amrapali' and its parental lines 'Neelam' and 'Dashehari' revealed substantially higher heterozygosity in ' Amrapali' . Among the parental lines 'Neelam' showed significantly higher level of heterozygosity than 'Dashehari' . The identified genic-SNPs provide much-needed resource for the development of high density, cost-effective genotyping assays which would greatly help the mango breeding programs and genome wide association studies for the yield and quality traits in mango.