Leaf Transcriptome Sequencing for Identifying Genic-SSR Markers and SNP Heterozygosity in Crossbred Mango Variety ‘Amrapali’ (Mangifera indica L.)

  • Ajay Kumar Mahato,

    Affiliation ICAR-National Research Centre on Plant Biotechnology, Pusa Campus, New Delhi, India

  • Nimisha Sharma,

    Affiliation The Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India

  • Akshay Singh,

    Affiliation ICAR-National Research Centre on Plant Biotechnology, Pusa Campus, New Delhi, India

  • Manish Srivastav,

    Affiliation The Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India

  • Jaiprakash,

    Affiliation The Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India

  • Sanjay Kumar Singh,

    Affiliation The Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India

  • Anand Kumar Singh,

    Affiliation The Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India

  • Tilak Raj Sharma,

    Affiliation ICAR-National Research Centre on Plant Biotechnology, Pusa Campus, New Delhi, India

  • Nagendra Kumar Singh

    Affiliation ICAR-National Research Centre on Plant Biotechnology, Pusa Campus, New Delhi, India


Abstract

Mango (Mangifera indica L.) is called the “king of fruits” due to its sweetness, richness of taste, diversity, large production volume and variety of end uses. Despite its huge economic importance, genomic resources in mango are scarce and the genetics of useful horticultural traits are poorly understood. Here we generated deep coverage leaf RNA sequence data for the mango parental varieties ‘Neelam’ and ‘Dashehari’ and their hybrid ‘Amrapali’ using next generation sequencing technologies. De novo sequence assembly generated 27,528, 20,771 and 35,182 transcripts for the three genotypes, respectively. The transcripts were further assembled into a non-redundant set of 70,057 unigenes that were used for SSR and SNP identification and annotation. A total of 5,465 SSR loci were identified in 4,912 unigenes, with 288 type I SSRs (n ≥ 20 bp). One hundred type I SSR markers were randomly selected, of which 43 yielded PCR amplicons of the expected size in the first round of validation and were designated as validated genic-SSR markers. Further, 22,306 SNPs were identified by aligning high quality sequence reads of the three mango varieties to the reference unigene set, revealing significantly enhanced SNP heterozygosity in the hybrid ‘Amrapali’. The present study on leaf RNA sequencing of mango varieties and their hybrid provides a useful genomic resource for genetic improvement of mango.


Introduction

Mango (Mangifera indica L.) is an evergreen dicotyledonous angiosperm. Although several tetraploid Mangifera species are reported, cultivated mango is a diploid tree (2n = 40) with a relatively small genome size of 439 Mbp [1, 2]. The genus Mangifera belongs to the family Anacardiaceae of the order Sapindales [1]. The major mango producing countries are India, China, Thailand, Indonesia and Pakistan. India is the largest producer of mango in the world, with an annual production of 18–19 Mt from an area of 2.31 Mha, contributing about 40% of the total world production (FAOSTAT-2014 [3]). It is grown in almost all the states of India; Andhra Pradesh leads in total production, whereas Uttar Pradesh leads in area. Andhra Pradesh, Uttar Pradesh, Bihar, Karnataka, Maharashtra, West Bengal and Gujarat together contribute about 82% of the total mango production in India [4]. More than 1,000 varieties of mango exist in India today, and mango contributes 39.5% of the total fruit production in the country [5, 6]. There are 25 major commercial cultivars of mango, some of which are highly preferred in the international market. According to the Agricultural and Processed Food Products Export Development Authority (APEDA), India exported 41,280 tons of mango worth around 50.7 million US dollars during 2013–14; the United Arab Emirates, United Kingdom, Saudi Arabia, Kuwait, Qatar and United States are the major export destinations. Mango is used in various forms such as pickle, chutney, jelly, cool summer drinks and vegetable dishes. Peel and pulp of mango are rich in carotenoids and polyphenols like xanthonoid and mangiferin [7, 8]. Notwithstanding its enormous benefits, mango is a difficult plant to handle in breeding due to its long juvenile phase, high heterozygosity, heavy fruit drop and the large area required for assessment of hybrids [9–12]. Cultivars from North India have the problem of alternate bearing, while South Indian cultivars are generally regular bearers. 
‘Neelam’ is a high yielding, late season mango variety of South India that has regular bearing, medium size fruits, good flavor, yellow fibreless soft flesh and good keeping quality. ‘Dashehari’ is a mid-season variety and one of the most popular varieties of North India, with medium fruit size, sweet and pleasant flavor, thin stone, firm and fibreless pulp, and good keeping quality, but it has the problem of alternate bearing [9]. A cross between ‘Dashehari’ and ‘Neelam’ resulted in the development of ‘Amrapali’, a popular dwarf, regular bearing variety with small fruit size, good taste and good keeping quality. Molecular breeding and gene discovery in mango have been limited by a paucity of informative molecular markers, which is primarily due to non-availability of the mango genome sequence and other genomic resources. A variety of molecular markers including AFLP, RAPD, RFLP, COSII and SSR have been developed and applied to intra- and inter-specific crosses of mango, but with limited success in molecular breeding [13–17]. Among the different types of molecular markers that have been developed during the past three decades, SSR and SNP are highly informative markers [18, 19]. Microsatellite or SSR markers are among the most informative and versatile DNA-based markers used in plant genetic research [19, 20], but traditionally their development has been costly and difficult. Prior to the arrival of next generation sequencing (NGS) technologies, low throughput electrophoresis or capillary sequencing was used for SNP discovery [21, 22]. With NGS technologies and recent bioinformatics tools, identification of SNP and SSR markers in a genome has become feasible and affordable [23, 24]. NGS allows efficient identification of large numbers of microsatellites at lower cost and effort compared with traditional approaches [21, 25, 26]. 
The main advantage of developing SSR markers from NGS transcriptome sequences is the increased possibility of finding associations with functional genes and therefore with phenotypes. Microsatellites in the coding region of genes may actually regulate gene expression and function, making them a valuable resource for genetic studies and breeding applications [27]. NGS technologies provide ultra-fast and inexpensive methods for unraveling the genome and transcriptome of plants [28]. These sequencing technologies can now be used for allele mining, gene discovery and genome-wide identification of SNPs in non-model organisms [29–32]. The present study on sequencing of RNA from the leaves of mango varieties ‘Neelam’, ‘Dashehari’ and ‘Amrapali’ was aimed at identifying genic SSR and SNP loci and the level of heterozygosity in the genomes of the two parental varieties and their hybrid.

Materials and Methods

Plant material

Leaf samples of three mango genotypes, ‘Neelam’, ‘Dashehari’ and ‘Amrapali’, were taken from the orchard of the Division of Fruits and Horticultural Technology, IARI, New Delhi, India. The fresh leaves were immediately frozen in liquid nitrogen and stored at −80°C until RNA extraction. For SSR validation we used fresh leaf samples of eight popular varieties of mango, viz. ‘Neelam’, ‘Dashehari’, ‘Amrapali’, ‘Chausa’, ‘Pusa Lalima’, ‘Ratual’, ‘Mallika’ and ‘Alphonso’.

RNA isolation and library preparation for sequencing

Total RNA was extracted from the leaves using Purelink miRNA isolation kit according to manufacturer’s instructions (Invitrogen). Quantification of RNA was done using NanoDrop spectrophotometer and quality check for DNA contamination was done by electrophoresis in 1% denaturing agarose gel. To assess RNA integrity the samples were run on RNA 6000 Pico chip Bioanalyzer (Agilent). The transcriptome library was prepared after quantification and quality check of the Poly(A) RNA using SOLiD total RNA Seq kit for ‘Neelam’ and ‘Dashehari’ and Illumina MiSeq for ‘Amrapali’, respectively.

De novo transcript assembly

High quality sequence reads of cDNA libraries were used to generate transcript shotgun assembly (TSA) contigs. Transcripts for ‘Neelam’ and ‘Dashehari’ were assembled de novo using the Velvet-Oases software, which was developed for short read assembly of transcriptome data and is based on the de Bruijn graph algorithm [33, 34]. Longer sequence reads of ‘Amrapali’ were assembled using CLC Genomics Workbench (version 6.5.1) [35]. The assembled TSA contigs of the parents and hybrid were further merged into a non-redundant set of unigene contigs using CAP3 [36] with default parameters (overlap length cutoff = 40 bp and overlap percent identity cutoff = 90%). This unigene set was used for mining of SSRs and for a three-way alignment of sequence reads for SNP identification among the three mango genotypes.

SSR detection and primer designing

The final assembled unigene contigs were used for mining of genic-SSRs using MISA [37], and SSR-specific primers were designed using BatchPrimer3 V1.0 [38]. Only SSR loci containing repeat units of 2–6 nucleotides were considered. The criteria for minimum SSR length were defined as 6 reiterations for di-nucleotide SSRs and 5 reiterations for tri-, tetra-, penta- and hexa-nucleotide SSRs; mono-nucleotide repeats and complex SSRs were excluded [39]. The parameters for designing primers from SSR flanking sequences were: primer length = 20–25 bp; PCR product size = 100–250 bp; annealing temperature = 65°C; GC content = 40–60% with an optimum of 50%; and no more than a single consecutive G or C base at the 3’ end of both primers. The remaining parameters were kept at the default settings for BatchPrimer3 V1.0.
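
The repeat-count thresholds described above can be illustrated with a small sketch. This is not MISA itself: it is a simplified regular-expression search for perfect repeats only, and the function name `find_ssrs` and the `MIN_REPEATS` table are hypothetical names introduced here for illustration.

```python
import re

# Minimum repeat counts taken from the text: 6 for di-nucleotide motifs,
# 5 for tri- to hexa-nucleotide motifs. Mono-nucleotide runs are not searched.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count, length_bp) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # Group 1 captures one motif copy; the backreference demands further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves a shorter repeat (e.g. "AAA", "AGAG").
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            length = m.end() - m.start()
            hits.append((m.start(), motif, length // unit, length))
    return sorted(hits)


example = "GGC" + "AG" * 8 + "TTC" + "AAG" * 5 + "C"
for start, motif, repeats, length in find_ssrs(example):
    # A length of 20 bp or more corresponds to the Type I class used in this study.
    print(start, motif, repeats, length, "Type I" if length >= 20 else "shorter")
```

In this toy sequence the (AG)8 and (AAG)5 tracts are both reported, and neither reaches the 20 bp Type I threshold.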

SSR marker validation

Genomic DNA was isolated from leaf samples of eight genotypes using the CTAB method [40], quantified by UV260 absorbance and adjusted to a final concentration of 30 ng/μl. A set of 100 genic-SSR markers with SSR lengths of 20 bp or above (Type I SSR) were tested for amplification using genomic DNA of ‘Amrapali’ for optimization of the annealing temperature. The PCR reactions were performed in a BioRad Thermal Cycler. Each PCR reaction consisted of 1.5 μl of 10X reaction buffer, 0.20 μl of 10 mM dNTPs (133 μM), 1.5 μl each of forward and reverse primers (10 pMol), 2.5 μl of template genomic DNA (75 ng) and 0.15 μl of Taq DNA polymerase (0.75 U) in a final reaction volume of 15 μl. The PCR profile was an initial denaturation at 94°C for 5 min, followed by 35 cycles of 94°C for 1 min, 55°C for 1 min and 72°C for 1 min, and a final extension at 72°C for 7 min. Primers that did not amplify under these conditions were re-screened by decreasing the annealing temperature sequentially by 1°C, and primers producing multiple bands were re-screened by increasing the annealing temperature by 1°C [21]. The optimized SSR primers were then used for PCR amplification of multiple varieties of mango. The PCR products were separated by electrophoresis in 4% Metaphor agarose gels (Lonza, Rockland, ME, USA) containing 0.1 μg/ml ethidium bromide in 1X TBE buffer at 130 V for 4 h. After electrophoresis, PCR products were visualized and photographed using a FluorChem 5500 gel documentation system (Alpha Innotech Corp., USA). The SSR profiles were scored manually; each allele was scored as present (1) or absent (0) for each of the SSR loci. Only the SSR markers giving consistent products of the expected size (100–250 bp) were used for further analysis of variation.
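
As a quick consistency check on the reaction mix, the final amounts quoted in parentheses follow from simple dilution arithmetic (C1 × V1 = C2 × V2). The helper name `final_conc` below is hypothetical and used only for this illustration.

```python
def final_conc(stock_conc, stock_volume_ul, reaction_volume_ul):
    """Solve C1 * V1 = C2 * V2 for the final concentration C2."""
    return stock_conc * stock_volume_ul / reaction_volume_ul


REACTION_UL = 15.0  # final reaction volume stated in the text

dntp_um = final_conc(10_000.0, 0.20, REACTION_UL)  # 10 mM stock expressed in uM
buffer_x = final_conc(10.0, 1.5, REACTION_UL)      # 10X buffer diluted into 15 ul
template_ng = 30.0 * 2.5                           # 30 ng/ul stock, 2.5 ul added

print(f"dNTPs: {dntp_um:.0f} uM, buffer: {buffer_x:.0f}X, template: {template_ng:.0f} ng")
```

The computed values agree with the 133 μM dNTP and 75 ng template amounts given above.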

Annotation and functional classification of the unigene TSA contigs

For the annotation of TSA unigenes, the BLAST algorithm [41] was used to search for similarity against a locally configured non-redundant (nr) protein database of NCBI (as on 06 December, 2015) using the BLASTX program [42] with an E-value cutoff of ≤ 1e-6. The BLASTX result was saved in XML format and imported into Blast2GO software [43] to assign Gene Ontology (GO) terms to the annotated unigenes. Blast2GO classified unigenes under three GO categories: cellular component, biological process and molecular function. The GO-annotated unigenes were exported from Blast2GO in WEGO native format, and the online tool WEGO (Web Gene Ontology Annotation Plot) [44] was used to categorize the annotated unigenes into the three GO categories. For categorization of unigenes into 58 transcription factor families, unigenes were searched against the downloaded protein sequences of the Plant Transcription Factor database version 3.0 (PlantTFDB 3.0) [45] using BLASTX with an E-value cutoff of ≤1e-6. A COG classification was also performed using the same BLASTX search parameters against the NCBI COG database [46].

Pathway mapping using KAAS

KEGG Automatic Annotation Server (KAAS) [47] was used for Pathway mapping and gene ortholog assignment of the unigenes. The KAAS gives functional annotation of genes by sequence similarity comparison against the manually curated KEGG GENES database [48]. Based on the similarity hits in the KEGG database using BLASTX (default threshold bit-score value of 60), unigenes were assigned with the unique enzyme commission (EC) numbers, and further mapped to the KEGG biochemical pathways.

Multiple sequence alignment and SNP identification

The unigene set was used as the reference for mapping of high quality filtered sequence reads from all three varieties using BWA with default parameters [49]. To track and identify the variety-specific reads mapping at a particular location, we added the variety name at the end of the header of each high quality read using a shell script. SAMtools software was used for conversion of the aligned SAM file to a BAM file and for read sorting [50]. The SNPs were called using the software VarScan version 2.7 [51] with highly stringent parameters: 1) minimum 10 reads mapped at each SNP position; 2) average base quality of ≥25; 3) minimum two reads for any SNP base call in each variety. Total mapped read information, including read name and base call with respect to the reference position for all identified SNPs, was fetched from the duplicate-removed BAM file using shell scripts, and the final results were filtered and tabulated.
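
The three filtering criteria and the per-variety zygosity call can be sketched as follows. This is a minimal illustration, not the VarScan implementation or the authors' shell scripts. It assumes that the 10-read minimum applies to the combined depth across varieties, and the names `classify_site` and `call_variety` are hypothetical.

```python
from collections import Counter

MIN_DEPTH = 10        # criterion 1: reads covering the SNP position (assumed combined depth)
MIN_AVG_QUAL = 25     # criterion 2: mean base quality at the position
MIN_ALLELE_READS = 2  # criterion 3: reads supporting a base call within a variety


def call_variety(bases):
    """Return the set of alleles supported by at least MIN_ALLELE_READS reads."""
    counts = Counter(bases)
    return {base for base, n in counts.items() if n >= MIN_ALLELE_READS}


def classify_site(pileup, quals):
    """Classify one SNP position.

    pileup: {variety: [base, ...]} read bases tagged by variety name.
    quals:  base qualities of all reads at the position.
    Returns {variety: 'het' | 'hom' | 'no_call'}, or None if the site fails the filters.
    """
    depth = sum(len(bases) for bases in pileup.values())
    if depth < MIN_DEPTH or not quals or sum(quals) / len(quals) < MIN_AVG_QUAL:
        return None
    result = {}
    for variety, bases in pileup.items():
        alleles = call_variety(bases)
        if not alleles:
            result[variety] = "no_call"
        elif len(alleles) == 1:
            result[variety] = "hom"
        else:
            result[variety] = "het"
    return result


site = {"Neelam": list("AAAAGG"), "Dashehari": list("GGGG"), "Amrapali": list("AAGG")}
print(classify_site(site, [30] * 14))
```

Tagging each read with its variety, as described above, is what makes the per-variety grouping in `pileup` possible after all reads are mapped to one shared reference.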

Results and Discussion

Functional categories of genes expressed in the mango leaves

A total of 60,359,815, 58,212,961 and 4,853,226 raw sequence reads were generated for ‘Neelam’ (mango_N), ‘Dashehari’ (mango_D) and ‘Amrapali’ (mango_A), respectively, using two runs of SOLiD sequencing with an average read length of 50 bp (mango_N and mango_D) and one run of Illumina MiSeq 2x250 (mango_A) with an average read length of 250 bp. After quality check, adapter trimming and removal of low quality reads, 53,617,132, 47,818,267 and 4,313,270 high quality reads were retained for ‘Neelam’, ‘Dashehari’ and ‘Amrapali’, respectively. We performed three separate de novo assemblies using the Velvet-Oases assembly pipeline for the SOLiD data and CLC Genomics Workbench 6.5.1 for the MiSeq data. The assembly resulted in 27,528, 20,771 and 35,182 transcripts for mango_N, mango_D and mango_A, with transcript N50 values of 557 bp, 451 bp and 591 bp and largest transcript sizes of 3,129 bp, 2,958 bp and 5,891 bp, respectively (Table 1).
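
For readers unfamiliar with the N50 statistic reported above, a minimal sketch of the usual definition is given here: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases. The function name `n50` and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the cumulative sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```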

Table 1. NGS sequence and assembly statistics of the leaf transcriptomes of mango varieties Neelam and Dashehari and their hybrid Amrapali.

TSA contigs of individual genotypes were further assembled into a non-redundant set of 70,057 unigenes, which was used as the reference for the identification of genic-SSRs and SNPs (Fig 1). The mean size of earlier reported 85,651 unigene contigs is 415 bp for the leaf transcriptome of mango variety ‘Langra’ with mean length of 238 bp [52], which is much lower than the present result. Further, sequencing of the pooled transcriptome from pericarp and pulp of mango variety ‘Zill’ resulted in 124,002 transcripts with an average size of 838 bp [53]. The transcriptome of mango variety ‘Shelly’ generated 57,544 transcripts with an average length of 863 bp [54], and the mesocarp transcriptome of mango variety ‘Kent’ is reported with 80,969 transcripts having a mean length of 836 bp and an N50 of 1,456 bp [55]. These values are significantly larger than our results because those transcriptomes were sequenced using only the Illumina platform, which produces longer reads than SOLiD sequencing.

Fig 1. Flow diagram of mango transcriptome data analysis.

Assembly, annotation and SSR/SNP identification in three varieties of mango (‘Neelam’, ‘Dashehari’ and ‘Amrapali’).

Raw sequence data described in this paper can be found in the Sequence Read Archive (SRA) database of the NCBI with SRA accession numbers SRR1298995, SRR1297075 and SRR1956775 under BioProject numbers PRJNA193591, PRJNA193588 and PRJNA279829, with TSA accession numbers GBVX00000000, GBVW00000000 and GEEEC00000000 for ‘Dashehari’, ‘Neelam’ and ‘Amrapali’, respectively.

For the functional annotation 70,057 unigene contigs were searched in the NCBI-nr protein database using BLASTx. As a result 39,798 (56.80%) of the unigenes showed significant similarity to known proteins and were functionally annotated while 30,259 (43.2%) unigenes showed no significant hits. Mango unigenes showed the highest similarity with Citrus sinensis (28.76%), followed by Citrus clementina (17.43%), Theobroma cacao (6.8%), Jatropha curcas (4.44%) and Vitis vinifera (4.19%) (S1 Fig). This result is consistent with the phylogenetic study of mango chloroplast DNA which reported Citrus to be most closely related to Mangifera indica [53].

Gene ontology (GO) terms were assigned successfully to 26,001 of the BLASTX annotated unigenes using BLAST2GO, which were broadly categorized into three main categories; biological process (BP), cellular component (CC) and molecular function (MF) and were further classified into 47 functional groups (Fig 2). The most abundant unigenes were in the biological process category followed by molecular function and cellular components. In the biological process category the highly represented GO terms were “metabolic process”, “cellular process” and “biological regulation” while in molecular function the highest represented GO terms were “catalytic activity” and “binding” whereas in cellular component category the highest represented GO terms were “cell”, “cell part” and “organelle”. Somewhat similar results have been reported earlier for ‘Langra’ and ‘Kent’ leaf transcriptomes with highly represented GO terms “cell” and “cell part” in the cellular component category, “metabolic process” and “cellular process” in biological process and “catalytic activity”, “binding” in molecular function categories [52, 55].

Fig 2. Gene Ontology (GO) classification of annotated mango leaf transcripts.

Out of 70,057 transcripts, a total of 26,001 transcripts were classified into three main GO categories: Biological Processes, Cellular Component and Molecular Function.

A total of 8,958 unigenes were assigned Enzyme Commission (EC) numbers; the most highly represented enzyme classes were hydrolases (3,335), transferases (3,091) and oxidoreductases (1,437). The large number of unigenes under these three major enzyme groups indicates expression of genes related to secondary metabolite biosynthesis pathways in the mango leaves [52, 53].

The KAAS server was used for pathway mapping and orthologous gene assignment for the assembled unigenes, and we mapped 4,977 unigenes to 349 different KEGG pathways. The most represented pathways in terms of total number of hits from the transcript data were “ribosome”, “biosynthesis of amino acids”, “carbon metabolism”, “spliceosome”, “purine metabolism” and “RNA transport”. This is quite similar to the results for the transcriptome of mango variety ‘Kent’, which showed maximum transcript representation for “biosynthesis of amino acids”, “ribosome” and “RNA transport” [55], but differs from the results for variety ‘Zill’ [53], where the maximum representation of transcripts was for the “metabolic pathways”, “biosynthesis of secondary metabolites” and “plant-pathogen interaction” pathways (S1 Table). Unigenes were also compared by local similarity search (BLASTx) against the plant transcription factor database (PlantTFDB v3.0) and the NCBI COG database. A total of 12,539 and 9,847 unigenes showed significant similarity with the PlantTFDB and COG databases, respectively, and were categorized into 58 PlantTFDB and 24 COG families. Three important transcription factor families in plants, namely bHLH, NAC and MYB, have been studied in detail; here, out of the 12,539 significant matches, the maximum number of unigenes were categorized in the bHLH family (1,229), followed by the NAC (1,059), MYB (801), WRKY (766) and B3 (763) families of transcription factors (Fig 3). The COG classification based on the BLASTx search placed unigenes into 24 functional categories of clusters of orthologous groups. 
The largest category was of general functions (1676), followed by post-translational modifications, protein turnover, chaperones (1106), translation, ribosomal structure and biogenesis (946), energy production and conversion (643), amino acid transport and metabolism (639) and carbohydrate transport and metabolism (596). The least represented categories were for cell motility (33) and nuclear structure (3), while no unigene was categorized into extracellular structures (Fig 4).

Fig 3. Summary of 12,539 unigenes of M. indica classified into 58 transcription factor (TF) families.

Among them bHLH, NAC, MYB, WRKY proteins were the most abundant.

Fig 4. COG functional classification.

A total of 9,847 unigenes were assigned to 24 COG categories.

Development and validation of genic-SSR markers

A total of 5,465 SSR loci of different categories (mono, di, tri, tetra, penta, hexa and complex) were identified in 4,912 contigs, representing 7.01% of the total 70,057 unigenes. We excluded mono-nucleotide repeats, complex SSRs and those having total lengths of <10 bp because SSR markers based on mono-nucleotide repeats are not reproducible due to recombination slippage and PCR amplification problems, whereas complex SSRs show the least polymorphism [39]. This exclusion left only 1,481 Type I SSRs for primer design. Among the SSR-containing unigenes, 1,438 (95.52%) unigenes possessed a single SSR locus, while 43 (4.44%) unigenes had two or more SSR loci each. As expected for coding sequences, tri-nucleotides were the most common repeat units, representing 774 (52.26%) of the total filtered SSRs, followed by di-nucleotide 641 (43.28%), tetra-nucleotide 47 (3.17%), penta-nucleotide 12 (0.82%) and hexa-nucleotide 7 (0.47%) repeats. Tri-nucleotide and di-nucleotide repeats constituted the largest share of SSRs, 985 (71.80%), while only 387 (28.20%) SSRs consisted of tetra-nucleotide, penta-nucleotide and hexa-nucleotide repeats. The most abundant SSRs were those with five reiterations, and the frequency of a given SSR structure showed an inverse relationship with the number of repeat units in it (S2 Fig). Hence, SSR loci with fewer than five repeats are expected to be even more abundant but were not included in the present investigation because they would be less useful for detecting polymorphism [39]. SSR motifs showing more than ten reiterations were rare, with frequencies ranging from 0.06% to 1.28%. A total of 133 distinct repeat motifs were identified in 1,438 genic-SSRs; the 11 most frequent motifs are shown in S2 Table. Di-nucleotide repeats AG/CT, AT/AT and AC/GT were the most abundant SSRs with frequencies of 19.04%, 17.95% and 6.65%, respectively. 
Among the tri-nucleotide repeats, AAG/CTT and ATC/GAT were the most abundant with frequencies of 17.75% and 8.33%, respectively.
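
Motif labels such as AG/CT or ATC/GAT group together every rotation of a repeat unit on both DNA strands. The sketch below shows one way such labels can be derived. It is an illustrative assumption about the grouping convention, not the code of the tool used in this study, and `motif_class` is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic rotations of a motif, e.g. AAG -> AAG, AGA, GAA."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def motif_class(motif):
    """Label a motif by its alphabetically smallest rotation on either strand."""
    motif = motif.upper()
    canonical = min(rotations(motif) + rotations(reverse_complement(motif)))
    return f"{canonical}/{reverse_complement(canonical)}"


for unit in ["GA", "TC", "TA", "TG", "CTT", "GAT"]:
    print(unit, "->", motif_class(unit))
```

Under this convention GA and TC both fall in the AG/CT class, and GAT falls in the ATC/GAT class, matching the labels used in the text.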

PCR primers were designed successfully from the unique sequences flanking 1,069 SSR loci for the development of genic-SSR markers, which were designated MSSR1 to MSSR1069 (M = Mango). Primers could not be designed for the remaining 4,396 SSR loci because their flanking sequences were either too short or did not fulfill our criteria for primer design. Of the 1,069 SSR markers, 227 type I SSR loci (n ≥ 20 bp) (S3 Table) were filtered out, and of these, primers were synthesized for 100 loci for validation because of their high chance of showing polymorphism in agarose gel electrophoresis [56]. Out of the 100 synthesized primer pairs, 43 yielded a PCR amplicon of the expected size and were designated as “validated genic-SSR markers” (Table 2, Fig 5a). In addition, 36 primer pairs amplified ≥3 bands and 21 primer pairs failed to amplify even when the annealing temperature was reduced by 7°C (Fig 5b). All the amplified genic-SSR markers were scored for their amplicon size in eight varieties. Although a large proportion of the SSR loci were monomorphic, some of these are expected to show polymorphism on analysis of a larger set of varieties. These markers have already been utilized successfully for diversity analysis among 96 mango cultivars in a separate study [57]. Further, use of more sensitive techniques for DNA fragment size analysis, e.g. polyacrylamide gel electrophoresis or capillary electrophoresis, is also expected to show a higher rate of polymorphism.

Table 2. Details of 43 validated type I genic-SSR markers tested for polymorphism among 8 mango varieties.

Fig 5. Wet lab validation of in silico designed genic-SSR markers of mango.

a) PCR results with 48 different SSR markers (MSSR1- MSSR 48) in mango variety Amrapali; b) Allelic polymorphism of MSSR-13 in 8 different varieties of Mango 1. Neelam, 2. Dashehari, 3. Amrapali, 4. Chausa, 5. Pusa Lalima, 6. Ratual, 7. Mallika and 8. Alphonso.

SNP heterozygosity in mango varieties Neelam, Dashehari and their hybrid Amrapali

The non-redundant set of 70,057 transcripts was used as a reference for mapping of high quality reads from the parents (‘Neelam’ and ‘Dashehari’) and the hybrid (‘Amrapali’), and a total of 42,984 SNP positions were identified, with full details saved as a text file. An in-house shell script was written that reads the transcript name and its SNP position from this text file and extracts all the mapped read information at each SNP position, including the full read name, the base call in the mapped read at the SNP position and the position of the SNP in the read as well as in the reference transcript. The results were tabulated in an Excel sheet, which included variety-wise information on heterozygosity at each SNP position. This helped identify the level of heterozygosity in the parents and the hybrid ‘Amrapali’ using stringent criteria. After quality filtration we identified 22,306 SNPs in 10,571 transcripts common to all three genotypes, with an average of 2.1 SNPs and a range of 1–16 SNPs per contig (Fig 6).

Fig 6. Frequency distribution of number of SNPs in mango leaf transcripts.

Transcripts with a single SNP were the most abundant.

We classified the SNPs into two categories: i) homozygous within a variety and ii) heterozygous within a variety, and found that in ‘Neelam’ the proportion of heterozygous SNPs was 49.3%, while in ‘Dashehari’ it was only 30.19%. Interestingly, in their hybrid ‘Amrapali’ the heterozygosity level increased to 64.5% (Table 3). Further, the 22,306 SNPs were classified into eight categories on the basis of polymorphism in the hybrid ‘Amrapali’ vis-a-vis its parental lines ‘Neelam’ and ‘Dashehari’ at the same position. ‘Amrapali’ showed 6,831 novel heterozygous SNP loci that were homozygous for contrasting alleles in the two parental varieties, which could be the basis of its superior performance. In addition, ‘Amrapali’ has maintained heterozygosity at 7,561 SNP loci that are heterozygous either in ‘Neelam’ (4,158 loci) or ‘Dashehari’ (2,049 loci) or in both parents (1,354 loci). On the other hand, there were 6,760 SNP loci that were heterozygous in either one or both parents but became homozygous in ‘Amrapali’ (Table 3). The loss of heterozygosity from ‘Neelam’ was more than twice that from ‘Dashehari’, which is consistent with the higher level of heterozygosity in ‘Neelam’. Surprisingly, there were 2,061 SNP loci that were homozygous for contrasting alleles in the two parents but showed homozygosity in ‘Amrapali’; this was clearly due to lack of a sufficient number of sequence reads from ‘Amrapali’ at these positions.
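
The heterozygosity figure for ‘Amrapali’ can be reproduced from the class counts given above. The short check below assumes that the novel and retained heterozygous classes together account for all heterozygous loci in the hybrid; the variable names are illustrative.

```python
TOTAL_SNPS = 22_306

novel_het = 6_831  # homozygous for contrasting alleles in both parents
retained = {"Neelam only": 4_158, "Dashehari only": 2_049, "both parents": 1_354}

retained_het = sum(retained.values())
amrapali_het = novel_het + retained_het
amrapali_het_pct = 100 * amrapali_het / TOTAL_SNPS

print(retained_het, amrapali_het, round(amrapali_het_pct, 1))
```

The three retained classes sum to the 7,561 loci stated in the text, and the combined 14,392 heterozygous loci correspond to the reported 64.5% of 22,306 SNPs.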

Table 3. Categorization of 22,306 SNPs into eight classes on the basis of heterozygosity in the hybrid Amrapali vis a vis its parental varieties Neelam and Dashehari.


Conclusions

In this study we have presented the development and validation of a comprehensive set of genic-SSR and SNP markers in mango by deep sequencing of the leaf transcriptome using next generation sequencing technology. Considering the need to generate a large number of molecular markers for mango breeding applications, a set of 288 type I genic-SSR markers was developed, and a subset of these was validated successfully for robust amplification in eight mango varieties. These markers can be used for diversity analysis and genetic mapping of useful horticultural traits in mango. In addition, an analysis of 22,306 SNP loci in the hybrid mango clone ‘Amrapali’ and its parental lines ‘Neelam’ and ‘Dashehari’ revealed substantially higher heterozygosity in ‘Amrapali’. Among the parental lines, ‘Neelam’ showed a significantly higher level of heterozygosity than ‘Dashehari’. The identified genic-SNPs provide a much-needed resource for the development of high density, cost-effective genotyping assays, which would greatly help mango breeding programs and genome-wide association studies for yield and quality traits in mango.

Supporting Information

S1 Fig. Plant species showing top BLASTx hit with mango transcripts.

Mango transcripts showed highest similarity with Citrus sinensis (28.76%), Citrus clementina (17.43%), Theobroma cacao (6.8%), Jatropha curcas (4.44%) and Vitis vinifera (4.19%).


S2 Fig. The frequency of SSR structure and the number of repeat units.

The most abundant SSRs were those with five reiterations, and the frequency of a given SSR structure showed an inverse relationship with the number of repeat units in it.


S1 Table. List of transcripts with KEGG Pathway Id and their categorization into various functional classes.


S2 Table. Frequency distribution of SSR loci with different repeat motifs and number of repeats in the unigenes.


S3 Table. List of genic-SSR markers along with their flanking primer sequences developed from mango leaf transcriptome unigenes.



Acknowledgments

We are thankful to the ICAR-National Research Centre on Plant Biotechnology, New Delhi, India and the Division of Fruits and Horticultural Technology, ICAR-Indian Agricultural Research Institute, Pusa, New Delhi, India for facilities and support.

Author Contributions

  1. Conceptualization: NKS AKM.
  2. Data curation: AKM NS AS.
  3. Funding acquisition: NKS AKS.
  4. Methodology: AKM.
  5. Project administration: NKS AKM.
  6. Resources: NKS AKS MS J SKS TRS.
  7. Software: AKM.
  8. Supervision: NKS.
  9. Validation: AKM NS.
  10. Visualization: AKM.
  11. Writing – original draft: AKM NS.
  12. Writing – review & editing: NKS TRS AKM.


  1. 1. Mukherjee SK. Cytological investigation of the mango (Mangifera indica L.) and the allied Indian species. Proceedings of National Institute of Science of India, New Delhi, 1950; 16:287–303.
  2. 2. Arumuganathan K, Earle ED. Nuclear DNA content of some important plant species. Plant Mol. Biol. 1991; Rep. 9: 208–218.
  3. 3. FAOSTAT. Available online: (accessed on 15 November 2013).
  4. 4. Tharanathan RN, Yashoda HM, Prabha TN. Mango (Mangifera indica L.), “The King of Fruits”—An Overview. Food Rev. Int. 2006;22:37–41.
  5. 5. Singh NK, Mahato AK, Sharma N, Gaikwad K, Srivastava M, Tiwari K, et al. A draft genome of the king of fruit, mango (Mangifera indica L.). Plant and Animal Genome XXII Conference. 2014.
  6. 6. Singh NK, Mahato AK, Jayaswal PK, Singh A, Singh S, Singh N, et al. Origin, Diversity and Genome Sequence of Mango (Mangifera indica L.) Indian Journal of History of Science. 2016; 51.2.2:355–368.
  7. 7. Mahattanatawee K, Manthey JA, Luzio G, Talcott ST, Goodner K, Baldwin EA. Total antioxidant activity and fiber content of select Florida-grown tropical fruits. J. Agric. Food Chem. 2006; 54 (19): 7355–63. pmid:16968105
  8. 8. Rocha Ribeiro SM, Queiroz JH, Lopes Ribeiro de Queiroz ME, Campos FM, Pinheiro Santana HM. Antioxidant in mango (Mangifera indica L.) pulp. Plant Foods Hum Nutr. 2007; 62 (1): 13–7. pmid:17243011
  9. 9. Schnell RJ, Knight RJ. Frequency of zygotic seedlings from five polyembryonic mango rootstocks. Hort. Sci. 1992;27: 174–6.
  10. 10. Truscott M. Biochemical screening of polyploid mango seedlings. CSFRI Information Bulletin. 1992;237: 17–8.
  11. 11. Degani C, Cohen M, El-Batsri R, Gazit S. PGI isozyme diversity and its genetic control in mango. Hort. Sci. 1992;27: 252–4.
  12. 12. Degani C, Cohen M, Reuvani O, El-Batsri R, Gazit S. Frequency and characteristic of zygotic seedlings from polyembryonic mango cultivars determined using isozymes as genetic markers. Acta Horticulturae. 1993;341: 78–85.
  13. 13. Hirano R, Htun Oo T, Watanabe KN. Myanmar mango landraces reveal genetic uniqueness over common cultivars from Florida, India, and Southeast Asia. Genome. 2010; 53(4):321. pmid:20616863
  14. 14. Khan IA, Azim MK. Variations in intergenic spacer rpl20- rps12 of Mango (Mangifera indica) chloroplast DNA: implications in cultivar identification. Plant Evol Syst. 2011; 292(3–4):249–255.
  15. 15. Ravishankar KV, Mani BH, Anand L, Dinesh MR. Development of new microsatellite markers from mango (Mangifera indica) and cross-species amplification. American Journal of Botany. 2011;98: e 96 –e99. pmid:21613158
  16. 16. Souza IG, Valente SE, Britto FB, de Souza VA, Lima PS. RAPD analysis of the genetic diversity of mango (Mangifera indica) germplasm in Brazil. Genet. Mol. Res. 2011; 10(4):3080–3089. pmid:22194163
  17. 17. Srivastava N, Bajpai A, Chandra R, Rajan S, Muthukumar M, Srivastava MK. Comparison of PCR based marker systems for genetic analysis in different cultivars of mango. J. Environ. Biol. 2012; 33(2):159–166. pmid:23033674
  18. 18. Buschiazzo E, Gemmell NJ. The rise, fall and renaissance of microsatellites in eukaryotic genomes. BioEssays. 2006;28: 1040–1050. pmid:16998838
  19. 19. Kelkar YD, Tyekucheva S, Chiaromonte F, Makova KD. The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Research. 2008;18: 30–38. pmid:18032720
  20. 20. Csencsics D, Rodbeck SB, Holderegger R. Cost-effective, species-specific microsatellite development for the endangered dwarf bulrush (Typha minima) using next-generation sequencing technology. Journal of Heredity. 2010;101: 789–793. pmid:20562212
  21. 21. Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V,et al. Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biology. 2011;11: 17–29. pmid:21251263
  22. 22. Van Deynze A, Stoffel K, Buell CR, Kozik A, Liu J, van der Knaap E, et al. Diversity in conserved genes in tomato. BMC Genomics. 2007; 8:465. pmid:18088428
  23. 23. Van Deynze A, Stoffel K, Lee M, Wilkins TA, Kozik A, Cantrell RG, et al. Sampling nucleotide diversity in cotton. BMC Plant Biol. 2009; 9:125. pmid:19840401
  24. 24. Ashrafi H, Hill T, Stoffel K, Kozik A, Yao JQ, Chin-Wo SR, et al. De novo assembly of the pepper Transcriptome (Capsicum annuum): A benchmark for in silico discovery of SNPs, SSRs and candidate genes. BMC Genomics. 2012; 13:571. pmid:23110314
  25. 25. Edwards’s D, Batley J. Plant genome sequencing: applications for crop improvement. Plant Biotechnol. J. 2010;8: 2–9. pmid:19906089
  26. 26. Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, et al. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. American Journal of Botany. 2012; 99 (2): 193–208. pmid:22186186
  27. 27. Ekblom R,Galindo J. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity.2011; 107: 1–15. pmid:21139633
  28. 28. Li YC, Korol AB, Fahima T, Beiles A, Nevo E. Microsatellites: Genomic distribution, putative functions, and mutational mechanisms: A review. Molecular Ecology. 2002;11: 2453–2465. pmid:12453231
  29. 29. Kircher M, Kelso J. High-throughput DNA sequencing—concepts and limitations. Bioessays. 2010; 32(6):524–36. pmid:20486139
  30. 30. Barabaschi D, Guerra D, Lacrima K, Laino P, Michelotti V, Urso S, et al. Emerging knowledge from genome sequencing of crop species. Mol. Biotechnol. 2012;50: 250–266. pmid:21822975
  31. 31. Egan AN, Schlueter J, Spooner DM. Applications of next-generation sequencing in plant biology. Amer. J. Bot. 2012;99: 175–185. pmid:22312116
  32. 32. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24:133–141. pmid:18262675
  33. 33. Zerbino D, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18:821–829. pmid:18349386
  34. 34. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012; Apr 15; 28(8):1086–92. pmid:22368243
  35. 35.
  36. 36. Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. 1999; Sep;9(9):86. pmid:10508846
  37. 37. (15 June 2012, date last accessed).
  38. 38. You FM, Huo N, Gu YQ, Luo MC, Ma Y, Hane D, et al. BatchPrimer3: A high throughput web application for PCR and sequencing primer design. BMC Bioinformatics. 2008; 9:253. pmid:18510760
  39. 39. Dutta S, Mahato AK, Sharma P, Raje RS, Sharma TR, Singh NK. Highly variable ‘Arhar’ simple sequence repeat markers for molecular diversity and phylogenetic studies in pigeonpea [Cajanus cajan (L.) Millisp.] Plant Breeding. 2012;132(2):1439–0523.
  40. 40. Doyle JJ, DOYLE JL. Isolation of plant DNA from fresh tissue. Focus (San Francisco, Calif.) 1990; 12: 13–15.
  41. 41. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410. pmid:2231712
  42. 42. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–3402. pmid:9254694
  43. 43. Conesa A, Getz S, Garcia-Gomez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005; 21(18):3674–3676. pmid:16081474
  44. 44. Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, et al. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006; 34: W293–W297. pmid:16845012
  45. 45. Jin J, Zhang H, Kong L, Gao G, Luo J. PlantTFDB 3.0: a portal for the functional and evolutionary study of plant transcription factors. Nucleic Acids Res. 2014; 42: D1182–1187. pmid:24174544
  46. 46. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003; 4(1):41. pmid:12969510
  47. 47. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Research. 2007; 35(Web Server issue):W182–W185. pmid:17526522
  48. 48. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28, 27–30. pmid:10592173
  49. 49. Li H, Durbin R. Fast and accurate short read alignment with Burrows—Wheeler transform. Bioinformatics. 2009; 25(14):1754–1760. pmid:19451168
  50. 50. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25: 2078–2079. pmid:19505943
  51. 51. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research. 2012; 22: 568–576. pmid:22300766
  52. 52. Azim KM, Khan IA, Zhang Y. Characterization of mango (Mangifera indica L.) Transcriptome and chloroplast genome. Plant Mol Biol. 2014; 85:193–208 pmid:24515595
  53. 53. Wu H, Jia H, Ma X, Wang S, Yao Q, Xu W, et al. Transcriptome and proteomic analysis of mango (Mangifera indica Linn) fruits Journal of proteomics. 2014; pmid:24704857
  54. 54. Luria N, Sela N, Yaari M, Feygenberg O, Kobiler I, Lers A, et al. De-novo assembly of mango fruit peel transcriptome reveals mechanisms of mango response to hot water treatment. BMC Genomics. 2014; 15:957. pmid:25373421
  55. 55. Mitzuko DC, Adrian OL, Carmen Arminda CV, Magda Adelina PS, Sergio CF, Alejandro SF, et al. Mango (Mangifera indica L.) cv. Kent fruit mesocarp de novo transcriptome assembly identifies gene families important for ripening. Frontiers in Plant Science. 2015; Vol-6. pmid:25741352
  56. 56. Singh H, Deshmukh RK, Gaikwad K, Sharma TR, Mohapatra T, Singh NK. Highly variable SSR markers suitable for rice genotyping using agarose gels. Mol Breeding. 2010; 25:359–364.
  57. 57. Lal S. Identification of genomic regions for alternate bearing and fruit quality traits in mango (Mangifer indica L.). 2016; Ph.D. thesis. Fruits and Horticultural Technology, P.G. School, ICAR- Indian Agriculture research Institute, New Delhi, pp 1–93.