Development and characterization of a new set of genomic microsatellite markers in rice bean (Vigna umbellata (Thunb.) Ohwi and Ohashi) and their utilization in genetic diversity analysis of collections from North East India

Rice bean [Vigna umbellate (Thumb.) Ohwi and Ohashi] is an underutilized crop believed to be domesticated in the Myanmar-Thailand region of Asia. In India, rice bean is mainly cultivated in the North-Eastern Hills, which is a hotspot for biological diversity. A 5' anchored PCR was used to develop microsatellite markers in rice bean. Twenty-eight specific primer pairs were designed and employed to characterize sixty five ricebean accessions collected from North East India. A total of 179 alleles were amplified with an average of 6.393 alleles per locus. The gene diversity was high (mean 0.534) in the accessions collected from Darjeeling, Nagaland and Manipur, which are bordering areas with East Nepal and Myanmar, respectively. Exceptionally high outcrossing rate was observed in the entire population. Population structure analysis identified three distinct clusters in which accessions collected from areas bordering Myanmar and East Nepal grouped separately. Using a combination of STRUCTURE and Principal Coordinate Analysis, relative affinity of the intermediate accessions could be established. However, differences in allelic counts among populations were non-significant. The results showed that there is a high level of genetic diversity within the accessions, with high outcrossing rate.


Introduction
Rice bean [Vigna umbellate (Thunb) Ohwi and Ohashi] is a less utilized grain legume cultivated mainly in Nepal, Bhutan, North East India up to Myanmar, Southern China, Northern a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Thailand, Laos, Vietnam, Indonesia and East Timor [1]. It is mainly grown under shifting cultivation and used as vegetable, pulse, fermented pulse, fodder and in folk medicine. It is one of the eight Vigna species domesticated in Asia, and is very closely related to adzuki bean [V. angularis(Willd.) Ohwi and Ohashi] [2,3,4]. The original center of domestication of the Asian Vignaspecies is thought to be the Indo China region, and they belong to the subgenus Ceratotropis, comprising 3 sections, Ceratotropis, Aconitifoliaeand Angulares. Angularesis the most recent and diversified section of the subgenus [5], to which rice bean belongs. Rice bean is believed to be domesticated from a wild form V. umbellate vargracilis, which is a cross fertile type. These wild types have been reported to occur in natural and disturbed habitats and are of indeterminate, photoperiod-sensitive, freely-branching, twining plant types with small seeds. Most rice bean landraces cultivated in North Eastern India are similar to these wild types.
In India, as a cultigen its distribution is mainly confined to the tribal regions of North Eastern hills and hilly tracts of Western and Eastern Ghats [6]. In the North Eastern Hill Region of India (Arunachal Pradesh, Manipur, Meghalaya, Mizoram, Nagaland, Sikkim, and Tripura) rice bean is grown predominantly under rainfed conditions in a mixed farming system, under shifting cultivation or in kitchen gardens and backyards. It is grown on a limited scale in the eastern and western peninsular tracts of India.
In a study of the genetics of domestication of rice bean, Isemura et al. [7] reported high collinearity between adzuki bean and rice bean in the order of molecular markers. Similarity in the domestication genes between the genome of the two species was also reported in the same study. They concluded that the elite genes of rice bean hold great potential for its improvement as well as its allied species. However, improvement and utilization of rice bean through conventional or molecular breeding still lag behind other crops like cowpea and mung bean in the genus Vigna [8].Genetic diversity studies with a limited number of accessions from China, Nepal and Thailand have been reported earlier [9,10,11]. Wang et al. [8] studied genetic diversity in 230 accessions of rice bean collected from 12 provinces of China using mung bean SSRs. They reported a clustered yet generalized distribution of the accessions suggesting a wide exchange of germplasm across the provinces. However, the most comprehensive study so far with a very wide range of cultivated and wild rice bean accessions is by Tian et al. [1]. They studied 472 accessions (cultivated and wild) from 16 Asian countries and reported high gene diversity (>0.5) in the South and South East Asian accessions. In all these studies, number of Indian germplasm was negligible. Muthusamy et al. [12] studied genetic diversity in 10 rice bean germplasm collected from Meghalaya, a North Eastern Himalayan state of India and reported high genetic diversity. Studies of genetic diversity with a considerable number of germplasm from Indian Himalayan region, where rice bean is mainly cultivated, has not been conducted.
Simple sequence repeats (SSRs) or microsatellites are motifs of short tandem repeats that may vary in the repeat units at a given locus. SSRs have specific advantages over other markers as they occur frequently and randomly in all eukaryotic DNA [13], are multiallelic and are amenable to both manual scoring and automation [14]. SSRs have been used in diversity analysis [15,16], genetic fidelity analysis [17], gene mapping and species identification [18] and linkage analysis [19].
Among different crop species, the frequencies and occurrence of the two most common dinucleotide repeats {(AC)n and (GA)n} have been worked out in relatively greater detail [20]. Trinucleotide and tetranucleotide repeats are also found in plant genomes, the most frequent among them being (AAG)n, and (AAT)n [13]. In another survey of published DNA sequences in 54 plant species, Wang et al. [21] observed that the (AT)n, sequences were the most abundant followed by (A)n, (AG)n, (AAT)n, (AAC)n, (AGC)n, (AAG)n, (AATT)n, (AAAT)n and (AC)n.
Many different strategies for obtaining microsatellite DNA loci have been described. The simplest approach; cloning small genomic fragments and using radiolabeled oligonucleotide probes of microsatellite repeats to identify clones with microsatellites, was the first described and work well in organisms with abundant microsatellite loci [22,23,24]. Unfortunately, this approach does not work properly when microsatellite repeats are less abundant. Later, two types of enrichment strategies were developed: 1) uracil-DNA selection [25] and 2) hybridization capture [26,27]. Hybridization capture is the predominant strategy in use because it allows selection prior to cloning, and therefore, is faster and easier to do with multiple samples than uracil-DNA selection, which requires passage of each library through two bacterial strains. With the availability of largescale sequence database, development of microsatellite markers from expressed sequence tags (EST) has now become routine in many laboratories as it is fast and does not require elaborate facilities. Introduction of high-throughput next-generation DNA sequencing made development of microsatellite markers much easier and faster. While the EST-based SSRs still suffer from the drawback that gene / genes not expressing at a particular organ / time are not included, high-throughput DNA sequencing requires substantial investment. The 5'-anchored polymerase chain reaction [28,29] is one of the simplest and fast methodsfor developing SSR enriched genomic libraries and subsequent isolation of microsatellites that does not require substantial investment.

Plant materials and DNA extraction
Sixty-five accessions of rice bean collected from six North Eastern Himalayan states of India (Fig 1) were used for the study. All genotypes used for the diversity analysis are described in Table 1. These were maintained at the Division of Plant Breeding (re-named Crop Improvement), ICAR Research Complex for NEH Region, Umiam, Meghalaya. The total genomic DNA was extracted from leaves of 14-day old seedlings (one per accession) using the Plant DNA Isolation Kit (GCC Biotech, India) following the protocol of the manufacturer. DNA sample concentration was determined using a NanoDrop2000 (Thermo Fisher Corporation Inc.), and the DNA samples were diluted to 10 ng/μl prior to polymerase chain reaction (PCR) amplification.

Microsatellite library construction by 5' anchored PCR
A microsatellite library was constructed using the method of Fisher et al. [28]. Briefly, degenerate primers designed to anchor at the 5' end of microsatellite sequences were used to amplify genomic DNA. Di-repeated primers (PCT2 and PCT4) containing 7 degenerate bases at the 5' end of repeats were comprised of i) PCT2 -[5'-KKVRVRV(CT) 15 −3'] and ii) PCT4 -[5'-KKRVRVR(CT)6−3'] (where K = G/T,V = G/C/A and R = G/A) were used in PCR amplification to isolate a range of repeat motifs. PCR amplification was carried out in a total volume of 20 μl containing 50 ng of template DNA, 2.5 mM MgCl 2 , 1 X PCR buffer containing 10 mM Tris-HCl (pH = 8.0), 50 mM KCl, 1 unit Taq DNA Polymerase (Prime Taq, GCC Biotech, India) 10 pmol of degenerate primer (PCT2/PCT4) and 0.2 mM of dATP, dGTP, dCTP and dTTP. Amplification was carried out in a thermocycler (Genamp PCR 9700, Applied Biosystem), with an initial denaturation of 94˚C for 3 minutes followed by 35 cycles of 94˚C for 30 secs, 58.4˚C (PCT2)/ 54.5˚C (PCT4) for 40 secs, one minute at 72˚C and a final extension step of 72˚C for 5minutes. Successful amplification was confirmed by electrophoresing the PCR products on 0.8% agarose gel with a 1kb ladder (GCC Biotech, India) as size standard. The PCR products were afterwards purified using a PCR Purification Kit (GCC Biotech, India), cloned into pGEM1-T vector (Promega, Madison, USA) and a library was made. One hundred and twenty four random clones were then sequenced using Bigdye Terminator v3.1 Cycle Sequencing Ready Reaction kit (Thermo Fisher Corporation Inc.) on an ABI 310 genetic analyzer. Microsatellite containing sequences and natures of microsatellites were identified and analyzed using the software MISA [30] and specific primers were designed using Primer3 [31].

Marker analysis
Primer modelling was possible for 53 primers pairs, which were then synthesized (GCC Biotech, India) and checked for amplification in a set of 12 accessions (2 from each population). Twenty-eight primer pairs that consistently gave unambiguous scorable single bands in all the 12 accessions were later resynthesized with fluorescent labeling (S1 Table) on the forward primer.

Fluorescently-labeled microsatellite marker analysis
Individual PCR amplifications for each microsatellite were performed using the Genamp PCR 9700 thermocycler (Applied Biosystems.). The PCR amplification was carried out in a total volume of 10 μl containing 10 ng of template genomic DNA, 1.5 mM MgCl 2 , 1 X PCR buffer containing 10 mM Tris-HCl (pH = 8.0) and 50 mM KCl, 0.5 unit of Taq DNA Polymerase (GCC Biotech, India), two pmol of flourscent dye labeled forward primer, two pmol of unlabeled reverse primer and 0.1 mM of dATP, dGTP, dCTP and dTTP. The PCR profile was: 94 o C for 5 minutes followed by 35 cycles of 94 o C for one minute, 50.0˚C-60˚C (depending on primer) for one minute, 72 o C for two minutes, and a final extension step of 72˚C for 10minutes. The amplified PCR products were analyzed in an ABI Prism 310 genetic analyzer (Thermo Fisher Corporation Inc.) using GeneScan TM -500 LIZ as size standard.

Statistical analysis
Alleles were manually assigned to regular fluorescence peaks, and the sizes of the alleles were calibrated by the GeneMapper Software version 4.0 against GeneScan TM -500 LIZ Size Standard. Autobin v 0.9 software (Franck Salin, INRA Pierroton-UMR BIOGECO) was used to analyze the raw data. For statistical analysis, the accessions were grouped into six population viz. P1 = Arunachal Pradesh; P2 = Manipur; P3 = Nagaland, P4 = Sikkim; P5 = Darjeeling; P6 = Meghalaya according to their geographic location (state of India except Darjeeling, which is a district) of collection. Allelic patterns, analysis of molecular variance (AMOVA), principal co-ordinate analysis (PCoA) and Nei [32] genetic distances were calculated using GenAlex software [33]. Population wise expected (He) and observed (Ho) heterozygosity, f (fixation index) values were calculated using gDA v1.0 software [34]. Out crossing rate was calculated from fixation index using the equation t = (1-f)/(1+f) [1].The number of alleles/locus, major allele frequency, gene diversity and PIC values were calculated using Power marker version 3.25 software [35]. Genetic distance was calculated using C.S. Chord distances, and phylogenetic tree was constructed based on neighbor joining (NJ) method as in the Power Marker Software. The tree was viewed using the software Tree view v1.6.6 (http://taxonomy.zoology. gla.ac.uk/rod/treeview.html). Population structure was analyzed using the software STRUC-TURE V.2.3.4 [36,37] as described by Roy et al. [15] except that initially 40 replicate runs were done for K = 1-12, and the second run was done at K = 3, K = 6 and K = 8 with 40 replications. Finally, at K = 3, the membership co-efficient from the run with the highest LnP(D) value (-3940.2) was used to assign the accessions to cluster 1-3 (SP1-3) based on their highest membership co-efficient for a cluster. Accessions with <0.6 membership co-efficient for all the clusters were grouped as admixtures (SP4). For the structure clusters (SP1-4), allelic pattern, genetic distance, AMOVA and PCoA were calculated using the GenAlEx V6.5 [33].For the geographic location-based and Structure-based populations, significance of difference of allelic count (FPTest) was tested using FPTestR software [38].

Results and discussion
Characterization of microsatellite loci DNA sequencing of the clones indicated the presence of a total of 124 microsatellite repeats ( Table 2). Di-nucleotide repeats were the most abundant (79%) followed by penta-nucleotide repeats (17.7%) and tri-nucleotide repeats (3.2%). CT (48.98%) and GA (46.9%) were the most abundant type. Repeat units in the dinucleotide varied from 9-25 (both in AG). However, 16 units was the most common (40 in CT and 38 in GA). Among the pentanucleotides, TAAAA motif was the most abundant (14 of 22). A total of 124 clones yielded fragments containing 49 dinucleotides, two trinucleotides and two pentanucleotides. Primer modelling was possible for 53 microsatellite loci out of which 28 microsatellite loci were found to give unambiguous scorable bands. These were selected for further analysis. Details of the developed SSR primers with Tm, expected product size and dye labeling used for our study are given in the S1 Table. All the SSRs showed polymorphism and average alleles per locus (NA) was 6.39 (Table 3) within a range of 2 (VUCT 18) to 13 (VUCT 19). Gene diversity ranged from 0.03 (VUCT26) to 0.784 (VUAG09) with a mean of 0.532. Observed heterozygosity ranged from 0.062 (VUGA 25) to 1.00 (VUTG 21) with a mean of 0.528. Mean Results of the characterization of the 6 geographic (location of collection) populations with the microsatellite markers are given in Table 4. Highest average number of allele/locus (4.750)  Table 5. Number of alleles with a frequency of >5% were lower than the number of different alleles (alleles/locus) in the Sikkim (Pop5) and Meghalaya (Pop6) populations. Shannon's Information Index (I) and alleles specific to a population (private allele) were the highest in the Meghalaya population (1.019 and 0.929 respectively, Table 5). Primers 17,18, and 20 showed the highest number of private allele (4 each) while primer 1 and 5 showed the lowest number of private allele (1 each). Highest number of private allele was seen in the genotypes 28, 31, 60 and 64 (3 each).Principal coordinate Analysis (PCoA) showed that the 65 genotypes could be divided amongfour groups (Fig  2A). Manipur (Pop2) and Darjeeling (Pop5) populations showed clustering in two distinctly separated groups. First three axes of the coordinate described 28.32% of the variation. AMOVA showed that 98% of the variance was within population variance ( Fig 2B).
Analysis of the population using the model-based approach of the STRUCTURE software indicated the presence of structured population. Initially, the analysis was performed for evaluating 1 to 10 clusters (K) with five replicated runs per K value. 'Structure Harvester' program (http://taylor0.biology.ucla.edu/) was used to determine the optimum K value based on both LnP(D) and Evanno'sΔK [39]. Two peaks were obtained, one at K = 3 (major peak) and the other at K = 5 (S1A Fig). Considering both the peaks, a second analysis was done with K = 2 to 6, length of burnin period 10,000, number of MCMC repeats 100,000, number of replicate runs 10 and keeping other conditions the same. Analysis with the Structure Harvester program produced a minor ΔK peak at K = 3 and a major peak at K = 5 (S1B Fig). To overcome this problem, the method suggested by Gante et al. [40] was used with slight modification. For the analysis, K = 1 to 12, burn in period 5000, MCMC repeats 50,000 and replicates 40 was were used. Structure harvester produced a clear ΔK peak at K = 3 with the lowest SD value for LnP (D). Two minor peakswere also seen atK = 6 and K = 8 (S2 Fig). Three further runs with K = 3, 6 and 8 were done keeping length of burnin period 10,000, MCMC repeats 100,000 and 40 replicates. Analysis of results in the runs with the highest LnP(D) values showed that atK = 3, nearly 50% of the accessions could be assigned to one of the three SPs considering a membership coefficient of !0.6 for a SP. However, at K = 6 and K = 8, keeping the same parameter, no accession could be assigned to several SPs. Therefore, the run with the highest LnP(D) value at K = 3 was used for assignment of accessions to SPs. The LnP(D) and Evanno'sΔK method supported the presence of 3 clusters (K = 3), represented here as SP1, SP2 and SP3, respectively. The inferred population structure is given in the Fig 3. SP1 contained 16 individuals (24.6% of the total accessions), dominated by accessions from Meghalaya (5) and followed byNagaland, Sikkim, Darjeeling (3 each) and Arunachal Pradesh (2). However, no accession from Manipur was included in this cluster. SP2 contained 11 individuals (16.9% of the total) with 3 accessions each from Meghalaya and Sikkim and 2 accessions from Darjeeling. One accession each from Arunachal Pradesh, Manipur and Nagalandwere also included in this cluster. SP3 was the smallest (5 accessions) representing 7.7% of the accessions studied. Three accessions from Meghalaya and 1 each from Manipur and Nagaland were included in this cluster.
The three clusters (SP1-SP3) had population specific Fst values (as calculated in STRUC-TURE) of 0.400, 0.441 and 0.650, respectively with an average of 0.497indicating a low to moderate difference among clusters. Net nucleotide distance among SPs was highest between SP1 and SP2 (0.130) and lowest between SP1 and SP3 (0.052). At K = 3, 32 accessions were admixture (<0.6 membership co-efficient for any cluster).
Mean allele pattern across the SPs is given in Table 6. For this calculation, admixtures were treated as a separate population (SP4). Allele frequency was highest in SP1 (3.893). Shanon's information index and number of private alleles were higher in the SP3. PCoA also supported the presence of three clusters. Four prominent groups were seen (Fig 4A). Accessions of SP1 and SP2 showed clear grouping while SP3 was spread across two groups. The first three axes explained 28.32% of the variation. AMOVA showed that 85% of the variance was due to within population variance (Fig 4B). Theneighbor joining (NJ)tree (Fig 5) also supported presence of three clusters. SP1 and SP2 showed clear grouping while two accessions of SP3 grouped alongwith SP1. FPTest [38] among different groups of geographic location-based populations or SPs indicated non-significant differences in allelic counts.
In the recent years, a lot of interest has been shown in the assessment of genetic diversity in rice bean and development of molecular markers for the purpose. AFLP [41], RAPD and ISSR [12] were used earlier for assessment of genetic diversity in rice bean. Tian et al. [1] used SSRs from azukibean and Wang et al. [8] used SSRs from mung bean for assessment of genetic diversity in rice bean. In the next year Chen et al. [9] reported the first set of micro-satellite markers isolated from genes of rice bean. The same group [10] later reported isolation of genomic microsatellite markers from SSR enriched library. The present study reports application of a simple and rapid method for constructing microsatellite enriched library and subsequent isolation of microsatellites. In the present study, dinucleotide repeats were the most abundant  Rice bean genetic diversity motif (79%) as was reported earlier (81.6%) by Wang et al. [10]. However, in the study with genic SSRs [9], dinucleotide repeats were much lower (26.8%). Thus, it appears that in the genomic SSRs dinucleotide is the most abundant class. Nevertheless, the repeat motif (CT/GA) was somewhat similar to the previous reports on genic SSRs (AG/CT) of rice bean [9] and mung bean [42]. In the genomic SSRs, AC/GT was reported to be the most frequent type [10]. The proportion of the pentanucleotide class (17.7%), as seen in this studyis higher than all previous studies (0.27 in genic and 0.5% in genomic SSRs respectively). Comparison of the results obtained by us and in the previous studies suggested that both class and motif of SSRs at inter and intragenic sequences are different, which may be due to their distinctive roles at these two levels.
Average numbers of alleles per locus amplified by the primers reported here are more compared to previously reported SSRs from rice bean (6.39 alleles compared to 3 in genic and a maximum of 4 alleles in genomic SSRs) and azuki bean [43]. However, this is much lower compared to the average 12.9 alleles reported by Tian et al. [1] while using azukibean SSRs. The reason for this probably lies in the fact that rice bean SSRs were tested on a set of germplasm collected from a specific area while the azukibean SSRs were tested on a global set, including wild types. PIC values of the present set of SSRs (0.488) is also lower than the azukibean SSRs (0.57) tested on the global set [1].
The set of 65 genotypes were grouped into six populations based on their geographic location of collection, and into two regions based on proximity of collection locations. The hills of Arunachal Pradesh, Manipur, Nagaland and Meghalaya are almost adjacent and, therefore, were grouped as a region. The hills of West Bengal (Darjeeling District) and Sikkim share a common border and the collections from there were grouped as another region. The mean gene diversity observed in the present set (0.534) is comparable to the gene diversity (0.535) observed in the cultivated accession of rice bean in a previous study [1]. However, the observed heterozygosity was higher (0.525 compared to 0.134). In a previous study with the accessions from the present set, Gupta et al. [44] reported rich diversity in ten morphological characters. At the individual population level, the set from Nagaland, Manipur (which are closer to Myanmar and had free movement in pre-independent India) showed gene diversity similar to the report of Tian et al. [1] for Myanmar. The gene diversity (0.549) seen in the population from Darjeeling (bordering with East Nepal) supports the finding of Bajracharya et al. [11] who reported high diversity in the collection from Nepal. High gene diversity was also observed in the population from Manipur and Nagaland.

Nature of the accessions
Most of the rice bean cultivated in the North Eastern Himalayas are similar to wild types in morphological traits except for some from Manipur (viz. MNPL3 used in this study), which are bold seeded. However, rice bean is seldom maintained as a perineal crop. The f (fixation index) values obtained in the 6 geographic location-based populations indicate an exceptionally high out crossing rate, which needs further investigation. Furthermore, the low inter population variation and non-significant differences in allelic counts are probably because of the high outcrossing rate. It is interesting to note that the Manipur population, and the Darjeeling population mostly grouped separately in the PCoA. The Nagaland and Meghalaya populations (Pop3 and Pop6 respectively) showed a tendency to group in these two groups. Half of the Sikkim (bordering Darjeeling) population (Pop 4) also showed grouping with the Darjeeling population. Manipur is an Indian State bordering with Myanmar (Fig 1) while Darjeeling is a bordering district with East Nepal. In a previous study [1], Myanmar and Nepalese populations were reported to group separately. In the STRUCTURE grouping, almost half of the accessions from Arunachal Pradesh, Nagaland, Darjeeling and Sikkim were spread across SP1 and SP2 along with 50% of the accessions from Meghalaya. On the other hand, only one of the accession from Manipur was included in SP2. In SP3, although accessions from Manipur, Nagaland and Meghalaya were included, Arunachal, Sikkim and Darjeeling were unrepresented. It appears that the rice bean population of North Eastern Indian Himalaya is probably related with populations in Nepal and Myanmar as the collections from bordering areas grouped separately. To understand this better, another PCoA (S3 Fig) with the same data was conducted where the admixtures were further grouped into three clusters based on their highest membership coefficient (SP4 -for accessions showing highest membership coefficient for SP1 and so on). Results showed that majority of the admixtures grouped with SP1 and SP3 while a small part of it grouped with few accessions of SP2. As seen from the accession details, the pure SP2 is represented by accessions from Darjeeling, Sikkim (both bordering Nepal), Meghalaya and Nagaland. The accessions of SP2 that grouped with admixtures are 3 from Meghalaya and one from Manipur (bordering Myanmar). The part of the SP3 that grouped with admixtures were one each from Manipur, Meghalaya and Nagaland and the part that grouped with SP1 were all from Meghalaya. Seehalak et al. [41] suggested that rice bean was probably domesticated in Myanmar and Northern Thailand. Tian et al. [1] detected some unique alleles in Western Nepal and Indian Himalayan accessions, which were not detected in Thai, Myanmar or other wild accessions. They indicated a possibility that the Himalayan region may be a domestication center. Our results suggest that the accessions from North Eastern Indian hills (region between Myanmar and Eastern Nepal) are probably related to the populations of both Eastern Nepal and Myanmar. However, a further study with accessions from Nepal, Myanmar and North East India would confirm this.
In summary, the study demonstrated that 5' anchored PCR-based enrichment method is a fast and reliable method for developing new SSR markers, and for a better understanding of the nature of the admixtures identified by STRUCTURE software, PCoA is a valuable statistical tool. This is also the first comprehensive genetic analysis of rice bean accessions from North East Indian Himalayan region, which is considered as the center of diversity of many crops. The results further showed that there is a high level of genetic diversity within the accessions with exceptionally high outcrossing rate.