Massive Sorghum Collection Genotyped with SSR Markers to Enhance Use of Global Genetic Resources

Large ex situ collections require approaches for sampling manageable amounts of germplasm for in-depth characterization and use. We present here a large diversity survey in sorghum with 3367 accessions and 41 reference nuclear SSR markers. Of 19 alleles on average per locus, the largest numbers of alleles were concentrated in central and eastern Africa. Cultivated sorghum appeared structured according to geographic regions and race within region. A total of 13 groups of variable size were distinguished. The peripheral groups in western Africa, southern Africa and eastern Asia were the most homogeneous and clearly differentiated. Except for Kafir, there was little correspondence between races and marker-based groups. Bicolor, Caudatum, Durra and Guinea types were each dispersed in three groups or more. Races should therefore better be referred to as morphotypes. Wild and weedy accessions were very diverse and scattered among cultivated samples, reinforcing the idea that large gene-flow exists between the different compartments. Our study provides an entry to global sorghum germplasm collections. Our reference marker kit can serve to aggregate additional studies and enhance international collaboration. We propose a core reference set in order to facilitate integrated phenotyping experiments towards refined functional understanding of sorghum diversity.


Introduction
Crop domestication is characterised by human selection on wild species for traits useful for food production. This continuous process made possible the development of agriculture and of civilizations. While migrating, man moved together with his crops and spread agriculture worldwide. It led to global development as well as occasional harsh competitions. While many industrial crops have a recent domestication history intermingled with that of colonization, food crops present distributions that have little relation with their domestication place.
Recent global constraints create a new, threatening situation: plant breeding now faces unprecedented challenges that call for global cooperation. Plant genetic resources hold the raw material for future improvement and adaptation. They bear thousands of years of genetic adaptation to multiple conditions and human uses. At a time when 1) food security is dramatically challenged by population growth, shortage of input supply and climate change, and 2) genomic tools and methodologies bring unprecedented capacities for scientific investigation, these resources are and will remain a strategic stake, a matter of competition as well as cooperation.
Sorghum [Sorghum bicolor (L.) Moench, 2n = 2x = 20] is the fifth most important cereal crop in the world. Its use as staple food and fodder confers on it the status of a 'failsafe' crop in global agroecosystems. It is widely adapted to harsh environmental conditions, and more specifically to arid and semi-arid regions of the world. It is currently a model crop for tropical grasses that employ C4 photosynthesis because of the availability of its complete genome sequence ([1], [2]; http://genome.jgi-psf.org/Sorbi1/Sorbi1.info.html).
There are several identified collections of sorghum genetic resources (for example core-collections [3] [4], US converted tropical and breeding lines described in [5], US sweet sorghum collection [6], mutant populations [7], Japanese collection [8], as well as accessions available at ICRISAT). Sorghum's center of diversity lies in the northeastern quadrant of Africa and it is thought sorghum was domesticated there over 5,000 years before present [9]. Based on spikelet and grain morphology, Harlan and de Wet [9] developed a simplified classification of traditional sorghum cultivars into five basic races: Bicolor (B), Caudatum (C), Durra (D), Guinea (G) and Kafir (K), and ten intermediate races (in all pair-wise combinations of basic races).
Given the large size of the available collections and the diversity of interests expressed in the various studies, we undertook this study to provide better insight into global sorghum genetic diversity and to set a reference that can attract interest, stimulate cooperation and coordination, and enhance interactions and connections among all initiatives. A large collection of sorghum (global composite germplasm collection, GCGC) including over 3300 accessions was thus genotyped with highly polymorphic markers (41 SSRs) providing coverage across all 10 chromosome pairs in the nuclear genome of Sorghum bicolor. This was performed within the framework of the Generation Challenge Programme (GCP, www.generationcp.org). It may provide a foundation for more efficient management and utilization of available genetic resources in this crop, as well as a tool for mining alleles of genes controlling important agronomic traits.

Plant Material
Sorghum material studied was mainly selected from ICRISAT's collection ([4]), since ICRISAT has one of the largest crop germplasm collections held in trust by the Consultative Group for International Agricultural Research (CGIAR). ICRISAT's collection includes germplasm of staple food crops of the semi-arid tropics including sorghum, pearl millet, groundnut, pigeonpea, chickpea and several small millets (foxtail millet, finger millet, etc). Chinese material was under-represented in ICRISAT's collection, so it was complemented with material provided by CAAS. The material also included a previously defined and extensively studied core collection, mainly from ICRISAT's collection ([3]). A total of 3367 sorghum accessions were thus studied in this paper, representing cross-compatible sorghum germplasm of broad initial taxonomic status (passport information available in Table S1). This GCP sorghum GCGC included 280 breeding lines and elite cultivars from public sorghum breeding programs, 68 wild and weedy accessions, and over 3000 landrace accessions from collections held by CIRAD or ICRISAT that were selected from previously defined core collections ([3], [4]), for resistance to various biotic stresses, and/or for variation in other agronomic and quality traits. All three labs, CAAS-China, CIRAD-France and ICRISAT-India, contributed accessions to the study. CIRAD contributed 225 well-characterized genotypes that constitute a mini-core collection representing a very broad range of diversity [11], CAAS contributed 250 accessions comprising sweet sorghums, grain sorghums and glutinous sorghums from China, and the remaining accessions were contributed by ICRISAT. All accessions from this sorghum GCGC collection are publicly available, except the 250 provided by CAAS.
This collection included representation of all 5 basic races of cultivated sorghum [Bicolor (B), Caudatum (C), Durra (D), Guinea (G) and Kafir (K)] and their ten intermediate races, collected from different parts of the world (Table 1). Altogether, about one third of the accessions (1159) belonged to the ten intermediate races, while the most represented basic races were Durra (651 accessions) and Caudatum (577 accessions).

DNA Extraction
DNA extraction was carried out in the labs contributing the sorghum entries to this study, with a single representative plant providing the DNA for each accession, following a protocol described by [36] for accessions contributed by ICRISAT and as described in [37] for accessions contributed by CIRAD and CAAS. Extracted DNA samples were exchanged between the labs for SSR marker genotyping.
Markers gpsb069, gpsb148, gpsb151, Xcup62 and Xtxp295 were genotyped at CAAS according to the same protocol used at ICRISAT, except that amplification products, along with the ROX-400 size standard, were separated by capillary electrophoresis in single-marker runs.
In all three labs, three control panel DNA samples were used as standard checks ([24], http://sorghum.cirad.fr/SSR_kit) in every PCR and electrophoresis run to facilitate accurate allele calling.

Data Analysis
SSR markers used in this study showed high reproducibility in PCR amplification and ABI/Licor runs based on the allele sizes produced by control panel entries that were included in every PCR run. SagaGT software (Licor, USA) was used for allele scoring for the markers genotyped at CIRAD. At ICRISAT and CAAS, fragment analysis of PCR products was carried out using GeneScan and Genotyper 3.7 software packages (Applied Biosystems, USA). PCR amplicon sizes were scored in base pairs (bp) based on migration relative to the internal ROX-400 size standard. At ICRISAT these raw allele calls were further processed through the AlleloBin software program (available at http://www.icrisat.org/bt-software-d-allelobin.htm) to provide adjusted allele calls. AlleloBin uses a standard repeat motif length (following the step-wise mutation model [38]) and a least squares algorithm to call allele sizes to integer values as suggested by Idury and Cardon [39], adjusting for imperfections in the co-migration of size standards and PCR products.
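The binning step described above can be illustrated with a minimal sketch. It is not the AlleloBin implementation: the function name `bin_alleles`, the grid search over the phase offset and the example sizes are all assumptions made for illustration. The idea is only to show how raw fragment sizes can be snapped, in a least-squares sense, to a ladder of integer sizes spaced by the repeat motif length.

```python
import math

def bin_alleles(raw_sizes, motif_len, steps=100):
    """Snap raw fragment sizes (bp) to a ladder spaced by the repeat motif length.

    Hypothetical helper: the ladder phase (offset) is chosen by a simple grid
    search minimising the summed squared deviation of raw sizes from the ladder.
    """
    best_offset, best_cost = 0.0, float("inf")
    for i in range(steps):
        offset = motif_len * i / steps  # candidate ladder phase
        cost = 0.0
        for size in raw_sizes:
            k = round((size - offset) / motif_len)  # nearest rung
            cost += (size - (offset + k * motif_len)) ** 2
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    called = []
    for size in raw_sizes:
        k = round((size - best_offset) / motif_len)
        # Round half up so that every rung is rounded the same way.
        called.append(math.floor(best_offset + k * motif_len + 0.5))
    return called

# Illustrative raw sizes for a dinucleotide repeat (made-up values).
calls = bin_alleles([150.2, 152.1, 153.9, 150.4, 156.3], motif_len=2)
```

Because all calls share one ladder, any two adjusted allele sizes differ by a whole number of repeat units, which is the property the step-wise mutation model assumes.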
Marker data for 7 SSR markers (gpsb069, gpsp089, gpsb148, gpsb151, Xcup62, Xtxp295 and Xtxp33) were removed from the final analysis due to incomplete data or low quality genotyping. Finally, 3367 accessions were retained for further analysis across 41 markers (Table 1). Data files were assembled in a database (Sagacity v.10, Rami, in preparation) and allele sizes were checked for congruency and adjusted according to the allelic references provided in the SSR kit [24].
Descriptors of observed genetic diversity, such as allele number per marker, observed heterozygosity (Ho) and gene diversity (expected heterozygosity, He) were calculated using PowerMarker v3.25 software [40]. Allelic richness and private alleles by locus were estimated using ADZE software [41]. Genetic distance between groups, estimated by FST statistics, was calculated with the hierfstat R package [42]. Mann-Whitney (MW) tests were used to determine whether estimates were significantly different between groups.
To identify the pair-wise genetic relationships between the accessions of this sorghum global composite germplasm collection, a genetic dissimilarity matrix was calculated using simple matching with DARwin v5 software [43] (available at http://darwin.cirad.fr/darwin/Home.php). An overall representation of the diversity structure was obtained by a factorial analysis using the distance matrix, while individual relations were analyzed with a tree construction based on the Neighbor Joining (NJ) method, as implemented in DARwin v5.
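As an illustration of the per-marker descriptors named above, the following sketch computes allele number, gene diversity (He) and observed heterozygosity (Ho) for one marker. It uses the uncorrected textbook definitions (He = 1 minus the sum of squared allele frequencies); PowerMarker may apply sample-size or inbreeding corrections that are not reproduced here, and the function name `locus_diversity` and the example genotypes are assumptions.

```python
from collections import Counter

def locus_diversity(genotypes):
    """Return (allele number, He, Ho) for one marker.

    genotypes: list of (allele_a, allele_b) tuples, with None for missing data.
    """
    scored = [g for g in genotypes if g is not None]
    counts = Counter(allele for g in scored for allele in g)
    total = sum(counts.values())
    # Gene diversity: probability that two randomly drawn alleles differ.
    he = 1.0 - sum((c / total) ** 2 for c in counts.values())
    # Observed heterozygosity: share of scored accessions carrying two alleles.
    ho = sum(1 for a, b in scored if a != b) / len(scored)
    return len(counts), he, ho

# Made-up genotypes for four accessions at one SSR marker.
n_alleles, he, ho = locus_diversity([(150, 150), (150, 150), (152, 152), (150, 152)])
```

A marker with many alleles can still show a moderate He when most alleles are rare, since He depends on squared frequencies rather than on the allele count.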
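A simple-matching dissimilarity for allelic data can be sketched as follows. This is a plain reading of the simple-matching idea (one minus the average proportion of matching alleles per locus); the exact handling of missing data and ploidy in DARwin is not reproduced, and the function name and example genotypes are hypothetical.

```python
def simple_matching(g1, g2, ploidy=2):
    """Dissimilarity between two accessions from multi-locus SSR genotypes.

    g1, g2: lists with one allele tuple per locus; None marks missing data.
    Returns None when the two accessions share no scored locus.
    """
    shared_loci, matches = 0, 0.0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # assumption: loci missing in either accession are skipped
        remaining = list(b)
        m = 0
        for allele in a:
            if allele in remaining:
                remaining.remove(allele)  # each allele can match only once
                m += 1
        matches += m / ploidy
        shared_loci += 1
    if shared_loci == 0:
        return None
    return 1.0 - matches / shared_loci

# Two made-up accessions scored at three loci.
d = simple_matching([(150, 150), (200, 202), (300, 300)],
                    [(150, 150), (200, 200), (310, 310)])
```

Applying this function to every pair of accessions yields the square dissimilarity matrix that feeds both the factorial analysis and the NJ tree.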
In order to test for sample clustering in conjunction with admixture between sub-groups, Bayesian statistics based on a Markov chain Monte Carlo algorithm were used. Although the Instruct software package [44] was developed specifically to handle species with a high level of inbreeding, as expected for sorghum, it was not used here because it cannot handle such a large number of samples. STRUCTURE software v.2.3.3 [45] was thus preferred. One hundred replicates were performed for each K, the number of clusters considered. Each run used a burn-in period of 100,000 iterations followed by 200,000 iterations. For each K, the 10 runs presenting the highest maximum likelihood value were kept, and sample assignment to groups was performed with CLUMPP software (up to K = 6, greedy algorithm, 1000 repeats; above K = 6, large K greedy algorithm, 1000 permutations) in order to deal with label switching or multimodality. The best cluster number was estimated following [46] with an R (http://www.r-project.org/) script modified from [47]. It was compared with the information given by each cluster, and was identified as the value of K beyond which no individual presented a majority of ancestry in a new cluster (threshold 0.7). Genome plot representations were performed using a specifically developed R script (available upon request).
A Reference Set of 383 sorghum accessions including S. bicolor subspecies bicolor and wild S. bicolor subspecies verticilliflorum was chosen among the publicly available accessions to best represent genetic diversity as well as geographic origins. The Maximum Length Subtree function of DARwin v5 software [43] was used to deal with genetic diversity. It is based on successive elimination of samples, each eliminated sample presenting a minimal reduction of overall diversity, measured as branch length of a tree.
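The ancestry threshold used with the STRUCTURE output can be sketched as below. The data layout (a dictionary of membership coefficients per accession), the function names and the accession labels are assumptions made for illustration, and whether the 0.7 threshold is inclusive is not stated in the text; the sketch treats it as inclusive.

```python
def assign_groups(q_matrix, threshold=0.7):
    """Assign each accession to its dominant cluster, or None if admixed.

    q_matrix: {accession: [membership coefficient per cluster]} (assumed layout).
    """
    assignment = {}
    for accession, q in q_matrix.items():
        best = max(range(len(q)), key=q.__getitem__)
        assignment[accession] = best if q[best] >= threshold else None
    return assignment

def populated_clusters(q_matrix, threshold=0.7):
    """Clusters holding at least one accession above the ancestry threshold."""
    assigned = assign_groups(q_matrix, threshold).values()
    return {cluster for cluster in assigned if cluster is not None}

# Hypothetical membership coefficients for K = 3.
q = {
    "acc_A": [0.92, 0.05, 0.03],
    "acc_B": [0.10, 0.75, 0.15],
    "acc_C": [0.45, 0.35, 0.20],  # no cluster reaches 0.7: left unassigned
}
groups = assign_groups(q)
informative = populated_clusters(q)
```

Under this reading, K would stop being increased once the newly added cluster no longer appears in `populated_clusters`, and accessions mapped to None correspond to the unassigned materials.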
Since in the GCGC collections, phenotyping data were already available on a subset of diverse accessions ( [11], [4]), this subset was first analyzed to reduce redundancy. Widely used breeding lines completed it. A first run of completion of these accessions was performed on S. bicolor only, checking that all geographic origins are conserved. The same process was performed for wild accessions, and both datasets were merged to represent the Sorghum Reference Set.

Global Variation
Level of polymorphism. All 41 SSR markers used detected polymorphism in the sorghum GCGC. A total of 783 SSR marker alleles were detected, with an average of 19.2 alleles per marker. Numbers of alleles per marker ranged from three (Xtxp136) to 39 (SbAGB02). Missing data averaged 3.44% (Table 2).
A mean gene diversity (expected heterozygosity, He) of 0.67 was observed across the sorghum global composite collection, with values ranging from 0.24 (mSbCIR246) to 0.94 (Sb5-206) for individual markers (Table 2). Even though SbAGB02 produced the highest number of alleles (39), it presented an intermediate He value of 0.67 because 92% of these alleles can be considered as rare (74% below 1% frequency). With the exception of mSbCIR248, which had an unusually high observed heterozygosity (Ho) value of 0.23, the Ho values ranged from 0.01 (mSbCIR246) to 0.06 (Xtxp015) with a mean of 0.03. Its outstanding Ho value suggests that marker mSbCIR248 may have detected more than one polymorphic locus, but this is not confirmed yet by in-silico hybridisation to the complete reference sorghum sequence.
Allelic distributions among taxonomic components. Allele number distribution and genetic diversity in the sorghum GCGC according to biological status, race, and geographic origin are reported in Table 3. All 41 SSR markers used detected polymorphism in all compartments. The 3013 landrace accessions (87% of total accessions) contributed 94% of the SSR marker alleles detected, while breeding lines (including advanced cultivars, 280 accessions, 8%) and wild and weedy accessions (68 entries, 2%) captured 57% and 65% of the detected alleles, respectively. Allelic richness at a standardized sample size of 100 haploid genomes showed that breeding lines tended to present less genetic diversity than landraces and wild samples, and that wild samples appeared more diverse, although these differences were not significant (MW test, P = 0.15 for the breeding-landrace comparison and P = 0.08 for the landrace-wild comparison). The trend is confirmed for private alleles (MW test, P<0.05 and P<0.01, respectively), with three times more private alleles in wild and weedy samples than in landraces (3.25 vs 1.04) and larger average expected heterozygosity values (MW test, P = 0.017).
Except for Kafir, the basic races exhibited no significant difference in allele numbers per marker. Kafir presented the smallest numbers of alleles per marker and private alleles (almost 3 alleles per marker less than the four others, MW tests, P<0.001) and a lower genetic diversity (He = 0.41 versus He of 0.60-0.67 for the other four basic races). The Guinea race encompassed the Guinea margaritiferum (Gma) accessions (at least 12), for which two markers (mSbCIR240 and Xcup53) were found to be monomorphic, whereas allelic richness at the same sample size in all races, including other Guineas, ranged 1.58-7.02 and 1.56-3.04, respectively.
The highest number of alleles, 680 (86.8%), was detected among the accessions of African origin. When correcting for sample sizes at the continent level, North American accessions (all originally introduced from elsewhere, or derived from such introduced materials) tended to be more diverse both in terms of total numbers and private alleles, but the MW tests were not conclusive. In Africa, Eastern Africa exhibited the largest gene diversity, followed by Central Africa, while Southern Africa was the poorest (MW test, P = 0.02). In Asia, Middle East origins presented a higher genetic diversity than India and East Asia (MW test, private alleles, P = 0.05).
Among the 41 SSR markers analyzed, 17 markers produced alleles unique to wild/weedy accessions, three (mSbCIR276, Xisep0107 and Xtxp136) for cultivated accessions, and Xisep0310 did not detect alleles unique to either the cultivated or wild/weedy accessions. Among these 17 SSRs, eight markers (gpsb067, gpsb123, mSbCIR223, mSbCIR238, Sb5-206, Xcup02, Xcup53 and Xtxp265) detected only one allele unique to wild/weedy accessions and a maximum of six such alleles were detected for marker Xtxp273. Out of the 68 wild/weedy accessions included in this study, 37 accessions produced these 40 alleles that were not detected in cultivated accessions. Wild accession IS 18931 alone contributed six alleles that were not found among the cultivated accessions and IS 18818 (of the aethiopicum group within S. bicolor subspecies verticilliflorum) contributed five such alleles. Three alleles that were not detected among the cultivated accessions were detected in the only accession of S. propinquum (IS 18933) included in this global composite germplasm collection. Among the 3299 cultivated accessions, 40 of 41 SSR markers detected alleles not found among the 68 wild/weedy accessions. This is probably related to differences in sample size and to the fact that the SSR markers used in this study were chosen for their genome-wide distribution, based on existing maps built from crosses of cultivated accessions only, thus representing a diversity compartment different from wild/weedy entries.
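Standardizing allelic richness to a common sample size is usually done by rarefaction. The sketch below uses the standard rarefaction formula for the expected number of distinct alleles in a random subsample of g gene copies; it is offered as an illustration of the principle rather than as a reproduction of the ADZE code, and the function name and the allele counts are assumptions.

```python
from math import comb

def rarefied_richness(allele_counts, g):
    """Expected number of distinct alleles in a random subsample of g gene copies.

    allele_counts: number of copies observed for each allele at one locus.
    """
    n = sum(allele_counts)
    if g > n:
        raise ValueError("subsample size exceeds the number of gene copies")
    # For each allele: 1 minus the probability that it is absent from the subsample.
    return sum(1 - comb(n - c, g) / comb(n, g) for c in allele_counts)

# Made-up locus with one common and several rare alleles, 200 gene copies in total.
richness_at_100 = rarefied_richness([180, 10, 5, 3, 2], g=100)
```

With g = 100, as in the text, groups of very different sizes (for example 68 wild and weedy accessions versus more than 3000 landraces) can be compared on the same footing.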
The largest number of alleles unique to cultivated accessions was detected for mSbCIR240, for which 24 out of 35 alleles detected in the global collection were detected only in cultivated accessions, but no alleles of this marker were detected only in wild/weedy accessions. The overall frequency of rare marker alleles in the sorghum GCGC was very high. Across the 3367 accessions, 428 rare alleles (54.2%) below 1% frequency and 621 rare alleles (78.7%) below 5% frequency were detected.

Patterns of Multi-locus Diversity
Factorial analysis. Factorial analysis (FA) of the SSR-based dissimilarity matrix of the complete sorghum GCGC (3367 accessions) showed that the first four axes were to be considered (see plot in Figure S1). The first axis enabled the separation of accessions collected in Africa versus more eastern origins (including some of eastern Africa) (6.05% of the global inertia) (Figure 1). The second (4.09%) and third axes (2.92%) refined the situation of Africa by separating southern Africa and western Africa from central and eastern Africa. Finally the fourth axis (2.35%) enabled the separation of origins from the Indian subcontinent, the Middle East, and eastern Asia. The reference to the racial classification (Figure 2)
Table 3. Genetic diversity in the sorghum Global Composite Germplasm Collection (GCGC) and in the Reference Set, partitioned into biological status, races and geographic origins as indicated in passport data.
Neighbor joining analysis. The NJ dendrogram representation on all samples revealed global congruence with the Bayesian assignment with a few apparent discrepancies (Figure 4). The main discrepancies were the splits of Group 5 and Group 9 into distinct dendrogram sectors. Within Group 5, this split corresponded well with a Bicolor vs Guinea differentiation and led to the distinction of 5a and 5b. Group 9 split into three components 9a, 9b and 9c, 9a and 9b being essentially made of Guinea varieties from South Asia and eastern and southern Africa, respectively, and 9c made of a few Caudatum varieties from eastern Africa. The NJ analysis also shed light on an array of unclassified accessions in the periphery of groups 1 and 2, consisting predominantly of Durra and DC accessions from the Middle East. Group 3 was also challenged by the NJ representation, with most accessions in one dendrogram sector but several of them in another; the size of this group was, however, too small to justify internal sub-divisions.
Distribution of taxonomic components. The classification derived from the STRUCTURE analysis complemented by the NJ dendrogram enabled analyzing the distribution of the various a priori taxonomic components. The NJ dendrogram further helped locating all the unassigned materials in relation to the groups that it supported or revealed.
Wild and weedy sorghum accessions were mainly found in four dendrogram sectors (Figure 4). Almost two-thirds (40) of accessions of S. bicolor subspecies verticilliflorum (belonging to races aethiopicum, arundinaceum, verticilliflorum, and virgatum) of diverse origins, as well as weedy intermediate S. bicolor subspecies drummondii, clustered around Group 5b. A separate group of drummondii and verticilliflorum accessions from eastern Africa was observed around Group 9c, associated with cultivated materials from Sudan and Uganda.
Another group of drummondii accessions from Tanzania, Kenya and Zimbabwe was clustered around Group 9b materials from southern and eastern Africa. Finally, a group of wild and weedy accessions from eastern Africa clustered around Group 4 in close proximity to intermediate race Durra-Bicolor accessions from that region.
The 195 accessions classified as race Bicolor were scattered across many dendrogram sectors and no distinct Bicolor cluster was observed, other than Group 5a, comprised of accessions specifically collected to represent "broom sorghum". However, the periphery of Groups 1, 2, 4, 5b, 9a and 10 appeared Bicolor-enriched. The four Bicolor accessions close to Group 5b fell among wild/weedy accessions.
Guinea accessions were mainly grouped into four separate dendrogram sections (Figures 3 and 4). Some Guinea accessions, mainly roxburghii sub-race materials from the Indian subcontinent and southern Africa, were in Group 9a. A large number of Guinea race accessions from southern Africa (mainly of the conspicuum and roxburghii sub-race materials from Tanzania and Malawi) were clustered in Group 9b. Another large cluster of Guinea race accessions, mainly from western Africa (Mali, Ghana, Nigeria, Burkina Faso, etc.) and including sub-races gambicum and guineense, were found in Group 6. Accessions of the margaritiferum (Gma) subrace from western Africa formed a separate Group 5b in close association with wild and weedy accessions.
Caudatum race accessions (577) were broadly dispersed. The vast majority originated from eastern Africa and grouped in and around Groups 7 and 9c. The others followed a geographic organization, with accessions from China in Group 1 and accessions from western Africa and southern Africa in Groups 6 and 10, respectively.
The Durra race was the most widely represented in the GCGC (656 accessions). Most were distributed across several major clusters, with a strong geographical organization. Most Durra accessions from the Indian subcontinent were in Group 2 along with related intermediate materials from that region. Accessions from eastern Asia (mostly from China) were found in Group 1 and accessions from the Middle East and eastern Africa fell in the components of Group 3, while smaller numbers of Durra accessions were in the periphery of Groups 6, 7 and 9c. Interestingly, five Durra accessions clustered with wild/weedy accessions in the vicinity of Group 5b.
The Kafir accessions (239) were mostly from southern Africa and fell in Group 10, together with Kafir-Caudatum and Kafir-Durra accessions from the same region.
The majority of intermediate race accessions were grouped according to their geographic origin. Guinea-Caudatum (GC) was the most common (361 accessions) and was scattered across all NJ sectors, with a majority in the vicinity of Group 7. Durra-Caudatum (DC) was the next most common intermediate race (330 accessions), and was geographically distributed around Group 6 (western Africa) and around Groups 1 and 2 (Mediterranean Basin and the Middle East). Caudatum-Bicolor (CB) accessions were predominantly from eastern Asia and fell in and around Group 1 whereas Durra-Bicolor (DB) accessions from the Indian subcontinent and eastern Africa fell in and around Groups 2 and 4, respectively. Ten intermediate race accessions grouped with wild/weedy accessions close to Group 5b.
A total of 430 trait-specific accessions were included in the sorghum GCGC. Many of them were classified as race Caudatum, including accessions resistant to downy mildew, which were clustered according to their origins in Groups 2, 6, 7 and 9c. Stem borer resistant genotypes of race Durra from the Indian subcontinent and Africa were grouped together in Group 2. Genotypes with the capacity to germinate through crusted soil were found in various groups in accordance with their origin and race. Most midge resistant genotypes were found in Group 7. Most of the sweet stalk sorghums that are of increasing interest globally were observed to have Caudatum race background and fell into Group 7. Broom sorghum accessions of race Bicolor from USA formed a specific single Group 5a, whereas all pop sorghum accessions belonging to race Guinea from the Indian subcontinent grouped together in Group 9a. The latter two groups are both small in size and might actually exist because of an overrepresentation of specialty sorghums gathered for a targeted purpose and resting on a narrow genetic basis.

Global Differentiation Pattern
The differentiation between all the components derived from the comparison of both classification methods was assessed using the FST estimate (Table S2 and Figure 5b). Pairwise FST estimates between the 13 groups identified were all significantly different from zero and varied from 0.130 to 0.531, with a mean value of 0.378.
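To make the pairwise differentiation measure concrete, the sketch below computes a simple heterozygosity-based FST, (HT - HS) / HT, between two groups from per-locus allele frequencies. This is a deliberately simplified estimator chosen for readability: it ignores sample-size corrections and is not claimed to be the estimator implemented in hierfstat. Function names and example frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity from an {allele: frequency} mapping."""
    return 1.0 - sum(p * p for p in freqs.values())

def pairwise_fst(freqs_a, freqs_b):
    """Simplified FST between two groups, pooled over loci as a ratio of sums.

    freqs_a, freqs_b: lists holding one {allele: frequency} dict per locus.
    """
    hs_sum = ht_sum = 0.0
    for fa, fb in zip(freqs_a, freqs_b):
        alleles = set(fa) | set(fb)
        # Unweighted mean allele frequencies across the two groups.
        pooled = {a: (fa.get(a, 0.0) + fb.get(a, 0.0)) / 2 for a in alleles}
        hs_sum += (gene_diversity(fa) + gene_diversity(fb)) / 2  # within groups
        ht_sum += gene_diversity(pooled)                          # total
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else 0.0

# One made-up locus: group A is polymorphic, group B is fixed for allele 150.
fst = pairwise_fst([{150: 0.5, 152: 0.5}], [{150: 1.0}])
```

Values near zero indicate groups sharing similar allele frequencies, while values approaching one indicate groups fixed for different alleles.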
The relationships based on the final groups, their mutual differentiations measured with the FST estimate, and the distribution of the various races and intermediates in the NJ dendrogram are summarized in Figure 5.
With the exception of Groups 5a and 9a, sorghum genetic diversity appears organized along a limited number of clearly differentiated groups in the West (Guinea-dominated, yet clearly different from one another, Groups 5b and 6), in the South (Kafir-dominated Group 10), in the East (multiracial Groups 1, 2 and 9a) and in the Center (Durra/Bicolor Group 4 and Durra-dominated Group 3), within a background that appears as a broad swarm in central and eastern Africa (weak structure between Groups 7, 8 and 9) with a frequent reference to the Caudatum race component.

Reference Set of Sorghum
A core reference set with 383 accessions was selected to capture the global genetic diversity of sorghum (Table 3). It includes 332 landraces, 28 breeding lines and 23 wild/weedy accessions, all five cultivated basic races, the 10 intermediate races and accessions of all different geographic origins except South America. It represents the global genetic diversity present in the sorghum GCGC (Figure 6). This sorghum reference set captured 78.3% (613 alleles) of the SSR alleles detected in the GCGC, with an average of 14.9 alleles per SSR primer pair (Table 3), comparable to standardized allelic richness of the GCGC. For markers mSbCIR306 and Xisep0310, all alleles (5 and 10 alleles, respectively) detected in the GCGC were captured in the reference set. Average gene diversity (0.71) in the reference set is slightly larger than for the GCGC. Clustering of accessions in the reference set follows the pattern of race within geographic origin described above for the GCGC. In the case of the Gma sub-race, 11 of 12 accessions included in the global composite germplasm collection (all from western Africa) were captured in the reference set.

Discussion
Maintenance and characterization of large germplasm collections is a huge task. Knowledge of the characteristics of the materials is essential for their efficient management. Both genetic and morpho-agronomical characterizations are required for breeders to better understand and use the available genetic resources. This increases the efficiency of selection of more diverse, adapted, germplasm parents in crop improvement programs. To serve as an entry point to large collections, representative subsets (often referred to as core or minicore collections) provide an economically and logistically attractive option for both gene banks and the breeding programs they serve. However, it is very important that such core collections represent the full range of diversity available at the time of the study. In this context, we used SSR markers to ascertain the population structure of a very large set of sorghum germplasm, in the framework of an international project (the Generation Challenge Programme), consisting of accessions assumed to be representative of global germplasm available for improvement of this crop. This set was used to fine-tune and complete previous knowledge on the evolutionary history and domestication pattern of sorghum. Using this information, a representative subset of this collection was chosen, of a more convenient size for detailed characterization of traits of economic importance to plant breeding programs and for the assessment of allelic diversity in genes associated with variation in such traits.
Figure 5. [...] Figure 4 are drawn according to their geographical distribution and the predominant race(s) is (are) indicated. The groups are framed differently to reflect their higher (thicker frame) or lower (thin, dotted frame) levels of differentiation as estimated through the FST parameter and the distribution of intermediates. Group 5a actually originates from a collection in USA. b-NJ dendrogram of FST distances between groups identified as in Figure 4. c-Pure races and main regions are predominantly featured, but the intermediate types or regions fall in continuity with this landscape (dotted lines). Races are framed differently to reflect their higher (thicker frame) or lower (thin, dotted frame) levels of differentiation as estimated through the FST parameter as in Figure 5a. doi:10.1371/journal.pone.0059714.g005

A Species-wide Scan Assessment of Neutral Genetic Diversity
Breadth of variation. To our knowledge, this is the largest study undertaken in a systematic way for exploring genetic diversity in global crop germplasm. The broad plant material coverage resulted in larger allele numbers (average of 19.2 alleles per locus) and higher diversity parameters than in most previous studies ([48], [49], [50], [51], [32]). It was comparable to the features reported in a focus study on Niger by [30], [52]. The same is true when considering each race separately.
The mean observed heterozygosity (Ho) was 0.037, indicating that most markers used detected only one allele per accession, and that the accessions are highly inbred, as expected for accessions of a largely self-pollinated species maintained in collections by enforced selfing. This comparison is notably different when using samples directly derived from landraces, e.g. 0.11 in a Cameroon village [29] or 0.09 in a mix of Guinea race accessions [35].
Relevance and distribution of taxonomic components. The high level of genetic diversity in sorghum is thought to be due to multiple origins for domesticated sorghum, intermating between products of these independent domestication events, and continued gene flow between wild and cultivated sorghums [53]. In this study we found substantial evidence of sorghum population structure based on geographic origin and race within geographic origin. This is congruent with previous studies with RFLP markers [11], SSRs [30], SSRs and SNPs [54], and also with recently developed DArT markers ( [55], [22]). Yet the structure we observed led us to propose a schematic representation of population structure in sorghum ( Figure 5). The periphery harbors types that are more clearly differentiated and more homogeneous. The center harbors more diverse types, with many more intermediates and a concentration of wild types that appear related to several cultivated forms. Among the cultivated accessions, there is hardly any coincidence between a race and a group based on markers, with the single exception of Kafir in Southern Africa. The ''races'' might better be referred to as morphotypes, or at least consider that races could encompass different morphotypes. The 68 wild and weedy accessions presented the highest gene diversity and private allele numbers. The majority of wild/weedy sorghum felled in the periphery of Group 5b, but as previously discussed, they were not definitely assigned. The other wild and weedy accessions were distributed, yet on long branches (see Figure S3), in three other dendrogram sectors predominated by cultivated accessions. This contrasts with Aldrich and Doebley's (1992) results [12], who found a clear separation between the two compartments using RFLP markers, as well as Casa et al. 2005 [50] who confirmed this fact with SSR markers. Clearly, the exploration of diversity in a broader representation of wild sorghum is necessary. 
Nevertheless, the broad distribution of wild and weedy accessions throughout the cultivated sorghum diversity patterns adds evidence to a body of results (including [9], [56], [57], [58], [31]) suggesting that there is considerable exchange of genetic material (gene flow) between cultivated and wild accessions.
A global interpretation of sorghum genetic diversity. Altogether, the geographical pattern of differentiation, the limited congruence between the marker-based classification and the racial classification based on morpho-agronomic traits, and the likely occurrence of profuse gene flow argue for a diversity pattern largely determined by 1) geographical radiation in various directions from the center of origin, with both differential drift among lineages and possibly novel variation selected along the process, 2) common gene exchange among landraces and local wild types, ensuring population dynamics, and 3) selection for race-related trait associations responsible for phenotypic convergence between genetically differentiated sub-populations. Germplasm introduction explains the diversity of the materials contributed to the sorghum GCGC from North America, whereas loss of alleles due to drift appears to have contributed to the reduced diversity observed among samples from India and East Asia compared to those from Africa, with the latter contributing to the observed groupings. In this scenario, it is likely that the genes that underlie the morphological differences between the most typical morphotypes are few in number and display visible polymorphism across geographically differentiated groups. This scenario will be testable when whole-genome genotyping is available in sorghum and may reveal footprints of natural and anthropogenic selection along the genome.

Community Resources
Data. The data generated in the present study were deposited in the GCP central registry (http://generationcp.org/research/research-themes/crop-information-systems, using Sorghum as a 'crop' filter, file G2005-01c_Sb_3393accX41SSR_V2.xlsx) and are accessible to the global community. They come in addition to passport data that are available in the germplasm banks and occasional evaluation data that may have been produced as part of searches for donors of specific traits to be used in breeding programs. The data can serve as a reference since they were obtained with an easily accessible kit of markers [24] that can be used on any new material for comparison.
Reference set of sorghum. We used the marker data and population structure of the sorghum GCGC from this study to identify a much smaller representative subset of accessions, called the 'Reference Set'. This Reference Set provides an entry point to sorghum germplasm globally, to identify geographic regions and racial subgroups from which sorghum accessions exhibiting interesting variability in a particular trait can be found. The general value of an internationally agreed set of representative germplasm to serve as a common reference for focussing characterization has been highlighted elsewhere [59].
This proposed Reference Set consists of 383 accessions and includes important germplasm lines used in crop breeding programs, wild accessions and a mini-core collection of genetically diverse accessions for which considerable phenotypic data is already available. Five basic morphological types, ten intermediate ones and wild/weedy accessions from nearly all geographic origins were captured in this sorghum Reference Set. This set represents most of the genetic diversity present in the GCP sorghum Global Composite Germplasm Collection, with all assignment groups and clusters represented. It has a population structure similar to that discussed above for the sorghum GCGC, yet with less redundancy in highly populated narrow clusters. Compared to previously described subsets ([5], used e.g. in [54]), which include converted lines with photoperiod-insensitivity and dwarfing genes, this reference set includes all types of material, enabling breeding choice in Africa. Besides, it also includes wild samples, is more balanced in terms of initial racial classification (proportionally more Guinea and less Caudatum), and represents all geographical origins (and, correlatively with racial composition, represents West Africa better). Seeds are maintained by ICRISAT and available upon request. All passport data published in the System Wide Information Network on Genetic Resources (SINGER), including Sorghum, are available in Genesys (http://www.genesys-pgr.org/), which aims at being the global information system on the germplasm held ex situ.

Figure 6. Selected sorghum Reference Set (383 accessions, in black) in relation with hierarchical NJ cluster analysis of 3367 sorghum accessions of a global composite germplasm collection based on allelic data from 41 SSR markers (simple matching distance). Accessions grouped by Bayesian analysis are represented in colors as in Figure 4: Group 1 in orange (C, CB and D from Eastern Asia), Group 2 in light orange (D and B from the Indian subcontinent), Group 3 in light green (D from Eastern Africa), Group 4 in light blue (B and DB from Eastern Africa), Group 5a in dark blue (B accessions assembled from North America) and Group 5b in dark blue (Gma), Group 6 in red (D, DC and G from Western Africa), Group 7 in light purple (C from Central and Eastern Africa), Group 8 in dark green (C and GC from Southern Africa), Group 9a in pink (G from the Indian subcontinent and East Asia), Group 9b in pink (G from Southern Africa), Group 9c in pink (C from Eastern Africa), and Group 10 in purple (GC, K and KC from Southern Africa). Unassigned accessions are presented in grey. doi:10.1371/journal.pone.0059714.g006
Perspectives. The core reference set is expected to stimulate links among sorghum scientists. The data have been analysed with several methods, which provide marker-based keys to germplasm classification and are meant to serve as a reference. Any new material can easily be compared to this reference; these markers are easily applicable for local studies with local questions in local laboratories and yield results that are comparable to those of studies performed elsewhere, thanks to the use of a common kit of markers and standards.
This will be very useful for identifying germplasm action priorities, for enriching global collections if novel types are uncovered or for broadening the basis of a given breeding program.
Having the data available for the whole GCGC for 41 SSR loci provides considerable support for mining germplasm diversity. Molecular data can serve for complementing reference materials with additional germplasm targeted towards particular applications depending on the operational constraints, the biological constraints (e.g. phenology) and statistical power. Typically, SSR data can be used to adjust a sample to a target size with a view to minimizing population structure in order to maximize resolution power in a given association analysis; the Maximum Length Subtree function of the DARwin software can help do this easily, quickly and rigorously [43]. This dynamic process will also enable adjusting the reference set by making it inclusive of newly characterized diversity.
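The idea of trimming a sample to a target size while limiting redundancy can be sketched as follows. This is not the Maximum Length Subtree procedure of DARwin, which operates on a tree; it is a simplified greedy stand-in that works directly on pairwise distances. The simple matching distance is taken here as one minus the proportion of shared alleles averaged over loci, assuming diploid calls and no missing data; the accession names, allele sizes and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Profile = List[Tuple[int, int]]  # one diploid genotype (two allele sizes) per locus


def simple_matching_distance(p: Profile, q: Profile) -> float:
    """One minus the proportion of matching alleles over all loci (no missing data)."""
    if not p or len(p) != len(q):
        raise ValueError("profiles must cover the same non-empty set of loci")
    shared = 0
    for gp, gq in zip(p, q):
        # Multiset intersection counts the alleles the two genotypes have in common
        shared += sum((Counter(gp) & Counter(gq)).values())
    return 1.0 - shared / (2 * len(p))


def prune_to_size(profiles: Dict[str, Profile], target: int) -> List[str]:
    """Greedily drop one member of the closest remaining pair until `target` are left."""
    if target < 1:
        raise ValueError("target must be at least 1")
    kept = sorted(profiles)
    dist = {
        frozenset((a, b)): simple_matching_distance(profiles[a], profiles[b])
        for a, b in combinations(kept, 2)
    }

    def mean_dist(x: str) -> float:
        # Average distance from x to every other accession still retained
        others = [y for y in kept if y != x]
        return sum(dist[frozenset((x, y))] for y in others) / len(others)

    while len(kept) > target:
        a, b = min(combinations(kept, 2), key=lambda pair: dist[frozenset(pair)])
        # Of the two nearest accessions, remove the one closer on average to the rest
        kept.remove(min((a, b), key=mean_dist))
    return kept


# Toy data (invented): acc1 and acc2 are duplicates, so one of them is pruned first
toy_profiles = {
    "acc1": [(120, 120), (200, 200)],
    "acc2": [(120, 120), (200, 200)],
    "acc3": [(124, 124), (204, 204)],
    "acc4": [(118, 118), (200, 204)],
}
subset = prune_to_size(toy_profiles, 3)
print(subset)
```

The greedy rule removes redundancy from densely populated clusters first, which is the general intent described above; a tree-based method may retain a different subset for the same target size.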
In the long term, helping a global community to focus on similar materials for all sorts of biological investigations will help accumulate and compile data in order to develop better biological understanding of sorghum, and of plant biology thanks to sorghum.