Origins and geographic diversification of African rice (Oryza glaberrima)

Rice is a staple food for the majority of the world’s population. Whereas Asian rice (Oryza sativa) has been extensively studied, the exact origins of African rice (Oryza glaberrima) are still contested. Previous studies have supported either a centric or a non-centric geographic origin of African rice domestication. Here we review the evidence for both scenarios through a critical reassessment of 206 whole genome sequences of domesticated and wild African rice. While genetic diversity analyses support a severe bottleneck caused by domestication, signatures of recent and strong positive selection do not unequivocally point to candidate domestication genes, suggesting that domestication proceeded differently than in Asian rice–either by selection on different alleles, or different modes of selection. Population structure analysis revealed five genetic clusters localising to different geographic regions. Isolation by distance was identified in the coastal populations, which could account for parallel adaptation in geographically separated demes. Although genome-wide phylogenetic relationships support an origin in the eastern cultivation range followed by diversification along the Atlantic coast, further analysis of domestication genes shows distinct haplotypes in the southwest—suggesting that at least one of several key domestication traits might have originated there. These findings shed new light on an old controversy concerning plant domestication in Africa by highlighting the divergent roots of African rice cultivation, including a separate centre of domestication activity in the Guinea Highlands. We thus suggest that the commonly accepted centric origin of African rice must be reconsidered in favour of a non-centric or polycentric view.


36
History and relevance 37 Rice is the world's most important cereal crop. As a staple food for more than half of the 38 world's seven billion people, it is of crucial importance in providing food security for an to higher rainfall and reduced salinity [19]. In addition, this study supports a period of low-intensity 130 cultivation that may have started as early as 10,000 years before the effective population size reached 131 a low point around 3000 years ago, when African rice was reputedly domesticated.

133
One or multiple origins? 134 Thus far, no conclusive evidence has been provided in favour of either the non-centric or the 135 centric domestication hypothesis. Although the first genome study of African rice supported a single 136 origin of domestication [18], the suggested place of origin is not located along the Inner Niger Delta 137 as suggested by Portères [2], but rather in what Portères proposes to be the secondary diversification 138 centre(s) on the Atlantic coast. In contrast, the competing scenario of diversification in this region 139 makes no claims as to the original centre of domestication, and even provide evidence for the 140 protracted model [19]. None of the other genetic studies mentioned in the previous section were able 141 to pinpoint a clear centre of domestication. In addition, evidence regarding population structure is 142 inconclusive and varies from no observed structure at all [15,17], to clearly differentiated [11,16]  155 glaberrima and O. barthii accessions from across the species range (S1 Table and  methods. In short, we wanted to 1) confirm the genetic bottleneck in African rice as a result of 159 domestication; 2) identify which alleles have been driven to (near) fixation as a result of artificial 160 selection; 3) discover population structure and differentiation within and between the two species; 4) 161 assess the influence of geography on the distribution of genetic variation; 5) explore discordances 162 between the evolutionary histories of candidate domestication genes and the genome-wide species 163 tree; and 6) predict the functional relevance of gene regions with divergent histories.

189
Relative nucleotide diversity between the two species was significantly lower in the cultivated 190 species (π c = 0.0007) than in the wild species (π w = 0.0013) at p < 1.0E-05. The relative nucleotide 191 diversity (π w / π c ) was found to be 1.87 across the genome, but was markedly higher in some regions

275
The collection sites of OG accessions suggest that the observed population structure has a 276 strong geographic component ( Fig 3B). Whereas most of the coastal accessions belong to either OG-I,

277
OG-II or OG-III, the majority if accessions sampled inland belong to OG-IV and OG-V. Geographic both species together and separately. We were unable to detect any such association in O. barthii alone. This may be due to the low number of geo-referenced individuals, a lack of geographic 283 structure or both. When both species were pooled together ( Fig 3C) or O. glaberrima was taken alone 284 ( Fig 3D), however, latitude and longitude were found to be significantly correlated with either of the  (Table 2). This pattern is mirrored by the number of segregating sites 307 remaining in each population after removing monomorphic SNPs, which is again the largest for OG-II 308 and the smallest for OG-IV (Table 2). The opposite trend can be seen for average nucleotide diversity, 309 which is smaller in the coastal populations (π < 1.0/kb) than in the inland populations (π > 1.0/kb).

310
We thus observe that, even though the total number of polymorphic sites in larger in OG-I through   311 OG-III, the average number of pairwise differences between these individuals is lower. This suggests 312 that a smaller number of individuals carries a larger fraction of the polymorphic sites, which is 313 consistent with a population expansion scenario, as explained before.
314 315  (Fig 4). This correlation was stronger than the observed correlation within any  OsLG1 and Sh4), the segregating haplotype that is farthest removed from the major clade in these 386 genes is composed O. glaberrima individuals that are exclusively from the OG-II population (Table   387 3). A closer inspection of the neighbouring O. barthii accessions reveals that their closest relatives all 388 belong to the OB-B subpopulation, rather than the expected OB-C and OB-D populations (Fig 6). A 389 re-examination of the other trees subsequently shows that some of the OG-II accessions also cluster 390 with OB-B in other genes (Table 3)

589
Although it has been shown that cultivated African rice is much less genetically diverse than 590 wild African rice, the ADMIXTURE analysis shows that O. glaberrima is all but homogeneous, even 591 compared to O. barthii. The subdivision of the species in coastal and inland populations is suggestive 592 of geographic structure, as is the differentiation along a north-south gradient on the coast. Contrary to 593 expectation, moderate isolation by distance was observed in three out of five genetic sub-populations.

594
Due to data limitations, isolation by distance could not be assessed in O. barthii.

692
in the total population, and σ 2 S is the variance in allele frequency between the two sub-populations.

693
Relative nucleotide diversity was calculated as the ratio of π in O. glaberrima to π in O. barthii,

695
(1) Here π ij is the number of differences per site between sequences i and j, x i is the frequency of sequence 697 i, x j is the frequency of sequence j and n is the total number of sequences in the data set. The results 698 were deemed sufficiently comparable to proceed with both call sets (S14 Fig). In addition, site depth, 699 call rate and mean heterozygosity per individual were calculated for all accessions using VCFtools 700 (v0.1.14). An overview of these statistics and the number and types of SNPs in the two sets can be 701 found in S15 been partly shaped by demographic history, these scans thereby each to some extent control for the 754 confounding effect of past fluctuations in population size. In comparative analyses of a number of these CLR methods using simulated data, SweeD and OmegaPlus were shown to outperform other 756 tests [47]. While SweeD is capable of taking into account the polarisation of alleles in the so-called 757 'unfolded' SFS, this has the disadvantage that limiting the analyses to only unfolded SNPs causes a 758 significant loss of data. OmegaPlus has no feature to distinguish between ancestral and derived 759 alleles, but has the added advantage of explicitly taking into account patterns of Linkage 760 Disequilibrium (LD). Extensive LD is observed in the O. glaberrima genome, with r 2 reaching half its 761 maximum value at a distance of 175 kb and approaching baseline at 300 kb [19]. For these reasons,

762
OmegaPlus was chosen as the preferred method.

763
The ω test statistic is calculated as:

764
(2) = (( 2 ) + ( - Here W is the number of segregating sites, divided into two groups: one from the first to the th 766 polymorphic site on the left and the other from the (l + 1) th to the last polymorphic site on the right. L 767 and R represent the left and right set of polymorphic sites, respectively, and r 2 ij is r 2 , a common allele frequencies of SNP1, q 1 and q 2 are the allele frequencies of SNP2, and D measures the absolute 771 difference between the observed and the expected haplotype frequencies p 1 q 1 , p 2 q 1 , p 1 q 2 and p 2 q 2 772 respectively.)

773
Values of ω were log-transformed, prior to creating Manhattan plots in R (v3.3.2) using the 774 'qqman' package [49]. Windows containing domestication genes with previous evidence for positive 775 selection were highlighted. Positions within the top 0.5% values were considered candidate regions.

776
To verify whether these candidate regions show other characteristic signatures of selection, the ω-777 statistic was plotted against overlapping 25 kb windows of π and Tajima's D, respectively (S5 Fig). 778 Since common outliers were rare and would require lowering the threshold for OmegaPlus, we did not 779 report any common outliers as candidate regions, but rather chose OmegaPlus as the leading test.

780
Candidate regions were screened for potential causative mutations by examining related SNP 781 content and genomic features. Variants were annotated using SnpEff (v4.0) [43]. Genomic features