Deep Rooting In-Situ Expansion of mtDNA Haplogroup R8 in South Asia

Background: The phylogeny of the indigenous Indian-specific mitochondrial DNA (mtDNA) haplogroups has been determined and refined in previous reports. As with mtDNA superhaplogroups M and N, a profusion of reports is also available for superhaplogroup R. However, there is a dearth of information on its South Asian subhaplogroups, including R8. Therefore, we sought to assess the genealogy and pre-historic expansion of haplogroup R8, which is considered one of the autochthonous lineages of South Asia. Methodology/Principal Findings: Upon screening the mtDNA of 5,836 individuals belonging to 104 distinct ethnic populations of the Indian subcontinent, we found 54 individuals with the HVS-I motif that defines the R8 haplogroup. Complete mtDNA sequencing of these 54 individuals revealed two deep-rooted subclades: R8a and R8b. These subclades split further into several fine subclades. An isofrequency contour map detected the highest frequency of R8 in the state of Orissa. Spearman's rank correlation analysis suggests a significant correlation of R8 occurrence with geography. Conclusions/Significance: The coalescent ages of the newly characterized subclades of R8, R8a (15.4±7.2 Kya) and R8b (25.7±10.2 Kya), indicate that the initial maternal colonization of this haplogroup occurred during the middle and upper Paleolithic period, roughly 40 to 45 Kya. These results suggest that the southern part of Orissa, currently inhabited by Munda speakers, is likely the origin of these autochthonous, deep-rooted maternal haplogroups. Our high-resolution study on the genesis of the R8 haplogroup provides ample evidence of its deep-rooted ancestry among the Orissa (Austro-Asiatic) tribes.


Introduction
India is a melting pot of multi-lingual populations with a unique complex genome diversity [1]. The linguistic diversity prevalent among Indian populations is associated with the presence of four linguistic families: Dravidian (DR), Indo-European (IE), Austro-Asiatic (AA) and Tibeto-Burman (TB) [1]. Of these four groups, AA tribes are considered to be the first settlers of the Indian subcontinent, representing about 30 endogamous tribal populations [2]. The AA linguistic family is traditionally divided into two basic subfamilies: Mon-Khmer and Mundari [3]. Among these two subfamilies, Mundari speakers, the traditional hunter-gatherers, are exclusively found in the Indian subcontinent [3][4]. Because Mundari populations are considered to be the earliest inhabitants of the Indian subcontinent, their migration during demic expansion of the agriculturalists in the Neolithic era, as has been suggested for Mon-Khmer speaking Nicobarese [5], appears doubtful.
Though numerous studies have been carried out on the phylogenetic characterization of haplogroup R, there is a dearth of research on its subhaplogroups. To the best of our knowledge, only eight complete mtDNA sequences of haplogroup R8 are available in the database [3,9]. Therefore, we aim to more accurately trace the genealogy and pre-historic expansion of haplogroup R8 into the Indian subcontinent.
Figure 2 (caption fragment; the opening is missing in the source). ... [16]. Additional sequences were taken from the literature and are denoted by the symbols CS#P and CS#C [3,9]. Suffixes A, C, G and T indicate transversions; ''d'' denotes a deletion and a plus sign (+) denotes an insertion; recurrent mutations are underlined; position 16519 is hypervariable and is therefore not shown. doi:10.1371/journal.pone.0006545.g002

Results
We analyzed a total of 5,836 samples from 104 populations across the Indian subcontinent (Figure 1) and identified 54 samples containing haplogroup R8 (Figures 2 and 3). The R8 haplogroup is defined by the sites 13215-9449-7759-3384-2755 in the coding region and a single site (195) in the control region. HVS-I motifs of Indian populations that were previously assigned to the West Eurasian haplogroup H are now, when compared against the revised Cambridge Reference Sequence (rCRS) [6,13], redefined as haplogroup R8. The topology of the previously characterized R8 samples A165, A190 and S4 [9] and of the recently classified samples Ko74, CoB41, Ko30, Ko37 and Lam10 [3] deviates significantly from that of our samples. A190 [9] grouped with our samples of Panika, Mudiraj, Dommari and Sugali, whereas S4 grouped with Lam10. Upon complete sequencing of the 54 samples, we identified 9 novel subhaplogroups of haplogroup R8. The coalescent age for haplogroup R8 is 41.7±7.3 Kya.
HVS-I sequences of the individuals carrying haplogroup R8, who belonged to 30 different ethnic populations, were used to estimate intra-population diversity. The diversity indices and neutrality test values are presented in Table 1. Tajima's D and Fu's Fs were significantly negative in 18 and 26 populations, respectively (Table 1). A similar trend was also observed in the mean number of pairwise differences: Savara 6.561 (3.23), Bhumia 6.269 (3.11), Gadaba 4.808 (2.47), Dhurva 4.933 (2.54) and Bonda 4.837 (2.43).
We carried out principal component analysis (PCA), based on haplogroup frequency distributions, to explore the affinities among the populations possessing haplogroup R8. The PCA plot identified close affinities among the Orissa tribes belonging to the Austro-Asiatic linguistic family (Figure 5). Combined, PC1 and PC2 accounted for 63.70% of the variance in the data.
The comparatively high frequency of R8 in Orissa populations, especially among the AA-speaking Mundari tribes, strongly suggests that this haplogroup might have originated among the maternal ancestors of the contemporary AA speakers of the region. To substantiate this hypothesis, we estimated the coalescence time and compared it with archeological evidence. The times to the most recent common ancestor (TMRCA) of R8 (41.7±7.3 Kya) and of its subclades R8a (15.4±7.2 Kya) and R8b (25.7±10.2 Kya) reveal the ancient demographic history of this haplogroup (Figure 2). Haplogroup R8 is also present at low frequency among Dravidian and Indo-European speakers, which can be explained by a language shift or by local admixture with AA speakers. Interestingly, this haplogroup was not found in any of the Tibeto-Burman populations analyzed in the present study.
A contour map of the R8 haplogroup revealed its distribution across different geographical regions (Figure 4). The map shows that this haplogroup is concentrated in Orissa, Gujarat, Chattisgarh and Jharkhand, with the highest frequency in Orissa (12%). Spearman's rank correlation analysis demonstrated a significant correlation of R8 haplogroup frequency with latitude and longitude (p<0.05), which is strong evidence for a relation between genes and geography in this group.
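The rank correlation between haplogroup frequency and a geographic coordinate can be illustrated with a minimal Python sketch. It is not the StatistiXL procedure used in the study; it only computes Spearman's coefficient as the Pearson correlation of average ranks, and it does not compute a p-value. The function names and the coordinate and frequency values are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical example data (NOT values from the study):
# longitudes of five sampling regions and the R8 frequency observed in each.
longitude = [72.6, 78.5, 82.0, 84.3, 85.8]
r8_frequency = [0.01, 0.02, 0.05, 0.08, 0.12]
print(round(spearman_rho(longitude, r8_frequency), 3))
```

Because ranks are used instead of raw values, the coefficient measures any monotonic trend of frequency with geography, not only a linear one.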
The significant negative values obtained from neutrality tests support the hypothesis of population growth. The PCA plot (Figure 5) found close affinities among the Orissa (AA tribe) populations, perhaps due to the high frequency and influence of the R8 haplogroup.
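To make the link between negative neutrality statistics and population growth concrete, the following Python sketch computes Tajima's D from a set of aligned sequences using the standard formula (Tajima, 1989). It is an illustration only, not the Arlequin implementation used in the study: it performs no significance test, and the function name and toy sequences are hypothetical. An excess of rare variants, as expected after an expansion, pushes the mean pairwise difference below the value expected from the number of segregating sites, giving a negative D.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (no gaps assumed)."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for pos in range(length) if len({q[pos] for q in seqs}) > 1)
    if n < 3 or s == 0:
        raise ValueError("D is undefined for fewer than 3 sequences or S = 0")
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(p, q)) for p, q in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment (NOT data from the study): a star-like sample
# in which every variant is a singleton, as expected after an expansion.
star_like = ["AAAA", "CAAA", "ACAA", "AACA", "AAAC"]
print(tajimas_d(star_like) < 0)
```

Fu's Fs follows a different logic (it is based on the probability of the observed number of haplotypes) and is not reproduced here.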
Our high-resolution study on the origin of the R8 haplogroup provides abundant evidence of its deep-rooted ancestry among the Orissa (AA) tribes. The TMRCA estimates revealed that the initial maternal colonization of this haplogroup occurred during the mid-to-late Paleolithic period, roughly 40 to 45 Kya. The significant relation between genes and geography is supported by the spatial analysis of this haplogroup. Moreover, the absence of haplogroup R8 and its subhaplogroups among the Tibeto-Burman speaking populations studied suggests that socio-cultural practices among these populations are the principal factor for genetic demarcation. Thus, the phylogeographic reconstruction of 54 complete mitochondrial sequences containing haplogroup R8 furnished a better understanding of this partially characterized haplogroup. Our high-resolution analysis also provided detailed coding-region information for the proper classification of a sample, especially in the case of the South Asian haplogroups, which contain several deep-rooted lineages sharing identical coding region mutations with the exception of the HVS-I [14][15].

Ethics Statement
All DNA samples analyzed in the present study were derived from blood samples collected with informed written consent according to protocols approved by the Institutional Ethical Committee of CCMB, Hyderabad.
The samples used in this study were obtained from the DNA bank of CCMB. We screened a total of 5,836 individuals belonging to 104 ethnic populations from 17 states of India (see Figure 1; Supplementary information Table S1), initially for HVS-I (16000 to 16400) and then for nucleotide position 3384. Among the 5,836 mtDNA samples screened, 54 were found to contain the basal mutations 13215-9449-7759-3384-2755 that define haplogroup R8. Twenty-four sets of primers were used to sequence the complete mtDNA. Sequencing of PCR amplicons was performed using the BigDye terminator cycle sequencing kit and an ABI 3730XL DNA analyzer (Applied Biosystems, Foster City, USA). The sequences were edited and assembled using AutoAssembler (version 1.4) software (Applied Biosystems, Foster City, USA) to obtain a consensus sequence. These sequences were aligned with rCRS and the mutations were noted [16].
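The classification step described above can be sketched in Python as a simple check of a sample's variant positions against the R8-defining sites named in the text. This is a simplified illustration, not the authors' pipeline: it compares positions only (the text does not state the specific base changes), and the function name, the returned fields, and the sample variant list are hypothetical.

```python
# Sites named in the text as defining haplogroup R8 (positions relative to rCRS).
R8_CODING_SITES = frozenset({2755, 3384, 7759, 9449, 13215})
R8_CONTROL_SITE = 195


def classify_r8(variant_positions):
    """Report whether a sample carries all R8-defining coding-region sites.

    variant_positions: iterable of positions at which the sample differs
    from rCRS. Only positions are compared; base changes are ignored
    because the text does not specify them.
    """
    variants = set(variant_positions)
    missing = R8_CODING_SITES - variants
    return {
        "is_r8": not missing,
        "missing_coding_sites": sorted(missing),
        "has_control_site_195": R8_CONTROL_SITE in variants,
    }


# Hypothetical sample (NOT from the study): carries all defining sites
# plus two unrelated private variants.
sample = [195, 263, 2755, 3384, 7759, 9449, 13215, 16292]
print(classify_r8(sample))
```

In this sketch, screening position 3384 first, as in the study, would correspond to checking a single member of `R8_CODING_SITES` before testing the full set on completely sequenced samples.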
NETWORK (version 4.5) software (www.fluxus-engineering.com) was used for phylogenetic reconstruction [17]. The phylogeny obtained was reconfirmed by means of a neighbor-joining tree (1000× bootstrapped) [18], using MEGA (version 4.0) software [19]. We followed the nomenclature system of Richards et al. [20] for reconstructing the phylogenetic tree of haplogroup R8. The isofrequency map for haplogroup R8 was constructed using the Kriging method [21] in the Surfer (version 8.0) program (Golden Software Inc., Golden, Colorado). Spearman's rank correlation coefficients between mtDNA haplogroup frequency and latitude and longitude were calculated in StatistiXL (version 1.8) software (StatistiXL, Nedlands, Western Australia), with a p-value <0.05 considered statistically significant. Principal component (PC) analysis of R5-R8, R30 and R31 lineages in different Indian populations was performed using SPSS (version 11) software (SPSS Inc., Chicago, IL, USA) with mtDNA haplogroup frequencies as an input vector. Coalescence time was calculated using sequence positions between nucleotides 577 and 16023, assuming one base substitution per 5,140 years and excluding insertions and deletions [22]. The standard deviation (σ) of the rho (ρ) estimate was calculated following Saillard et al. [23]. Descriptive statistical indices and neutrality tests (Tajima's D, Fu's Fs) for HVS-I sequences were calculated using Arlequin (version 2.0) software [24]. Complete mtDNA genome sequences generated in this study were submitted to GenBank (accession numbers FJ467940-FJ467993).
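The coalescence-time calculation described above can be sketched as follows, assuming the usual form of the rho statistic (the mean number of substitutions separating the sampled sequences from the root) and the variance estimator of Saillard et al., in which each branch contributes its mutation count weighted by the square of the number of sampled descendants. The function name, the branch representation, and the toy tree are hypothetical; only the rate of one substitution per 5,140 years comes from the text.

```python
from math import sqrt

YEARS_PER_SUBSTITUTION = 5140  # coding-region rate stated in the text


def rho_coalescence(branches, n_samples):
    """Estimate coalescence age and its standard deviation from a rooted tree.

    branches: list of (n_descendants, n_mutations) pairs, one per branch,
    where n_descendants is the number of sampled sequences below the branch.
    Returns (age_in_years, sd_in_years).
    """
    # rho: average number of substitutions from the root to each sample.
    rho = sum(n_i * l_i for n_i, l_i in branches) / n_samples
    # sigma: sqrt(sum(n_i^2 * l_i)) / n, following Saillard et al.
    sigma = sqrt(sum(n_i ** 2 * l_i for n_i, l_i in branches)) / n_samples
    return rho * YEARS_PER_SUBSTITUTION, sigma * YEARS_PER_SUBSTITUTION


# Hypothetical toy tree (NOT the R8 phylogeny): four samples in two clades.
toy_branches = [
    (2, 3), (1, 1), (1, 1),  # clade A: stem with 3 mutations, two tips
    (2, 2), (1, 1), (1, 2),  # clade B: stem with 2 mutations, two tips
]
age, sd = rho_coalescence(toy_branches, n_samples=4)
print(f"{age / 1000:.1f} ± {sd / 1000:.1f} Kya")
```

Under this rate, the reported R8 age of 41.7 Kya corresponds to a rho of roughly 8.1 substitutions (41,700 / 5,140).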

Supporting Information
Table S1 List of the caste and tribal populations studied.