Updating Phylogeny of Mitochondrial DNA Macrohaplogroup M in India: Dispersal of Modern Human in South Asian Corridor

To construct maternal phylogeny and prehistoric dispersals of modern humans in the Indian subcontinent, a diverse subset of 641 complete mitochondrial DNA (mtDNA) genomes belonging to macrohaplogroup M was chosen from a total collection of 2,783 control-region sequences, sampled from 26 selected tribal populations of India. On the basis of complete mtDNA sequencing, we identified 12 new haplogroups, M53 to M64, and redefined/ascertained and characterized haplogroups M2, M3, M4, M5, M6, M8′C′Z, M9, M10, M11, M12-G, D, M18, M30, M33, M35, M37, M38, M39, M40, M41, M43, M45 and M49, which were previously described by control and/or coding-region polymorphisms. Our results indicate that the mtDNA lineages reported in the present study (except East Asian lineages M8′C′Z, M9, M10, M11, M12-G, D) are restricted to the Indian region. The deep-rooted lineages of macrohaplogroup ‘M’ suggest in-situ origin of these haplogroups in India. Most of these deep-rooting lineages are represented by multiple ethnic/linguistic groups of India. Hierarchical analysis of molecular variation (AMOVA) shows substantial subdivisions among the tribes of India (Fst = 0.16164). The current Indian mtDNA gene pool was shaped by the initial settlers and was galvanized by minor events of gene flow from the east and west to the restricted zones. The Northeast Indian mtDNA pool harbors region-specific lineages, other Indian lineages and East Asian lineages. We also suggest the establishment of an East Asian gene pool in North East India through admixture rather than replacement.


Introduction
DNA polymorphisms reveal a population's genetic structure, past migration and admixture, susceptibility to illness and genetic causes of diseases. A phylogenetic approach is strongly recommended to avoid spurious positive associations between mtDNA mutations and diseases [1]. The pathogenic role of the mitochondrial genome requires more extensive surveys of the mtDNA sequences in different populations and patient groups. Technological improvements in DNA sequencing have made it possible to sequence complete mtDNA genomes faster. Attempts have been made to reconstruct the phylogenies and prehistoric dispersal of modern humans in Europe, Africa, Oceania, East Asia, Southeast Asia and South Asia [2, with complete mtDNA sequence information.
The out-of-Africa scenario [25] has hitherto provided little evidence of the precise route by which modern humans might have left Africa. Two major routes of dispersal have been hypothesized: one through North Africa into the Levant [26], and another through Ethiopia along South Asia [27][28]. The proposed northern route of initial dispersal of modern humans from Africa could not be sustained by complete and in-depth analysis of mtDNA in recent times [29]. The mitochondrial haplogroup M, which was first regarded as an ancient marker of East Asian origin [30][31], was later found at high frequency in India [32] and Ethiopia [33], thus raising the question of its origin. The presence of haplogroup M in Ethiopia, named M1, led to the proposal that haplogroup M originated in eastern Africa approximately 60,000 years ago and was carried towards Asia [34]. In contrast, in 2006 Olivieri [35] reported that the predominantly North African clades M1 and U6 arose in southwestern Asia and moved together to Africa about 40,000 to 45,000 years ago. Their arrival temporally overlapped the event(s) that led to the peopling of Europe by modern humans and was most likely the result of the same change in climatic conditions that allowed humans to enter the Levant, opening the way to the colonization of both Europe and North Africa. In the light of the above, the origin of the Asian M lineage in eastern Africa became uncertain.

Results
The frequency distribution of M haplogroups is shown in Table 1. In the present study, 12 novel haplogroups, M53 to M64 (Table 2), have been identified, and the phylogenetic status of previously identified haplogroups based on control-region and/or coding-region information has been ascertained or redefined from the dataset of 26 tribal populations (Fig. 1). The novel haplogroups are named according to the nomenclature system published elsewhere [47]. The phylogenetic tree for haplogroup M in India, based on 737 complete mtDNA sequences (641 from our study and 96 from earlier studies), is shown in Fig. 2.

Novel haplogroups in India
The phylogeny trees of haplogroups M53 to M62 are shown in Fig. S1. Haplogroup M53 encompasses 10 samples from Kamar, 6 samples from Nihal and 4 samples from Pauri Bhuiya of the present study, and R188 of [22]. Haplogroup (Fig. S2). Haplogroup M64 has been identified in the Nihal population of central India (Fig. S2).

Refinement of previous haplogroups
A new subhaplogroup of M4, M4c, has been identified in Shertukpen (ST36) and Dirang Monpa (DR77) of Northeast India (Fig. S2). The frequency and diversity of haplogroup M5 suggest that it might have originated in central India and spread to the eastern and western regions of India. The presence of M5a1b in Slavonic populations [48] and western Indians indicates its recent migration into Eurasia. Novel subhaplogroups M5a3 to M5a5 have been defined in the present study, while the single sequences reported by [22], T13 and A64, B26, have been assigned to the M5b and M5c haplogroups respectively (Fig. S4).
Haplogroup M6 has been redefined with 9 mutations, unlike in the earlier study [22] with 11 mutations. Haplogroup M6 branches into M6a and M6b, and M6a further branches into M6a1 and M6a2 in the present study. The lineages of Pauri Bhuiya, Munda, Hill Kolam and R56 of [22] have been classified under M6a1. Lineages R65 of [22] and P31 of [23], categorized under M6b earlier, have been assigned to M6a2 in the present study. Subhaplogroup M6b has been found in Korku (KK56) and Andh (AD27) of central India (Fig. S3).
M34b, a subhaplogroup of M34, has been newly defined, and another subhaplogroup, M34a [23], has been redefined in the present study. Samples MN42 and PB103 have been grouped under M34a. NI37 of the present study, along with C56 of [22], forms M34b (Fig. S6). Two new subhaplogroups, M35b and M35c, have been added to the existing M35 phylogeny tree. Subhaplogroup M35a has been reported from Betta Kuruba (8 samples), Andh (3 samples), and a sample each from Nihal, Hill Kolam and Dongri Bhill. M35b encompasses 12 sequences of the study and sequence T17 of [22]. Two samples each from Kathodi and Andh have been categorized under M35c (Fig. S6). M35b, a founder lineage of Roma, is present in the gene pools of different Slavonic groups (such as Slovaks, Czechs, Poles, and Russians). It provides evidence of the Indian origin of the Roma population [48].
In the present study, haplogroup M36 has been classified into 4 subgroups: M36a, M36b, M36c and M36d. This group consists of 33 sequences of Jenu Kuruba and one sequence of Kamar. Both populations belong to Dravidian groups of South India (Fig. S7). Haplogroup M37 is characterized by mutations at sites 10556 and 152 [22]. Samples of Reddy (R45) and Rathwa (R1) have been classified under M37a [22][23]. Haplogroup M37 has been further classified into M37b, M37c, M37d, M37e1 and M37e2 in the present study. The lineages from Nihal and Kathodi have been named M37b. Three samples of Dongri Bhill have been named M37c, and two samples of Katkari have been named M37d. Sample C26 of [22] has been assigned to subhaplogroup M37e1. Subhaplogroup M37e2 consists of Dongri Bhill (DB110) and Pauri Bhuiya (PB87 and PB89) lineages. A Gallong sample (GL66) shares the basal mutations of M37 and has been distinguished as a separate lineage with 14 private mutations (Fig. S2). The monophyletic origin of M38 and M18 [22] has been confirmed in the present study. The basal mutations of haplogroup M38 remain the same as in the earlier work [22]. We defined 2 new subhaplogroups of M38, M38a and M38b. Lineages T72 and A24 of [22] have been reassigned to M38a. Subhaplogroup M38b has been further classified into M38b1, with two Korku lineages and one Pauri Bhuiya lineage, and M38b2, with two Katkari and five Kathakur lineages.
Haplogroup M18 has again been redefined in the present study (Fig. S2). Haplogroup M18 has a high frequency in the Malpaharia tribe (29%). Haplogroup M39 has been identified in 9 tribal populations from the central, southern and eastern regions of India (Fig. S7).
The characteristics of haplogroup M40 are similar to those in the earlier work [22]. Samples T6 and R59 of [22] and our 22 samples from seven tribal populations have been grouped under subhaplogroup M40a (Fig. S7). Haplogroup M41 and its sub-branches M41a, M41b and M41c were defined in earlier work [23]. Subhaplogroup M41a has been identified in the Malpaharia and M41b in the Madia population, whereas Kamar lineages are represented by a new subgroup, M41d, in the present study (Fig. S7). M42 has been identified in 1 Pauri Bhuiya, 3 Madia and 3 Munda samples from our database, and the results have been published [43]. Haplogroup M43 has been identified in Dirang Monpa and Shertukpen of Northeast India and has been further classified into M43a and M43b (Fig. S7). Haplogroup M45 of [45] has been redefined in the study. It harbors sequences from the Munda, Korku and Hill Kolam tribal populations of central India. Haplogroup M49 has been identified in Bhoi of Meghalaya [46]. In the present study, this haplogroup has been identified in 11 samples of Dirang Monpa and one sample each of Sonowal Kachari and Wanchoo of Northeast India. These lineages cluster into a new subhaplogroup, M49a, whereas sample BH1 of [46] is assigned to subhaplogroup M49b (Fig. S7).

East Asian haplogroups in India
It has been interesting to identify the major East Asian haplogroups M8′C′Z, M9, M10, M11, M12′G and D in India. East Asian lineages [1,18,49] have been identified on the basis of complete mtDNA sequences in the Northeast Indian populations. Several novel sub-branches emerged from our study (Fig. S5), thus largely broadening our understanding of human dispersal in South-East Asia.
Haplogroups C and Z are sister subhaplogroups of M8 [50]. Under subhaplogroup C, C4a1 is defined in Han Chinese [1]. In the present work, a new lineage, C4a1a, has been defined for the Lepcha, Lachungpa and Wanchoo populations. The Chinese sample (XJ8435) [1] has been reassigned to C4a1b instead of C4a1. Further, two new lineages, C4a3 and C4a4, have been assigned for Indian samples. The Chinese sample (LN7710) [1] and samples of Dirang Monpa, Wanchoo and Gallong have been redefined as subhaplogroup C7. Eleven Indian samples have been defined as C7a1 and C7a2, and sequence LN7710 of [1] has been reassigned to C7a3. In the Gallong population, a new subhaplogroup, C7b, has been identified. The characterization of haplogroup Z is similar to the earlier work [18]. Four Dirang Monpa sequences have been grouped into a new subgroup, Z6. Other Indian samples (Lepcha, Lachungpa and Dirang Monpa) have been named Z3a, while Gallong samples have been grouped under Z3b. The Japanese sequence JD21 [18] has been reassigned as Z3c. The largest diversity of sister haplogroup C has been reported in Korea (100%), followed by central Asia (86%) and northern China (78%-74%). Therefore, C can be considered a clade with a Northeast Asian radiation [18]. Representatives of subhaplogroup Z extend from the Saami [4] and Russians [51] of west Eurasia to the people of the eastern peninsula of Kamchatka, the Russian Far East [52]. Its largest diversities are found in Korea (88%), followed by northern China (73%) and central Asia (67%), compatible with the hypothesis of a central-east Asian origin of radiation for this haplogroup [18] (Fig. S5).
Haplogroup D has its highest frequency in central and East Asia, including Japan. The D sublineages D1, D2 and D3 denote Native American lineages [53]. D4 and D5 have been proposed for Asian lineages [50], whereas D6 has been marked for Japanese. In addition to D4a and D4b, 12 new branches (D4c to D4n) have been defined in Japanese populations [18]. In the present study, subhaplogroups D4b and D4j have been identified in Dirang Monpa, Lepcha, Toto, Wanchoo and Sonowal Kachari. The new sub-branches D4p and D4q have been identified in this study. D4p has been identified in Sonowal Kachari and Lachungpa, whereas D4q has been identified in Dirang Monpa, Toto and Shertukpen. Gallong and Shertukpen lineages have a genetic link with Japanese through the shared D4b2b haplogroup. Haplogroup D4j is defined by a transition at np 11696 [18] and is the most frequent one among the Northeast Indian populations (Toto, Gallong, Lepcha, Lachungpa, Wanchoo and Dirang Monpa) (Fig. S4). Subhaplogroup D5a2 has been identified in Gallong, Sonowal Kachari and Wanchoo of North East India. The geographic distribution of D lineages is peculiar. For example, D5 is prevalent in southern China. D4a is abundant in the Chukchi of Northeast Siberia, but D4a1 and D4n have their highest frequencies in the Japanese populations [18], whereas D4j is frequent in Northeast Indian populations.
Haplogroup E shares M9-defining mutations [1]. We followed the 2009 haplogroup nomenclature of [54] for consistency. Indian samples (LA70, LA32, DR46, DR100), a Chinese sample (XJ8420) and a Japanese sample (PD11) are clustered under M9a3. Another 17 Indian samples have been clustered into the M9d lineage. M9 has a central and eastern Asian geographic distribution and reaches its greatest frequency (11%) in Tibet. The present Indian samples carrying haplogroup M9 are geographically adjacent to Tibet. In addition to mainland Japanese, M9 has been detected in the indigenous Ainu and Ryukyuans [55] (Fig. S5). Haplogroup M10a [1] has been identified in the Gallong population. Although its highest frequency is among Tibetans (8%), rich diversity is found in China. It is present among Koreans and mainland Japanese, but has not been detected in either Ainu or Ryukyuans [18] (Fig. S5). In the present study, M11a has been redefined and assigned to Chinese, whereas M11b has been assigned to Japanese [18]. Indian samples (GL19, GL80, GL88, and WA94), clustered under M11a, indicate a genetic affinity with East Asians (Fig. S5). The mutation at np 15924, found at the root of M11 and M12 in Japanese [18], is absent in both Indian and Chinese samples.
In the present study, the defining mutation site of haplogroup M12′G (np 14569) is the same as in the earlier work [1]. Sample GD7825 of [11] and our sample GL31 have been assigned to a new subgroup, M12a. Samples of Pauri Bhuiya (PB8 and PB119) have been defined as a new group, M12b. Subgroup G2a1a, which is present among Japanese, has been identified in the Wanchoo and Lachungpa populations. Novel subhaplogroups G2c, G2d and G6 have been defined in the present study. One lineage each of Gallong (GL61) and Kathodi (KD106) form subhaplogroup G2c. Subhaplogroup G2d harbors Lachungpa samples. G6 has been found in the Lachungpa and Dirang Monpa populations. G2 is abundant in northern China and central Asia and reaches higher frequencies in southern Siberia. Clades G3 and G4 are apparent in Japanese. Subgroup G5 is dominant in northeastern Siberia. However, G1a1 has the highest frequencies in a cluster embracing Japanese and Koreans [18] (Fig. S5).

Age estimates
The age estimates of the M haplogroup using the coding-region mutation rate (1.26 ± 0.08 × 10⁻⁸) [12] are listed in Table 3. The coalescence time of macrohaplogroup M in India, estimated using the synonymous mutation rate (3.5 × 10⁻⁸) [56], is 36,000 ± 3,000 years, which is less than the estimate of 46,000 ± 5,000 years by [56] for the M haplogroup in Asia.
The total rho estimate for haplogroup M is 9.9 ± 0.5 (Table 3). It includes all the Indian lineages and also East Asian lineages. After excluding the East Asian lineages (M8′C′Z, M9, M10, M11, M12-G, D), the total diversity estimate for haplogroup M in India is 8.7 ± 0.5. It is similar to the earlier works, i.e., 8.7 ± 0.6 [22].
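The rho statistic behind these estimates is the average number of mutations separating the sampled lineages from the founder node, and an age follows from dividing rho by the expected number of mutations per year over the sites considered. The following is a minimal Python sketch under stated assumptions, not the authors' procedure: it uses a commonly used formulation (rho = Σ nᵢlᵢ / n, σ² = Σ nᵢ²lᵢ / n², where branch i carries lᵢ mutations and leads to nᵢ samples), the toy tree is hypothetical, and the site count of 15,447 coding-region positions is an assumption that the text does not state.

```python
from math import sqrt

def rho_and_sigma(branches, n_samples):
    """Return (rho, sigma) for a rooted tree.

    branches: list of (mutations_on_branch, samples_descending_from_branch).
    n_samples: total number of sampled lineages below the founder node.
    """
    rho = sum(l * n for l, n in branches) / n_samples
    sigma = sqrt(sum(l * n * n for l, n in branches)) / n_samples
    return rho, sigma

def age_years(rho, rate_per_site_per_year, n_sites):
    """Convert rho to years: rho divided by expected mutations per year."""
    return rho / (rate_per_site_per_year * n_sites)

# Hypothetical toy tree with 4 samples: one internal branch with 2 mutations
# leads to 2 samples, and every sample carries 1 private mutation.
toy_branches = [(2, 2), (1, 1), (1, 1), (1, 1), (1, 1)]
rho, sigma = rho_and_sigma(toy_branches, n_samples=4)

# ASSUMPTION: 15,447 coding-region sites; the rate is the value quoted from [12].
years_per_mutation = age_years(1.0, 1.26e-8, 15447)
print(rho, round(sigma, 3), round(years_per_mutation))
```

Under these assumptions one mutation corresponds to roughly five thousand years; the values in Table 3 depend on the actual trees and on the site count used by the authors.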
Recurrent mutations generated in each network are summarized in Table S1. The total number of variable sites in the present study is 1092. Out of 1092 variable sites, 270 (24%) had mutated more than once. Of 269 sites, 15 sites mutated 4 or more times. Eleven hotspots were reported in [2,56] and 3 hotspots (nonsynonymous) are reported in the present study.
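Counting recurrent sites amounts to tallying how many separate branches of the trees carry a mutation at each nucleotide position. A minimal Python sketch, assuming each branch is available as a list of positions; the helper name `recurrent_sites` and the positions in the toy input are hypothetical and are not taken from Table S1:

```python
from collections import Counter

def recurrent_sites(branch_mutations):
    """Tally mutation events per site across tree branches.

    branch_mutations: iterable of lists, one list of positions per branch.
    Returns (sites hit more than once with their counts, number of variable sites).
    """
    hits = Counter(pos for branch in branch_mutations for pos in branch)
    recurrent = {pos: k for pos, k in hits.items() if k > 1}
    return recurrent, len(hits)

# Hypothetical branches: positions are illustrative only.
toy = [[152, 10556], [152, 4216], [152], [4216, 9000]]
recurrent, n_variable = recurrent_sites(toy)
share = len(recurrent) / n_variable            # fraction of recurrent sites
hotspots = [p for p, k in recurrent.items() if k >= 4]  # sites hit 4+ times
print(recurrent, n_variable, share, hotspots)
```

The same tally, applied to all branches of the published networks, would give the number of variable sites, the share that mutated more than once, and the sites that mutated four or more times.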

Discussion
The Indian mtDNA phylogeny (Fig. 2) [48] from published sequences. It reveals extensive maternal variation emerging from the largest number of deeply rooted autochthonous lineages, reflecting the diversity of populations residing in the subcontinent, who are biologically and culturally distinct. Hierarchical analysis of molecular variation shows significant differentiation (FST = 0.16164) and subdivisions among the populations, with a large fraction of the variance found within populations (83%) (Table 4). Individual population contributions to the global FST measure do not deviate much from the average (range 15-17 per cent), indicating that the degree of evolution of all the populations from a common ancestral population is similar, without any special evolutionary constraints. The problems faced by the earlier work [22] in constructing the Indian mtDNA phylogeny tree with 70 sequences have been resolved to some extent in this study. For example, the monophyly of M18′38 and the actual placement of the branch referred to as M4a within M4 have been confirmed. In the present study, coding-region mutations have been considered for assigning new haplogroups, as hypervariable control-region sites lead to confusing conclusions, as is evident from the global mtDNA phylogeny based on complete mtDNA sequences. In this study, 12 novel haplogroups and 25 already defined haplogroups clearly outnumber the basal variation of macrohaplogroup M in any other region of the globe. The haplogroup M frequency ranges from 50 per cent in Kathodi, Katkari and Gallong to 97 per cent in Jenu Kuruba, with an average frequency of 70 per cent, which is consistent with earlier works [28,32,37,39][57][58][59][60][61][62]. Haplogroup M has a high frequency in India and drops abruptly to about 5% in Iran, marking the western border of the haplogroup M distribution [37]. Maternal gene flow in and out of India has been limited since the initial settling of Indian maternal lineages.
Eastern and western Eurasian lineages range from 10-12 per cent in India [37]. Low frequencies of western Eurasian haplogroups in India [32,38,63] Eurasian-specific mtDNA haplogroups, reaching a peak of nearly 50% in the Kanet of Himachal Pradesh [37]. In the present study, fair frequencies of eastern Asian haplogroups were observed in the North East Indian populations (Table 1). The current Indian gene pool was reshaped in situ after the initial mtDNA pool was established, and was galvanized by relatively minor events of gene flow from the West and from the East into India through admixture. The Indian mtDNA pool consists of several deep-rooted lineages of macrohaplogroup 'M', suggesting in situ origin [22][23][36][37]. It is apparent that all the ancient lineages under analysis emerge directly from the root of macrohaplogroup M. Asian phylogenetic trees have been broadened in the present study with additional Northeast Indian data. For example, sequence XJ8435 [1] of C4a1 has been further assigned to C4a1b, and Indian samples (LA50, LA61, LP67, WA46 and WA105) have been classified into C4a1a. Apart from C4a1 and C4a2 of the East Asian phylogeny tree, C4a3 and C4a4 have been defined in the present study. Northeast Indian tribes, particularly Tibeto-Burman linguistic groups, indicate genetic affiliation with East Asians. This is in agreement with earlier works: mtDNA evidence [64][65], Y chromosome evidence [66][67] and linguistic evidence [68]. In Northeast India, the D4b2b, D4j, D5a2, C4a, C7, M9a, M10a, M11a, M12 and G2a1a haplogroups are the result of Last Glacial Maximum (about 20,000 years ago) migrations from southern China and are admixed with local initial settlers.

Origin of Macrohaplogroup M
L3 lineages other than M and N are absent in India and among non-African mitochondria in general [2][3][49]. The M, N and R haplogroups of mtDNA show no indication of an African origin. However, an African origin of haplogroup M has been proposed [34] in view of its high frequency in Ethiopia. But in 2006, [35] demonstrated that the presence of M1 and U6 in Africa is due to a back migration. Sequencing of 81 entire human mitochondrial DNAs belonging to haplogroups M1 and U6 revealed that these predominantly North African clades arose in southwestern Asia and moved together to Africa about 40,000 to 45,000 years ago. Only some subsets of M1a (with an estimated coalescence time of 28.8 ± 4.9 ky), U6a2 (with an estimated coalescence time of 24.0 ± 7.3 ky) and U6d (with an estimated coalescence time of 20.6 ± 7.3 ky) diffused to East and North Africa through the Levant, leaving the origin of macrohaplogroup M unresolved. Haplogroup M is ubiquitous in India, although its frequency is somewhat higher in southern than in northern Indian populations, and is to a large extent autochthonous, because neither the East nor the West Eurasian mtDNA pools include such lineages at notable frequencies [37,58]. Our findings, (for example, deep time depth

Migration routes of modern human
Recent mtDNA evidence on the out-of-Africa migration route of modern humans suggests a single dispersal by a southern coastal route to India and further to East Asia and Australia [17,20,22,23,66,69]. The North Asian route is not supported by mtDNA, owing to the lack of basal M, R and N lineages in northern Asians, which rules out the existence of a northern Asian route [29][30][70][71]. The proven back migration of sublineages of M and U into Africa [35], and the absence of L3 lineages or of an ancestral lineage for L3, M and N in India, leave two issues unresolved: evidence for the southern route hypothesis from India, and the origin of haplogroup M. However, in the present study, the basal diversity (37 nodes) and founder ages (57,000-75,000 years) of macrohaplogroup M in India indicate that migrants of the African exodus initially settled in India. Our database also provides evidence that Andaman islanders and Australians have ancestral maternal roots in India [24,43].
In summary, the present study provides evidence that several Indian mtDNA M lineages are deep rooted and of in situ origin. In North East India, the coalescence time of East Asian lineages dates back to the Last Glacial Maximum (LGM). Further, the combination of virtually all previously reported lineages from South and East Asia with our newly produced Indian complete mtDNA sequences has helped to define several novel (sub)haplogroups. The present work further ascertained previously reported haplogroups and refined the phylogenetic tree of South Asia. This updated phylogenetic tree provides an essential reference guide for disease, anthropological and forensic studies among Asian populations.

Methods
The Indian populations are organized into 4,365 communities [72], which include self-defined castes, tribes and religious groups. About 450 tribes constitute 8.08% (2001 census) of the total Indian population. They speak more than 750 dialects [73], which can be broadly classified into the Austro-Asiatic, Dravidian, Tibeto-Burman and Indo-European language families. The tribes are endogamous and socio-culturally distinct. They mostly inhabit forests and hilly terrain. The Government of India has notified 75 tribes as the most primitive group among the original inhabitants of India. Out of 75 primitive tribal groups, the Anthropological Survey of India selected 26 tribes inhabiting the western, central, southern and eastern parts of India, representing the 4 major linguistic families, namely Dravidian, Indo-European, Austro-Asiatic and Tibeto-Burman, and collected 2,783 blood samples for the present study (Fig. 1).
The Ethical Committee of the Anthropological Survey of India approved the project. 5-10 ml of blood was drawn from healthy and unrelated individuals after obtaining written consent. Samples were collected in Vacutainer as per standard protocols, and extraction of DNA was performed according to the enzymatic extraction procedure followed by phenol purification [74], which was standardised at Anthropological Survey of India, C.R.C. were selected for complete sequencing. After checking the quality of sequences, 12 ambiguous sequences were removed from the final analysis. 641 complete mtDNA sequences were included in the final analysis for the present study, and the results of 97 sequences have been published elsewhere [24,43,75]. Complete sequencing was done using 24 pairs of forward and reverse primers [76]. Sequences were assembled and edited using SeqScape 2.5. Mutations were scored relative to the revised Cambridge Reference Sequence (rCRS) [77]. Deviations from the rCRS were confirmed by manual checking of the electropherograms. Phylogenetic relationships among the sequences were determined by median-joining network analysis with the help of Networking 4.1 software. Most parsimonious trees of the mtDNA haplogroups were reconstructed manually following a parsimony approach, and confirmed by the program Networking 4.1. The founder ages and TMRCA have been calculated as implemented in [21]. The age of the founder mtDNA type yields a time estimate for its arrival in the continent. It includes the ancestral nodes that were shared by its variants in the tree. The ages of haplogroups M are estimated from 736 lineages based on the mutation rate 1.26 ± 0.08 × 10⁻⁸ [12]. The ages were also calculated manually using the substitution rate estimate for protein-coding synonymous changes of 3.5 × 10⁻⁸ [56] and the rho estimate [53]. The variance of rho was estimated [78] for both methods.
Nevertheless, because all ages were calculated without evidence to sustain the assumption of a molecular clock, the estimation of the associated error values [78] is only an approximation. AMOVA was performed to evaluate the amount of genetic structure among the tribal populations using Arlequin ver 3.11 [79].
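In a hierarchical AMOVA the total variance is split into three components (among groups, among populations within groups, and within populations), and the fixation indices are ratios of these components. The sketch below shows that arithmetic in Python using the standard definitions; it is an illustration, not the Arlequin implementation, and the function name and the toy variance components are hypothetical.

```python
def fixation_indices(var_among_groups, var_among_pops, var_within_pops):
    """Fixation indices from hierarchical AMOVA variance components.

    var_among_groups: variance among groups
    var_among_pops:   variance among populations within groups
    var_within_pops:  variance within populations
    """
    total = var_among_groups + var_among_pops + var_within_pops
    return {
        "FCT": var_among_groups / total,
        "FSC": var_among_pops / (var_among_pops + var_within_pops),
        "FST": (var_among_groups + var_among_pops) / total,
    }

# Hypothetical components, chosen only to make the arithmetic easy to follow.
indices = fixation_indices(1.0, 1.0, 8.0)

# The within-population share of the variance is 1 - FST. For the FST
# reported in this study (0.16164) this gives about 0.838.
within_share = 1 - 0.16164
print(indices, round(within_share, 3))
```

With the reported FST of 0.16164 the within-population share is about 0.838, which agrees with the figure of 83% quoted for the variance found within populations.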

Quality Control
Out of 1751 M samples, 750 samples were selected for complete mtDNA sequencing. Sequencing reactions were carried out with a BigDye terminator cycle sequencing FS ready reaction kit (Applied Biosystems) to produce even signal intensities and to reduce false negatives, which enabled more accurate automated mixed-base identification. Sequencing data generated on an Applied Biosystems 3730 DNA analyzer were analyzed in SeqScape software V 2.5. KB base caller V 1.4 was used in the analysis protocol. The KB base caller processes the fluorescence signal, assigns a base to each peak and assigns a quality value (QV) to each base. The QV predicts the probability of a base-call error. The KB base caller generates QVs from 1 to 99. Typically, high-quality pure bases have QVs ranging from 20 to 50 (probability of error 1% to 0.001%). Mixed bases were identified if the secondary peak height was >25% (threshold value). To set the clear range, the sequence quality value method (remove bases from the ends until fewer than 4 bases out of 20 have QVs < 20) was used. Filter setting values used were: maximum mixed bases = 20, minimum sample score = 25. Samples that did not meet the sequence quality and the criteria specified for filtering the data prior to assembly were not assembled. These unassembled samples were re-sequenced until they satisfied the quality criteria. Editing of data and scoring of mutations were done by two independent groups of researchers. Phylogenetic network analysis was performed and some errors were identified (mixing of contigs etc.). Twelve unresolved samples (ambiguous, low-quality or erroneous sequences) were eliminated from the final analysis. 641 complete mtDNA sequences were included in the final analysis for the present study, and the results of 97 sequences have been published elsewhere [24,43,75]. To check the reliability of the data, we calculated the diversities and compared them with the earlier work [22]. The diversity values agreed with the earlier work.
Further, to ascertain the quality of the results, recurrent mutations generated by the individuals' tree networks were summarized and considering the work by [52] as a reference point, hotspots were rechecked.
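The quality values described above follow the usual Phred-style relation QV = -10 log10(P), which reproduces the quoted error probabilities (QV 20 is 1%, QV 50 is 0.001%). The Python sketch below illustrates that relation and one plausible reading of the clear-range rule (trim from each end until a 20-base window has fewer than 4 bases with QV below 20). It is an assumption-laden illustration, not the SeqScape or KB base caller algorithm, and the function names and toy quality values are hypothetical.

```python
def error_probability(qv):
    """Base-call error probability for a Phred-style quality value."""
    return 10 ** (-qv / 10)

def clear_range(qvs, window=20, max_low=4, threshold=20):
    """Return (start, end) of the clear range as a half-open interval.

    Bases are dropped from each end until the first window of `window`
    bases contains fewer than `max_low` bases with QV < `threshold`.
    Returns None if no window passes.
    """
    def first_good(values):
        for i in range(len(values) - window + 1):
            low = sum(q < threshold for q in values[i:i + window])
            if low < max_low:
                return i
        return None

    start = first_good(qvs)
    if start is None:
        return None
    tail = first_good(qvs[::-1])  # same scan from the 3' end
    return start, len(qvs) - tail

# Hypothetical read: 10 poor bases, 60 good bases, 10 poor bases.
toy_qvs = [5] * 10 + [40] * 60 + [5] * 10
trimmed = clear_range(toy_qvs)
print(error_probability(20), error_probability(50), trimmed)
```

Because the rule is window based, a few low-quality bases can remain just inside the clear range; in the toy read above, three poor bases survive at each end.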
All the sequences have been deposited in the NCBI database (Accession Numbers: FJ383174 to FJ383814).

Post script
Haplogroup nomenclature conflict. The global mtDNA tree at http://www.phylotree.org presents previously published haplogroups as well as the newly identified haplogroups M51 and M52 in the study [54]. While our paper was under review, another study [80] defined haplogroups M51, M52 and M53, with the name M53 given to the already defined M45. A nomenclature conflict therefore exists between the two studies. Haplogroups M51 and M52 of [80] coincide with our M54 and M58, respectively. We followed the mtDNA tree at http://www.phylotree.org and named our new haplogroups M53 to M64.

Figure S1 Indian mtDNA phylogenetic tree of macrohaplogroup M. Suffixes A, C, G, and T indicate transversions, "d" indicates a deletion, and a plus sign (+) indicates an insertion; 9bpins means the 9-bp insertion (CCCCCTCTA) in the COII/tRNALys intergenic region. The A/C stretch length polymorphisms in regions 16180-16193 and 303-315 and mutation 16519, all known to be hypervariable, were disregarded for tree reconstruction; recurrent mutations are underlined and @ indicates a back mutation. Sample code names are given in Figure 1. Samples collected from published sources are denoted by the symbols SU [22], TK [18], KG [11], TG [23,43], KS [52], IG [3], BM [44], HE [2] and MC [48], followed by "#" and the original sample code. Haplogroup names shown in blue were defined in earlier works, those in pink are redefined, and those in red are newly identified in the present study. Coalescence times are based on a synonymous mutation rate of 3.5 × 10^-8 [52]. Found at: doi:10.1371/journal.pone.0007447.s002 (3.76 MB TIF)

Figure S2 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1.

Figure S3 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1. Found at: doi:10.1371/journal.pone.0007447.s004 (3.69 MB TIF)

Figure S4 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1. Found at: doi:10.1371/journal.pone.0007447.s005 (3.94 MB TIF)

Figure S5 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1. Found at: doi:10.1371/journal.pone.0007447.s006 (3.84 MB TIF)

Figure S6 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1. Found at: doi:10.1371/journal.pone.0007447.s007 (3.57 MB TIF)

Figure S7 Indian mtDNA phylogenetic tree of macrohaplogroup M. Legend as for Figure S1. Found at: doi:10.1371/journal.pone.0007447.s008 (3.59 MB TIF)
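The mutation notation used in the legends of the supplementary trees can be illustrated with a minimal Python sketch. It is not part of the original analysis. The label grammar is an assumption, since the trees are distributed as images: an optional "@" prefix, a nucleotide position, and an optional suffix. Reading an unsuffixed position as a transition is likewise an assumption based on common mtDNA practice, and the names `Mutation`, `parse_label` and `is_disregarded` are hypothetical. Only the suffix meanings and the disregarded hypervariable positions are taken from the legend.

```python
import re
from dataclasses import dataclass

# Assumed label grammar (hypothetical): optional "@" prefix, a position,
# then an optional suffix: A/C/G/T, "d", or "+" followed by inserted bases.
_LABEL = re.compile(r"^(?P<back>@)?(?P<pos>\d+)(?P<suffix>[ACGT]|d|\+[ACGT]*)?$")

# Positions disregarded for tree reconstruction, as stated in the legend.
DISREGARDED_RANGES = ((16180, 16193), (303, 315), (16519, 16519))


@dataclass(frozen=True)
class Mutation:
    position: int
    kind: str            # "transition", "transversion", "deletion" or "insertion"
    detail: str          # resulting base (transversion) or inserted bases (insertion)
    back_mutation: bool  # True when the label carries the "@" prefix


def parse_label(label: str) -> Mutation:
    """Parse one mutation label written in the assumed grammar."""
    m = _LABEL.match(label.strip())
    if m is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    suffix = m.group("suffix") or ""
    if suffix == "":
        # Assumption: a bare position denotes a transition.
        kind, detail = "transition", ""
    elif suffix == "d":
        kind, detail = "deletion", ""
    elif suffix.startswith("+"):
        kind, detail = "insertion", suffix[1:]
    else:
        kind, detail = "transversion", suffix
    return Mutation(int(m.group("pos")), kind, detail, m.group("back") is not None)


def is_disregarded(position: int) -> bool:
    """True if the position lies in a region excluded from tree reconstruction."""
    return any(lo <= position <= hi for lo, hi in DISREGARDED_RANGES)
```

The labels in the checks for this sketch, such as "16318T" or "@16223", are made-up inputs chosen to exercise each branch and are not read from the figures. Special labels such as 9bpins do not fit the assumed grammar and raise `ValueError`, so they would need separate handling.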