Disclosing the Genetic Structure of Brazil through Analysis of Male Lineages with Highly Discriminating Haplotypes

In a large variety of genetic studies, probabilistic inferences are made based on information available in population databases. The accuracy of estimates based on population samples is highly dependent on the number of chromosomes being analyzed as well as on the correct representation of the reference population. For frequency calculations the size of a database is especially critical for haploid markers, and for countries with complex admixture histories it is important to assess possible substructure effects that can influence the coverage of the database. Aiming to establish a representative Brazilian population database for haplotypes based on 23 Y chromosome STRs, more than 2,500 Y chromosomes belonging to Brazilian, European and African populations were analyzed. Despite the differences in the colonization history of the five geopolitical regions that currently exist in Brazil, a lack of genetic heterogeneity was found for the Y chromosome haplotypes of the 23 studied Y-STRs, together with a predominance of European male lineages in all regions of the country. Therefore, if we do not consider the diverse Native American or Afro-descendent isolates, which are spread throughout the country, a single Y chromosome haplotype frequency database will adequately represent the urban populations in Brazil. In comparison to the most commonly studied group of 17 Y-STRs, the 23 markers included in this work allowed a higher discrimination capacity between haplotypes from non-related individuals within a population and also increased the capacity to discriminate between paternal relatives. Nevertheless, the expected haplotype mutation rate is still not high enough to distinguish the Y chromosome profiles of paternally related individuals. Indeed, even for rapidly mutating Y-STRs, a very large number of markers will be necessary to differentiate the male lineages of paternal relatives.


Introduction
From the genetic point of view, Brazil is known as one of the most heterogeneous populations in the world, with an important genetic contribution from three main continental groups: Europeans, Africans and Native Americans.
The first European colonizers, coming mainly from Portugal, arrived in 1500 in a territory that had already been inhabited by Native Americans for at least 11,000 years [1].
During the slave trade period, which officially lasted from 1538 to 1850, a huge number of African people were forced to move to Brazil. During that period, approximately 3.6 million slaves are estimated to have entered the country [2,3]. After the abolition of slavery in Brazil in 1888, a new important migration wave took place, extending the admixture process to new European immigrants. The number of newcomers was approximately 6 million, coming from Portugal, Italy, Spain, Germany, Syria, Lebanon and Japan [4].
At the same time that people from diverse countries and continents were arriving in different regions of Brazil, important movements were taking place inside the territory, mainly due to economic interests. These internal movements, mainly from the northeast to the north and southeast regions of the country, gained a new impetus after the First World War (1914 to 1918) [5].
Consequently, the modern Brazilian population is genetically very diverse and is considered to be very heterogeneous across the 5 main geopolitical regions of the country: (i) the northern region hosts the people with the largest Native American ancestry; (ii) the northeast region has the highest African contribution; (iii) the southeast and (iv) the south are the regions where the European contribution is more important; and (v) the central west was the last region to be colonized, through an influx of people coming from all the other Brazilian regions, mainly from the northeast and southeast [4,6].
Large efforts for data collection are therefore required to actually capture the genetic diversity expected in such a large and heterogeneous country. Representative population databases are very important to correctly define allele, haplotype and genotype frequency distributions, which is essential for accurate statistical inferences in (i) kinship analysis or identification in criminal cases; (ii) the study of the origins and history of human populations and their genetic relationships; and (iii) the study of different events, like selection, that can be acting on populations [7].
Very high diversity levels can be obtained when studying a large number of Y chromosome specific loci. In most human populations, a relatively small number of 12 to 17 STRs is expected to produce a percentage of more than 90% different haplotypes in most medium-sized population samples of 100 to 500 non-related individuals (see [8][9][10][11][12][13][14][15][16][17] for some examples).
Although several studies have been published concerning the Y-STR variability in a large number of populations worldwide, major concerns still exist on the weight of most databases available for a wide range of applications, for example, those used to estimate haplotype frequencies for forensic investigation purposes [18]. A strong statistical evaluation of forensic evidence based on Y chromosome genetic profiles can only be obtained if it is supported by a sample that is sufficiently large to represent the diversity present in the reference population. Moreover, in admixed and/or substructured populations, the variability existing between different regions and ethnic groups should be reflected in the composition of the global database [19].
Concerning the Brazilian population, the available Y-STR haplotypic information is still poor, being represented by a relatively low number of samples for a limited group of markers. Most studies include no more than the 9 Y-STR loci from the so-called "minimal haplotype" [20] or the 17 Y-STR set of the commercial kit YFiler (Applied Biosystems), which are the most widely used in forensic genetics. Still, the power of discrimination for these markers is not enough to avoid a significant proportion of shared haplotypes within the studied population samples [8][9][10][11][12][13][14].
In the present work, we analyzed more than 2,000 Y chromosomes from the five different Brazilian regions aiming to establish a representative population database of the country for haplotypes based on 23 Y chromosome STRs. In comparison to the more commonly used sets of 9 to 17 Y-STRs, we also aimed to evaluate the capacity of this extended battery of 23 markers to discriminate non-related individuals, as well as to calculate the expected probability of finding different profiles in paternal relatives due to mutation(s).

Results and Discussion
A set of 23 Y chromosome STRs was investigated in a sample of admixed Brazilians from 17 different cities (Fig. 1) to evaluate the differences that could exist within and between populations from the 5 studied regions of Brazil. For this purpose, we also studied three reference samples from Native Americans, Europeans and Africans because they represent the main ancestral sources of modern Brazilian populations. A list of the observed haplotypes in all studied populations from Brazil (admixed and native samples), Europe (Portugal) and Africa (Ivory Coast, Cameroon, Mozambique, Democratic Republic of the Congo, Angola) is included in Table S1.
The compiled data were used to compare the Y-STR haplotype genetic profiles of the 17 different admixed populations in Brazil grouped in 5 different geopolitical regions. Additionally, the overall diversity found in the 5 different regions in Brazil was also compared with that observed in other populations.

Genetic Distances and Molecular Variance Analysis
Pairwise genetic distances were estimated based on the sum of squared size differences (RST) between the haplotype distributions observed for the 23 Y-STRs in the 17 Brazilian admixed population samples. No significant genetic distances were found between any pair of populations within or among the five different regions of Brazil, except in the comparison of Roraima with Amazonas and Mato Grosso do Sul (Table S2).
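As a rough illustration of the quantity underlying RST-type distances, the sketch below computes the sum of squared allele-size differences between two Y-STR haplotypes and averages it over all pairs drawn from two samples. It is only a minimal sketch: the function names and the three-locus haplotypes are hypothetical, alleles are assumed to be coded as repeat numbers, and it does not reproduce the full RST estimation or the permutation tests performed by Arlequin.

```python
def squared_size_distance(hap_a, hap_b):
    """Sum of squared repeat-number differences across loci (hypothetical helper)."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be typed for the same loci")
    return sum((a - b) ** 2 for a, b in zip(hap_a, hap_b))


def mean_pairwise_distance(sample_a, sample_b):
    """Average squared-size distance over all pairs, one haplotype from each sample."""
    total = sum(squared_size_distance(x, y) for x in sample_a for y in sample_b)
    return total / (len(sample_a) * len(sample_b))


# Hypothetical haplotypes: repeat numbers at three loci, not taken from Table S1.
hap_1 = (14, 13, 30)
hap_2 = (15, 13, 28)
distance = squared_size_distance(hap_1, hap_2)  # (14-15)^2 + 0 + (30-28)^2 = 5

between = mean_pairwise_distance([hap_1, hap_2], [hap_2])  # (5 + 0) / 2 = 2.5
```

A squared-difference metric of this kind weights haplotypes that differ by several repeats at one locus more heavily than haplotypes that differ by one repeat at several loci, which is in line with a stepwise view of STR mutation.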
In the MDS representation of pairwise RST genetic distances (Fig. 2), we can see that all the studied populations stand very close to the Europeans, showing a low Native American or African paternal contribution, which is repeatedly observed in admixed urban samples from different Brazilian populations studied to date (e.g., [15][16][17]).
To see if the observed Y-haplotype distribution is structured by region, we have also included in the MDS plot (Fig. 2) the samples corresponding to the 5 different Brazilian regions, obtained after pooling the data from the 17 urban samples. No patterns emerge from these analyses, and the Brazilian populations are dispersed on the plot without any kind of grouping around the region to which they belong. There are, nevertheless, two populations that stand apart from the main group, Roraima and Mato Grosso do Sul. These populations present significantly high values of genetic distances to all three reference groups from Europe, Africa and Native America. Therefore, the observed results cannot be explained by a higher proportion of African or Native American chromosomes and instead seem to be the result of drift. This slight differentiation of Roraima is supported by the low non-differentiation p-values, which in two cases reached statistical significance (Table S2).
Nevertheless, no conclusions can be drawn about the sample from Mato Grosso do Sul if we take into account that in this case the high RST values are associated with non-differentiation p-values clearly above the significance level, meaning that the apparent drift can be a consequence of a low sample size. Finally, when grouping the 17 populations in the five different Brazilian regions, AMOVA showed that almost all genetic variation is within populations (fixation indexes: FSC = 0.00055, FST = 0.00002 and FCT = -0.00053), with no significant variation within or between groups (significance test after 10,100 permutations, p ≥ 0.30).
In summary, for the 23 STRs included in the present work, no significant differences could be detected concerning haplotype distributions within or between the northern, northeastern, central western, southeastern and southern regions. Therefore, subsequent analyses were made just taking into account these 5 regions or the entire Brazilian population.

Analysis of Genetic Diversity
To evaluate the utility of increasing the number of STRs in population and forensic genetics, we compared the results obtained when using four different groups of markers: (i) the 9-set belonging to the minimal haplotype (MH) described by Kayser et al. [20]; (ii) the 17-set of loci included in the YFiler kit; (iii) a 14-set comprising the 8 YFiler loci not included in the MH plus 6 newly selected loci; and (iv) the total set of 23 markers included in the present work.
The diversity indices, calculated using the four different Y-STR sets in the five Brazilian regions, are presented in Table 1, which also includes the results obtained for the three reference population samples.
When comparing single locus diversities, similarly high values were obtained in all loci in the 5 different Brazilian regions. Nevertheless, on average, the northeastern region presents slightly higher diversity values than those observed in the remaining regions for any of the 4 selected Y-STR sets. Genetic diversities above 0.50 have been observed in all Y-STRs in north, northeast and central west. The south and southeast regions have slightly lower single locus diversities, and although not significantly different, the lowest average values calculated for the 4 Y-STR sets were always detected in the southeastern region.
In the global sample from Brazil, single locus diversities were consistently higher than in the three reference samples, which can be partly explained by the admixture of very heterogeneous male contributions. As observed in previous studies for lineage markers [21] as well as for markers in the autosomes [22], the Native Americans have the lowest genetic diversity. The lower level of polymorphism present in Africans in comparison to Europeans is most likely explained by the selection criteria of the analyzed markers, which were initially based on diversity levels found in Europeans.
At the haplotype level, the same results were obtained, with northeastern Brazil showing the highest diversity and southeastern the lowest.
When analyzing the discrimination levels produced by the 4 different sets of Y-STRs, for the 9-set 50% of the haplotypes were found in more than one Y chromosome in our Brazilian sample, and the number of different haplotypes corresponds to 62% of the total sample. As expected, discrimination improves when using larger sets, but it is higher for the 14-set (including the newly selected markers) than for the 17-set of markers in the YFiler kit (see haplotype sharing and diversities in Table 1).
For the group of 23 Y-STRs included in this study, out of 2,024 Y chromosomes that were typed from the Brazilian admixed populations, a total of 2,010 were found only once, with just seven haplotypes being shared between different samples, namely Pará (Belém) and Minas Gerais; Amapá and Maranhão; Rondônia and Rio Grande do Sul; Rondônia and Ceará; Amazonas and Maranhão; Maranhão and São Paulo; and São Paulo and Rio Grande do Sul.
In summary, the 6 additional STRs increased the discrimination capacity of YFiler, the commercial kit most widely used in Y-STR typing, by decreasing the percentage of shared haplotypes between individuals from 6.2% to 0.3% and the matching probability from 1 in 1,700 to 1 in 2,000.
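The sharing figures reported for the 23-marker set can be turned into summary statistics with a few lines of code. The sketch below is illustrative only: it assumes that each of the seven shared haplotypes was observed exactly twice (so that 2,010 + 2 × 7 = 2,024), uses synthetic labels instead of real haplotypes, and takes the matching probability to be the sum of squared haplotype frequencies, which may not be the exact estimator used for Table 1.

```python
from collections import Counter


def sharing_summary(haplotypes):
    """Count different and unique haplotypes and derive simple discrimination indices."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    n_different = len(counts)
    n_unique = sum(1 for c in counts.values() if c == 1)
    # Probability that two chromosomes drawn with replacement carry the same haplotype.
    match_probability = sum((c / n) ** 2 for c in counts.values())
    return {
        "n": n,
        "different": n_different,
        "unique": n_unique,
        "discrimination_capacity": n_different / n,
        "match_probability": match_probability,
    }


# Synthetic labels (assumption): 2,010 singletons plus seven haplotypes seen twice.
sample = [f"unique_{i}" for i in range(2010)]
sample += [f"shared_{i}" for i in range(7) for _ in range(2)]

summary = sharing_summary(sample)
one_in = 1 / summary["match_probability"]  # about 2,010, i.e. roughly 1 in 2,000
```

Under these assumptions the sample contains 2,017 different haplotypes, and the resulting matching probability is in line with the figure of about 1 in 2,000 quoted for the 23-marker set.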

Ability to Distinguish Paternal Relatives
The study of Y-STRs can be very useful in some particular applications, namely in cases of genealogy or kinship investigation, or in forensic identification, especially when female/male mixtures need to be analyzed [23]. In any of these situations, besides knowing haplotype frequencies, it is also very important to estimate the probability of finding different haplotypes in close relatives due to mutation in one or more loci.
Usually, the first concern when selecting Y-STRs is to find a group of markers that yields highly informative haplotypes in the population used as reference, which depends not only on the number of Y-STRs under analysis but also on the diversity of each locus and on the linkage disequilibrium between alleles at the different selected loci. Nevertheless, no matter the levels of polymorphism, because Y chromosome markers are transmitted without recombination, males that are paternal relatives will share the same Y-haplotype unless a mutation event takes place. This can be a strong limitation in identification or paternity cases when multiple suspects or presumed fathers belong to the same patrilineage. In these situations, the distinction of two paternal relatives will be directly related to the mutation rate of the full haplotype and the number of meioses that separate both individuals.
Therefore, we calculated the expected probability of finding different profiles in paternal relatives due to mutation(s). We applied the same formula as Gusmão et al. [24] but used locus-specific estimates to calculate the mutation rate of the complete haplotype. The mutation rate of a full haplotype was calculated as the probability of at least one copy error occurring in the meiotic transmission of its n loci. Therefore, the probability of distinguishing two paternally related Y chromosome haplotypes is:

$$P = 1 - \left[\prod_{i=1}^{n} (1 - m_i)\right]^{k} \approx 1 - (1 - m_{average})^{nk}$$

where n is the number of loci in the haplotype, k is the number of meioses separating two haplotypes, m_i is the mutation rate of the ith locus and m_average is the average mutation rate calculated for all loci included in the haplotypes.

Table 2 presents the results obtained for haplotypes based on the four previously defined sets of 9, 14, 17 and 23 Y-STRs. If two individuals are separated by a single meiosis (i.e., they are father and son), the probability of having two different haplotypes will vary from 2% to 9%, for the 9 and 23 sets, respectively. The addition of the 6 extra markers to the YFiler set doubles the probability of finding different haplotype profiles between close relatives. Therefore, even for close relatives, a larger set of markers will definitely be useful to increase the chance of Y-haplotype exclusion, which is strongly related to the mutation rate of the included markers. Still, even for individuals separated by four meioses (as is the case for two cousins, for example), no differences will be found in almost 60% of the cases, and results based on Y-STRs will present a limitation at the individual identification level when paternal relatives are in question. This is a limitation that is far from being solved with a restricted set of markers, even with exceptionally high mutation rates, as previously shown by Ballantyne et al. [25].
In that study, based on a set of 14 rapidly mutating Y-STR loci (corresponding to 13 markers, one with two loci) typed in pairs of relatives, the genotyping results showed that in more than 50% of the studied cases there was no difference between father and son. Indeed, if we consider that there is no association between different mutation events, we will need to type a total of 34 Y-STRs with an average mutation rate of 2% to obtain a 50% chance of finding a difference between father and son. A 99.99% probability of finding differences between any male relatives will require very large sets of rapidly mutating markers. To reach this probability, for STRs with an average mutation rate of 0.02 (like those selected by Ballantyne et al. [25]), it is necessary to type a group of at least 115 markers for haplotypes separated by four meiotic transmissions, a number that increases to 458 in father-son pairs (Table 2).
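The calculation described in this section can be sketched in a few lines. The code below assumes that mutations at different loci and in different meioses are independent, so that the probability of at least one difference is 1 − [∏(1 − m_i)]^k, approximated by 1 − (1 − m_average)^(nk) when only an average rate is available. The function names are hypothetical, the exact form of the formula is our reading of the text, and the 0.02 rate is the average value quoted above rather than a set of locus-specific estimates.

```python
import math


def prob_distinguish(locus_rates, meioses):
    """P(at least one mutation) given per-locus rates m_i and k separating meioses."""
    no_mutation_one_meiosis = math.prod(1.0 - m for m in locus_rates)
    return 1.0 - no_mutation_one_meiosis ** meioses


def prob_distinguish_average(n_loci, mean_rate, meioses):
    """Same probability using a single average mutation rate for all n loci."""
    return 1.0 - (1.0 - mean_rate) ** (n_loci * meioses)


# Father and son (k = 1) typed for 34 markers with an average rate of 2%.
p_34 = prob_distinguish_average(34, 0.02, 1)  # close to 0.50

# Marker numbers quoted in the text for a 99.99% chance of a difference.
p_father_son = prob_distinguish_average(458, 0.02, 1)
p_four_meioses = prob_distinguish_average(115, 0.02, 4)
```

With these inputs, 34 markers give a probability of about 0.497, and both 458 markers over one meiosis and 115 markers over four meioses exceed 0.9999, in agreement with the values discussed above. Small differences from Table 2 are possible depending on rounding and on the exact mutation rates used.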

Conclusions
In all genetic studies that are based on haploid segments of the genome, as is the case for mtDNA and the Y chromosome, it is very important to have reliable estimates of haplotype frequencies, for which large databases with good coverage are critical [8]. The high susceptibility of lineage markers to genetic drift is responsible for larger genetic distances between populations and for population substructure, especially in admixed populations.
Thus far, there is no agreement on the best approach to calculate accurate frequencies for rare Y chromosome haplotypes [26][27][28]. However, no matter the estimation approach we choose, if no substructure exists, pooling local population samples will produce a larger and therefore better represented Y-haplotype database.
Aiming to produce a large representative database of Brazil for highly discriminating Y chromosome-specific haplotypes, we have sampled more than 2,000 Brazilian chromosomes. Despite the different colonization history of each region in Brazil, it was possible to see that all populations included in the present work have very similar haplotype profiles concerning the 23 studied Y-STRs. This apparent lack of genetic heterogeneity is in accordance with previous reports that show a predominance of European male lineages in all regions of the country, in contrast with the much more admixed pattern of the maternal ancestry. There is an absence of a strong regional substructure in Brazil, which was also recently reported by Pena et al. [29] based on autosomal ancestry informative markers. Therefore, if we do not consider the diverse Native American or Afro-descendent isolates, which are spread throughout the country, a single Y chromosome haplotype frequency database will adequately represent the urban populations in Brazil.
The analyses of STRs in a large number of Y chromosomes belonging to Brazilian populations support the need for an extended battery of markers to increase haplotype diversity, allowing for a higher discrimination capacity in criminal investigation [30]. In accordance with what was previously reported by Ballantyne et al. [25], the selection of rapidly mutating STR markers will allow an increased capacity to discriminate paternal relatives. Nevertheless, a very large number of STRs will be necessary to allow the use of Y chromosome information for forensic interpretation at an individual level in cases of haplotype match.

Population Samples
The present work follows the ethical principles stated in the Helsinki Declaration (2000) of the World Medical Association and was approved by the institutional review board of the Laboratório de Genética Humana e Médica, Instituto de Ciências Biológicas, Universidade Federal do Pará. Samples involved in the study are long-lasting anonymized DNA extracts previously obtained with informed written consent from healthy individuals for research purposes. Blood samples and buccal swabs were collected from 2,024 non-related males from different Brazilian states representing five geopolitical regions (see Fig. 1 for sample locations and sizes). DNA was extracted as described previously [31] and quantified using the NanoDrop 100 (NanoDrop Technologies, USA).

Data Analysis
Haplotype diversity was calculated using the formula:

$$H = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^{2}\right)$$

where n is the number of gene copies in the sample, k is the number of haplotypes, and p_i is the sample frequency of the i-th haplotype.
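A minimal sketch of this calculation is given below. It assumes the standard unbiased estimator H = n/(n − 1) × (1 − Σ p_i²), which matches the definitions of n, k and p_i given above. The function name and the toy sample are hypothetical, and each haplotype is represented by any hashable label (for example, a tuple of repeat numbers).

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity for a sample of n gene copies."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    counts = Counter(haplotypes)  # one entry per distinct haplotype (k entries)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq_freq)


# Toy sample (hypothetical): four chromosomes, three distinct haplotypes.
toy_sample = ["h1", "h1", "h2", "h3"]
h_toy = haplotype_diversity(toy_sample)  # 4/3 * (1 - 0.375) = 0.8333...

# When every haplotype in the sample is different, the estimator equals 1.
h_all_unique = haplotype_diversity(["h1", "h2", "h3", "h4"])
```

The n/(n − 1) factor corrects for sampling, so a sample in which all haplotypes are unique reaches a diversity of exactly 1 regardless of its size.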
The total number of observed and unique haplotypes, single locus diversity, AMOVA and pairwise genetic distances (RST) were calculated with the software Arlequin 3.5.1.2 [34]. The pairwise genetic distances were visualized as a two-dimensional plot with the multidimensional scaling (MDS) method implemented in the software STATISTICA 7.0 (StatSoft Inc. 2004).