Inbreeding estimates in human populations: Applying new approaches to an admixed Brazilian isolate

The analysis of genomic data (~400,000 autosomal SNPs) enabled the reliable estimation of inbreeding levels in a sample of 541 individuals sampled from a highly admixed Brazilian population isolate (an African-derived quilombo in the State of São Paulo). To achieve this, different methods were applied to the joint information of two sets of markers (one complete and another excluding loci in patent linkage disequilibrium). This strategy allowed the detection and exclusion of markers that biased the estimation of the average population inbreeding coefficient (Wright’s fixation index FIS), which value was eventually estimated as around 1% using any of the methods we applied. Quilombo demographic inferences were made by analyzing the structure of runs of homozygosity (ROH), which were adapted to cope with a highly admixed population with a complex foundation history. Our results suggest that the amount of ROH <2Mb of admixed populations should be somehow proportional to the genetic contribution from each parental population.


Introduction
Measures of population inbreeding levels have been traditionally obtained from the direct genotyping of population samples followed by the estimation of heterozygous frequency deviations from the proportions expected under random-mating assumptions (HW expectations) or from the analysis of sets of individual or grouped genealogies (v.g., Lemes et al. [1]). The inbreeding coefficients F estimated from the two methods are however quantitatively and qualitatively different, since the first one (Wright's fixation index F IS ) estimates specifically systematic inbreeding or consanguinity, while the second (Wright's fixation index F IT ) measures the amount of total inbreeding, including the fraction due to small population effective number Ne. In humans, good examples of the usefulness of deep genealogies to measure inbreeding coefficients are the study of the isolated African-derived community of Valongo in Brazil [1,2] and the research on royal inbreeding [3,4]. Ceballos and Alvarez [4] study showed that it is possible to capture 95% of the actual inbreeding coefficient with a pedigree of at least 10 generations depth. On practical grounds, however, only in rare instances it is possible to include precise relationship information on more than three or four generations. PLOS  The situation has changed dramatically with the recent use of large datasets of genomic autosomal single nucleotide polymorphisms (SNPs), allowing the identification of long tracts of consecutive homozygosity (runs of homozygosity or ROH) in human population samples. Studies using this approach have revealed high levels of autozygosity even in cosmopolitan non-inbred populations, showing that there exists, as expected by the out-of-Africa model of human origins, an increment of inbreeding levels and a significant reduction of genetic diversity which is proportional to the geographic distance from Africa [5][6][7]. An important mechanism contributing to a large portion of genomic homozygosity levels, composed mainly by short and intermediate-sized ROH, is background relatedness, which results from the combined effects of demographic and evolutionary events, such as remote inbreeding, geographic isolation, small population size with bottleneck and founder effects, and long-lasting and stable systems of endogamous marriages due to the persistence of cultural traditions [5,[7][8][9][10].
Population isolates are powerful tools for medical and evolutionary studies, since many of them have well documented pedigrees, high prevalence of individuals affected by rare genetic conditions, high degree of inbreeding due to cultural practices or limited population size, and a demographic history of foundation consisting of bottlenecks followed by founder effects [11]. Even in the case of population isolates without well documented pedigrees and a paucity of historical records, reliable genetic information can be obtained from the analysis of large SNP datasets. Several studies of inbreeding and demographic history have been successfully performed around the world in isolated populations with variable amounts of genealogical documentation and historical records of population-based evolutionary phenomena [8,[12][13][14][15][16].
The admixture of populations with different genetic backgrounds can create high levels of linkage disequilibrium (LD), which, in addition to taking many generations to disappear, will interfere with the distribution of ROH. Studies on LD have shown that haplotype sizes rarely surpass 100kb in humans and that total individual ROH lengths measured, with or without LD pruning, are the same when considering ROH longer than 1.5Mb [8,[17][18].
Inbreeding levels are thus informative about a population's history of admixture events, demography, and can also be related to its genetic load and to the prevalence of genetic diseases. Consequently, the reliable estimation of inbreeding levels is of central importance to human population genetics.
By means of the analysis of a dataset of genomic autosomal single nucleotide polymorphisms (SNP), we make inferences on inbreeding levels and demographic history of a Brazilian isolate with about 40% African, 39% European and 21% Native American contribution [19]. This study presents: (1) an alternative way to estimate the population inbreeding coefficient (Wright's fixation index F IS ), based solely on the analysis of a high-density SNP array, in order to compare its statistical parameters with the individual estimates obtained from software PLINK v1.9 [20]; (2) the application of a sliding window approach to identify ROH in a population that underwent a complex demographic history with tri-hybrid ancestral contribution; (3) a comparison between individual estimates of the inbreeding coefficient obtained from ROH data (equivalent to F IT ) and from software PLINK v1.9 (equivalent to F IS ).

The Brazilian Quilombo (QUI) admixed population
The present study was performed in an admixed Brazilian isolate located in the Ribeira River Valley, in the southern part of the State of São Paulo, Brazil (Fig 1). This isolate, known in Brazil as a quilombo, was founded around 1890 by runaway, abandoned and freed slaves (some of them being the admixed offspring of white farm owners and African female slaves) and a few pure or mixed native Americans, who created small rural settlements in isolated areas inside the Atlantic rainforest for several generations (other details of interest on the quilombo population structure and demography are described elsewhere [1,19,21]). The isolate aggregates twelve communities that were treated as a single one, since the degree of differentiation among its communities is very low, with F ST indices generally smaller than 0.05 [1].
Some fifty years ago a road was built near the communities and a significant degree of migration between neighboring populations began to take place. Because of this recent history of admixture, some people argue that the quilombo reported here does not represent a true isolate anymore. In order to warrant or preserve the isolate condition with which we classify this population aggregate, all individuals selected for this study, aged between 17-65 years, have at least two generations of local quilombo ancestors.
DNA samples were extracted from peripheral blood and genotyped with the SNP array Axiom Genome-Wide Human Origins (~600,000 SNPs) according to the manufacturer's standards (Affymetrix/Thermo-Fisher Scientific). We analyzed DNA samples from 541 individuals (S1 Table) from the Ribeira River Valley, 365 of them having already been genotyped in a previous study [22] and the remaining 176 samples of this study. The research was approved by the Ethics Committee, Instituto de Ciências Biomédicas, Universidade de São Paulo (111/CEP, Feb. 14 th 2001), and an informed consent was obtained from all its participants or their legal guardians.

HGDP samples
Data of 934 humans were selected from dataset 11 of the Human Genome Diversity Project (HGDP), which includes individuals from Africa (105), Europe (151), Middle East (160), Central South Asia (197), East Asia (231), Oceania (28), and America (61), many of them from population isolates. This sample was also genotyped for the same set of~600,000 SNPs described in the section above and is available at ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/.

Data preparation
In order to minimize the effects of genotyping error, we carried out in the QUI dataset a process of data cleaning which excluded: (1) all markers with low quality scores, using the software Genotype Console Software v.4.2 according to the manufacturer's standards parameters (Genotype Console Workflow-Affymetrix/Thermo Fisher Scientific); (2) all markers with significant differences in missing data proportions between groups (defined by sex, experimental batch, and subpopulation status) using the R package GWASTools v.3.5 [23]; (3) all genotyped loci with more than 10% of missing values; (4) all data from loci with extreme deviations from Hardy-Weinberg proportions (P 10 −8 ), using the asymptotic exact test [24] by means of the software PLINK v1.9 [20]. We also excluded (1) all data from autosomal triallelic markers, mitochondria and X and Y chromosomes (X markers were excluded because after data cleaning and merging their number was critically reduced); (2) all markers with minor allele frequency (MAF) of 0, i.e., all alleles that were fixed; and (3) all data corresponding to markers within the 2Mb of the extremities of all chromosome arms, for which genotyping is less reliable. The final QUI set consisted of data from 477,426 autosomal SNPs.
We also excluded loci data of each HGDP population having more than 10% of missing values, extreme deviations from Hardy-Weinberg proportions (P 10 −8 ), or within 2Mb from the extremities of all chromosome arms. The final HGDP set was merged with QUI, consisting of data from 388,702 autosomal SNPs.

Estimation of the inbreeding coefficient (Wright's fixation index F IS )
With the aim of comparing their statistical parameters (mean, median, variance, and 95% approximate and 'exact' confidence intervals), Wright's fixation index F IS was estimated using two different methods (detailed in the paragraphs below). The first method obtains the population inbreeding coefficient averaging the fixation indices estimated from each locus of all sampled individuals; in the second one the population inbreeding coefficient is obtained by averaging the fixation indices indirectly obtained from all sampled loci of each individual. As one can guess, the two methods should be a priori grossly equivalent, but (as stressed before) we are interested only in comparing their corresponding parameters with the obvious aim to verify whether one out of the two might be occasionally more appropriate, adequate or convenient to use. As far as we can tell, this has not been performed in the literature before.
To obtain the average estimates (across all loci of all individuals from QUI sample) of Wright's fixation index F IS we used the information from (1) all 477,426 SNPs (complete dataset) and (2) 11,321 SNPs with no LD (no-LD dataset), obtained using the software PLINK v1.9 [20], considering a threshold of r 2 = 0.0071, which corresponds to a critical 5% chi-square value of χ 2 = 3.841, pairwise estimated in sliding windows of 50 SNPs incremented in steps of 5.
First estimate of Wright's F IS coefficient. Using the first method (described above), the inbreeding coefficient F IS = f k was obtained for each biallelic locus by means of the classical and basic formula where P k (Aa) and 2p k q k are respectively the observed and HW expected frequencies of heterozygous genotypes at the k-th locus. The mean population inbreeding coefficient ( " f ) over all loci was obtained weighing the per locus f k estimates by the reciprocals of their corresponding variances: with where n is the number of loci and var(f k ) is the estimate of the variance of f k , obtained for each biallelic locus by the formula [25][26][27]: where N is the sample size, and p k and q k are the frequencies of the alleles segregating at the kth bialellic SNP locus.
In the long run, one expects that the estimates of f k thus obtained should be normally distributed around the average value of " f , with the limits of the usual 95% confidence interval being given approximately by " f AE 1:96 with x k as defined in formula (3) [1]. We also ranked the values of f k in order to obtain the median and its observed ('exact') 95% confidence interval corresponding to the set of all values between the limits of the 2.5 th and 97.5 th percentiles.
Second estimate of Wright's F IS coefficient. The estimate of the inbreeding coefficient F IS for each individual of the sample, referred here as f' i , was obtained by means of the function -het of the software PLINK v1.9 using the expression: where O i and E i are the observed and expected numbers of homozygous genotypes considering all L i genotyped autosomal SNPs of individual i [20]. The mean value ( " f 0 ) was obtained by averaging all f' i estimates; the corresponding 95% confidence interval of the whole distribution were obtained either using a normal approximation as before, or by ranking all the individual values.

Identification of runs of homozygosity (ROH)
The identification of ROH was performed in the merged data (QUI + HGDP) by means of the software PLINK v1.9 [20], a method that has been successfully applied in many studies, enabling meaningful comparisons between different populations, cohorts, and array chips [17,18]. The algorithm is known to be also able to provide reliable ROH calls even when using data from next generation sequencing [28,29]. We considered here the same criteria described by McQuillan et al. [8] and Kirin et al. [5]: a sliding window with 50 SNPs; a maximum of one heterozygous genotype and five missing calls allowed per window; a proportion of windows that overlap to form an homozygous segment of 5%; a density of at least one SNP per 50kb; and a maximum gap between consecutive SNPs of 100kb. All the analysis was performed considering three different sets of ROH, identified considering minimum lengths of 500kb, 1.5Mb, and 5Mb.

Estimation of inbreeding coefficient from ROH
Individual and population inbreeding coefficients were also estimated using ROH data. The F ROH , defined as the genomic autosomal proportion of ROH of an individual, was estimated by the expression [8]: where SL ROH corresponds to the length of ROH and L auto corresponds to the total genomic region covered by the SNP array. Averaging the values of all individual F ROH s we obtain a parameter that is equivalent to Wright's fixation index F IT .

Estimation of the average inbreeding coefficients f and f'
For the estimation of the first coefficient (f) we performed the analysis of complete and no-LD datasets using two approaches: (1) obtaining the " f estimates for subsets of markers in different MAF (minor allele frequency) bins; and (2) observing the behavior of per locus estimates of f k . Average values were estimated for subsets of markers according to thresholds of MAF ! {0, 0.01, . . ., 0.49}, showing a marked shift to negative values for markers with MAF < 0.1, and tendency to reach a plateau for MAF values close to 0.2 and higher (Fig 2).
Considering now the behavior of individual f k estimates across all loci of both datasets (S1 Fig), we notice that in spite of a huge amount of estimates obtained from markers with MAF < 0.1 holding positive values, almost half of f k estimates have near zero and negative values. While these positive f k values are associated with larger var(f k ), the negative ones are associated to much smaller values of var(f k ), some of them also very near zero (Fig 3). It makes clear the existence of negative values of f k with very small variance values responsible for creating biased average " f À value, since the average value of f k is calculated after " f ¼ X x

(formulae 2 and 3).
We also observed that the smallest values of MAF are associated with highly heterogeneous var(f k ) values (S2 Fig). The f k values associated with lowest var(f k ) estimates are strongly influencing the estimation of " f , probably being responsible for preventing the average " f À value to reach the plateau shown for higher MAF values in Fig 2. Therefore, an increase in the proportion of heterozygotes and in negative f k values is associated with estimates of var(f k ) close to zero, whose reciprocal values are incredibly large, thus creating a significant bias in the estimation of the population average " f À value, even when occurring in relation to a very few number of loci.
Taking into account the facts above and the results shown in Fig 2, in order to avoid the use of markers associated with obvious biases in the estimation of the average inbreeding coefficient " f , we considered in our final analysis, presented in the paragraph below, only loci with MAF ! 0.2.
In spite of having their original datasets dramatically reduced in size (the complete one from 477,426 to 232,240 SNPs and the no-LD one from 11,321 to 9,026 SNPs), the f k -values virtually retained their original properties of being symmetrically and normally distributed around their mean and median estimates. Taking into account that both sets were cleaned from most of their biases and errors, the parameters extracted from them (shown in Table 1 below) surely constitute much more reliable estimates.
The average population " f 0 value (obtained averaging the estimates of f ' i from QUI sample using the no-LD SNP dataset) was 0.0112; the median, obtained from the whole f ' i distribution, was 0.0056, with corresponding 95% confidence interval limits of -0.0370 and 0.0986 ( Table 1). The limits of the 95% c.i. using a normal approximation were respectively -0.0514 and 0.0738. Interestingly, these estimates are not very different from those obtained using the traditional methods mentioned above.
The two methods, as previously guessed, are equivalent, since the estimates of the inbreeding coefficient obtained from them are of the same order of magnitude. The first method (that averages the fixation indices estimated from each locus of all sampled individuals), however, in order that non-biased estimates of f be avoided, should be applied to a dataset excluding all markers with a MAF < 0.2, at least for our population. Also, comparing, the corresponding statistical parameters of both f and f', we notice that the variance of f' is significantly lower than that of f, at least when dealing with sample the sizes of ours. In any case, it seems that the second method (that averages the fixation indices indirectly obtained from all sampled loci of each individual) seems to be, on practical grounds, more convenient to use than the first one.

Inbreeding and demographic inferences from ROH
The inbreeding coefficients F ROH of all individuals of the 52 populations (51 from HGDP + QUI) were assessed separately and grouped in continental regions and considering ROH above three thresholds (0.5, 1.5 and 5Mb). In all cases, as expected by the out-of-Africa model of human origins, African and Native American average F ROH estimates (mean and median) were the lowest and the largest values, respectively (Table 2), given that African are the most diverse human populations, while Native American are the least one. For ROH above 0.5Mb, Quilombo have the second lowest F ROH 0.5 estimate, suggesting a high amount of genomic variability, probably influenced by the process of admixture. Considering the ROH cut-off of 1.5MB, the QUI average F ROH 1.5 estimate is higher than the obtained for Africa and Europe and close to the Asian one, a pattern similar to that observed for ROH above 5Mb (F ROH 5 ). As shown by McQuillan et al. [8], F ROH 1.5 correlates the best with estimates obtained from pedigree analysis. Our results show that quilombo populations have, on average, an estimate of F ROH 1.5 a bit higher than the one corresponding to the progeny of third cousins. The F ROH 1.5 estimates were also obtained considering each population separately (S2 Table). The QUI average F ROH values are much lower than those obtained for other isolates like Native Americans Karitiana (~0.10) and Surui (~0.15), but very similar to the estimates from African isolates like Biaka (~0.016) and Mbuti pygmies (~0.014), and San (~0.018), that showed to be at least twice the value observed for Bantu (~0.007), Mandenka (~0.005), and Yoruba (~0.004).
When the individual patterns of ROH are analyzed (Fig 4), one notices that the average total ROH length from QUI composed by ROH lower than 1Mb is higher than African and smaller than European and American total lengths, which is expected since the isolate was founded by individuals of these three different ancestries and the amount of genomic ROH, especially those lower than 1Mb (which reflect the presence of ancient events that occurred in the parental populations), should be approximately proportional, but lower, to the genomic contribution of each parental population. This result suggests that LD patterns of admixed populations are strongly influenced by the LD patterns of the populations from which founder individuals originated.
Considering now the largest ROH (>6Mb), the QUI sample sums, in average, 50 Mb by individual, which is approximately twice the proportion observed for African and European samples. It highlights the occurrence of close inbreeding for at least part of the population and, less probably, the contribution of Native American ancestry components, that also harbor comparatively large portions of very long ROH. Since the foundation of the quilombo population is extremely recent (~8 generations), an interpretation for these results is that ROH <2Mb is probably still capturing non-recent inbreeding events that occurred in parental populations, and only the very large ROHs (certainly those >6Mb) reflect events occurring after origin of quilombo communities. Single ROH larger than 35Mb were found in seven (out of the total of 541) individuals.
We also plotted the numbers of ROH against their total lengths by continent (Fig 5), in order to obtain some additional information on demographic events occurring in the populations [18]. The QUI sample, as expected, showed to have, on average, a low number and a small total length of ROH, far lower than those observed for Native Americans, due probably to the occurrence of admixture, that inserts variability in the population. The presence of inbred QUI individuals is also suggested, since endogamous individuals have a proportionally high total ROH length, highlighted by a departure in the right direction from the main diagonal of the graph.

Relationship between f' and F ROH estimates
The quilombo values of f ' and F ROH were estimated using two different techniques that should be correlated, since they are associated to the inbreeding levels of the population, corresponding to Wright's F is and F it respectively. The scatterplots of Fig 6 show  The analysis above showed that F ROH 0.5 correlates better with f ' than the other estimates, which makes sense, since smaller ROH are more informative on demographic events occurring before recent inbreeding.

Discussion and conclusions
This study dealt with the issue of estimating parameters related to the system of marriages, inbreeding levels, and population/demographic events of a complex tri-hybrid admixed population.
Using information from both complete and no-LD datasets, we estimated Wright's fixation index F IS using two alternative methods. The first method obtains the population inbreeding coefficient averaging the fixation indices estimated from each locus of all sampled individuals; in the second one the population inbreeding coefficient is obtained by averaging the fixation indices indirectly obtained from all sampled loci of each individual. Our analyses showed that the lowest var(F IS ) values might be pivotal in creating biased estimates of F IS -values even occurring in only a few markers; and that the optimal range of MAF for using in the estimation process in the QUI sample is in the range of 0.2 MAF 0.5. The two methods supplied reliable estimates with equivalent values, but since the second one can be directly applied without any further sample selection it is more convenient to use on practical grounds. Interestingly, the estimates we obtained do not diverge significantly from the ones obtained in a previous study of our group [1] using a far smaller number of markers (14 SNPs and 16 microsatellites) from the same population.
In relation to the ROH analysis, we used a reliable method to identify these regions in 52 populations (QUI plus 51 from HGDP), in order to occasionally obtain information about evolutionary forces acting in multiple time scales [7,30].
Taking into account ROH <2Mb, the quilombo population has an intermediate average total length of ROH when its parental population sources (Africa, Europe, and America) are considered. This suggests that the amount of shorter ROH is somehow proportional to the amount of corresponding ROH inherited from the parental stocks. Due to a complex Inbreeding estimates in human populations admixture of individuals from different genomic sources, a factor that introduces genetic variability into the admixed population, its average fraction of shorter ROH should be lower than (but still proportional to) the real contributions from each parental stock.
For homozygous segments larger than 6Mb, the total average lengths of ROH obtained from QUI showed to be approximately twice the estimates from Africa and Europe, reflecting the presence of a very recent and significant amount inbreeding.
We also detected significant positive correlation coefficients between the individual estimates of F ROH and F IS , especially when the set of all ROH above 0.5Mb was considered.  Table. Estimates of inbreeding coefficient from ROH by population. Mean, median and corresponding observed 95% confidence intervals of individual inbreeding coefficients F ROH per continent, considering ROH above 1.5Mb. The estimates were made considering 52 populations (QUI plus 51 from HGDP). (DOCX)