MixFit: Methodology for Computing Ancestry-Related Genetic Scores at the Individual Level and Its Application to the Estonian and Finnish Population Studies

Ancestry information at the individual level can be a valuable resource for personalized medicine, for medical, demographic and historical research, and for tracing personal history. We report a new method for quantitatively determining personal genetic ancestry based on genome-wide data. Numerical ancestry component scores are assigned to individuals based on comparisons with reference populations. These comparisons are conducted with an existing analytical pipeline that makes use of genotype phasing and similarity matrix computation, together with our addition, multidimensional best fitting by MixFit. The method is demonstrated by studying the Estonian and Finnish populations in their geographical context. We show the main differences in the genetic composition of these otherwise close European populations and how they have influenced each other. The components of our analytical pipeline are freely available computer programs and scripts, one of which was developed in house (available at: www.geenivaramu.ee/en/tools/mixfit).

1. Reference individuals' genome-wide data were compiled in ped/map files (Plink format) so that each ancestry reference group (22 in total) was represented by the same number of individuals (45, a limit set by the smallest reference group), based on self-reported ancestry. The unknown individual's data were appended to the end of the reference file.
2. Compiled genotype data were phased with SHAPEIT. The results were converted into the IMPUTE2 format as input for the subsequent ChromoPainter step (using a script from the ChromoPainter website).
3. ChromoPainter was used to divide the phased genome data into chunks based on genetic identity. The resulting chunkcounts file is a matrix that lists the pair-wise similarity between individuals in terms of the number of common genome chunks. Each genome chunk is always assigned to the best-fitting individual pair. This means that all individual pairs "compete" for the chunk assignments, and as a result it is important that each unknown data set experiences the same "neighbors" (is combined with the same reference data sets) during the chunk assignment process. Each ChromoPainter run produced an array of 991 numbers (ARRAY) indicating that particular individual's similarity with all 990 reference individuals and itself. The same ChromoPainter analysis was also repeated for all references in the absence of any unknowns, so that a matrix of 990 x 990 numbers (MATRIX) was produced showing every reference individual's similarity with all other reference individuals.
4. Chunkcount matrix manipulations. The ARRAY contains counts of common chunks between the unknown individual and the reference individuals. Each reference belongs to one of the reference groups (nationality groups). For the unknown, the common chunk counts of all references are averaged within each reference group. As a result, the individual is characterized by their similarity with each reference group as a whole (via a "hypothetical average person") and no longer with each reference individual separately. This horizontal compression reduces the number of columns in the matrix to the number of reference groups (22). The same horizontal compression is also carried out for the MATRIX. Since the MATRIX contains the same individuals both horizontally and vertically, it is additionally compressed vertically following the same logic. The resulting MATRIX has both dimensions equal to the number of reference groups, and each value represents the average number of common chunks between two reference groups. The reference groups in the MATRIX are now expressed in the same way as the individuals in the ARRAY. The ARRAY and the MATRIX are additionally scaled across the columns so that the mean value in each row becomes equal to one. These steps create genetic similarity matrices a) between the unknown and each reference population and b) between each reference population and all other reference populations. Note that the described matrix manipulations can be performed with MS Excel or similar software.
5. MixFit analysis. MixFit finds the mathematical best fit between the ARRAY and the rows of the MATRIX to determine the mix (amalgamate) of references that best describes the unknown in terms of the normalized average common chunk distribution. The three best-fitting references (rows of the MATRIX) are identified and quantified for each unknown. The fractional values of the references that best explain the ancestry of the unknown are called the ancestry components. The best three ancestry components are determined by considering all combinations of all references. This "amalgamate" deconvolution is similar to decomposing a composite color into its component colors (example: green = 50% blue, 50% yellow, 0% red). The MixFit fitting process is a multi-dimensional fit in which the similarity between an individual and a reference group is considered maximal when the sum of all sub-distances linking the individual and the reference is minimal. The sub-distances are those between the ancestry components of the individual and a reference, and are expressed as group-averaged and scaled common genomic chunk counts. With 22 reference groups, the sum of all 22 distances between the individual and all references would have to be minimal. As an example, a person is most likely to belong to a hypothetical Group A not when the genetic identity (chunk count) with the average Group A individual is the highest, but when all of the person's distances from the other groups are as similar as possible to those between Group A and the other groups (this is further explained below). Therefore the distance between two groups is not defined simply as the distance between certain genetic ancestry components but as a global best fit of all ancestry components considered. This approach enables better deconvolution of the ancestry components because the distances are not simply linear measures but rather locations on a multi-dimensional landscape.
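The matrix manipulations of step 4 can be sketched in a few lines of Python. This is a minimal illustration on a toy data set (4 reference individuals in 2 groups), not the authors' implementation. The function names and the toy numbers are hypothetical. The sketch also assumes that the unknown's self-similarity entry has already been dropped from the ARRAY and that the diagonal of the MATRIX is averaged like any other entry; the text does not specify either point.

```python
from collections import defaultdict

def compress_columns(row, groups):
    """Horizontal compression: average the chunk counts within each reference group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, group in zip(row, groups):
        sums[group] += value
        counts[group] += 1
    return [sums[g] / counts[g] for g in sorted(sums)]

def scale_row(row):
    """Scale a row so that its mean value becomes one."""
    mean = sum(row) / len(row)
    return [value / mean for value in row]

def compress_matrix(matrix, groups):
    """Compress the reference MATRIX horizontally, then vertically, then scale each row."""
    horizontal = [compress_columns(row, groups) for row in matrix]
    compressed = []
    for g in sorted(set(groups)):
        members = [horizontal[i] for i, gi in enumerate(groups) if gi == g]
        # Vertical compression: average the rows that belong to the same group.
        compressed.append([sum(col) / len(members) for col in zip(*members)])
    return [scale_row(row) for row in compressed]

# Toy data (hypothetical): group label of each reference individual.
groups = ["A", "A", "B", "B"]
# ARRAY: chunk counts between the unknown and each reference individual.
array = [10, 14, 6, 2]
# MATRIX: chunk counts between all pairs of reference individuals.
matrix = [
    [0, 12, 4, 4],
    [12, 0, 4, 4],
    [4, 4, 0, 8],
    [4, 4, 8, 0],
]

unknown_scaled = scale_row(compress_columns(array, groups))
reference_scaled = compress_matrix(matrix, groups)
print(unknown_scaled)
print(reference_scaled)
```

After these steps the unknown and every reference group are described by one value per reference group, with a row mean of one, so they can be compared directly in the fitting step.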

Section B. MixFit algorithm
The MixFit application allows various assignment settings to be changed and the assignment algorithm thereby modified (usually so that it responds to the training sets most adequately). The algorithm used in this work is as follows. MixFit isolates up to 3 reference populations that collectively most closely resemble the composition of the unknown. Initially there are a total of n reference populations. All references are tested (3 at a time) against the others by gradually changing their relative ratios in the mixture of 3 references and comparing the result with the unknown.
As the reference fractions are systematically varied relative to one another, three at a time (ONE is varied from 0 to 1, TWO is varied correspondingly from 1 to 0 and THREE is held constant; then the same logic is repeated for a new value of THREE), the fit between the mixture and the unknown fluctuates smoothly between better and worse. The local best-fit minima are detected and their corresponding reference ratio values are saved. The values that were among the best 30% of minima values (this can be changed with the "-a1" flag followed by a number between 0 and 1 in the command line) are kept for the following steps.
Once all reference combinations have been tested (in increments of 0.01; this can be changed), the reference ratio values from all runs that ranked among the best 20% (in terms of fit with the unknown, from among those that passed the "-a1" filter; this threshold can be changed with the "-a2" flag followed by a number between 0 and 1 in the command line) are summed per reference. Each reference now has a value that indicates how much it was "needed" across all simulations to achieve the best fit. The references are ranked according to these scores, and the three highest-ranking references are the ancestry components for the unknown. Because the three components may all have come from unrelated simulation runs, one more simulation is performed to find the best ratios among the three chosen references. For this, a combinatorial simulation is carried out in which all ratios of all three references are tested against the unknown. The best 10% of the values (this can be changed with the "-a3" flag followed by a number between 0 and 1 in the command line) are averaged to give the final estimate of the best ratio between the three references.
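The ranking and refinement stages described above can be illustrated with a simplified Python sketch. It is not the MixFit source code and makes several assumptions that are not stated in the text: the fit is measured as the sum of absolute differences between the mixture and the unknown, the three fractions always sum to one, and the local-minimum detection with its "-a1" filter is replaced by a flat ranking of all tested mixtures. All function names and the toy reference rows are hypothetical.

```python
from itertools import combinations

def distance(mix, unknown):
    # Assumed fit measure: sum of absolute differences over all reference groups.
    return sum(abs(m - u) for m, u in zip(mix, unknown))

def mixtures(step):
    """Yield every (f1, f2, f3) on a grid of the given step with f1 + f2 + f3 = 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield i / n, j / n, (n - i - j) / n

def blend(rows, fracs):
    """Weighted mixture of three reference rows."""
    return [sum(f * row[k] for f, row in zip(fracs, rows)) for k in range(len(rows[0]))]

def rank_references(matrix, unknown, step=0.01, keep=0.2):
    """Sum the fractions each reference receives in the best-fitting mixtures
    (analogous to the "-a2" stage) and return the three highest-scoring references."""
    trials = []
    for trio in combinations(range(len(matrix)), 3):
        rows = [matrix[i] for i in trio]
        for fracs in mixtures(step):
            trials.append((distance(blend(rows, fracs), unknown), trio, fracs))
    trials.sort(key=lambda t: t[0])
    kept = trials[: max(1, int(len(trials) * keep))]
    scores = [0.0] * len(matrix)
    for _, trio, fracs in kept:
        for i, f in zip(trio, fracs):
            scores[i] += f
    return sorted(range(len(matrix)), key=lambda i: -scores[i])[:3]

def refine(matrix, unknown, trio, step=0.01, keep=0.1):
    """Test all ratios of the chosen trio and average the best-fitting ones
    (analogous to the "-a3" stage)."""
    rows = [matrix[i] for i in trio]
    trials = sorted((distance(blend(rows, f), unknown), f) for f in mixtures(step))
    kept = trials[: max(1, int(len(trials) * keep))]
    return [sum(f[k] for _, f in kept) / len(kept) for k in range(3)]

# Toy reference rows (hypothetical), each already scaled to a mean of one.
matrix = [
    [1.6, 0.8, 0.8, 0.8],
    [0.8, 1.6, 0.8, 0.8],
    [0.8, 0.8, 1.6, 0.8],
    [0.8, 0.8, 0.8, 1.6],
]
# Unknown constructed as 60% of reference 0 and 40% of reference 1.
unknown = [0.6 * a + 0.4 * b for a, b in zip(matrix[0], matrix[1])]

top3 = rank_references(matrix, unknown, step=0.1)
fractions = refine(matrix, unknown, top3)
print(top3, fractions)
```

Because the final fractions are an average over the best 10% of mixtures rather than the single best mixture, they land near, but not exactly on, the 60/40 split used to build the toy unknown.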

Section C. Explanation how MixFit algorithm works
Below is a simplified example of how the MixFit algorithm computes distances and assigns components. Genetic similarity values of the references and the Unknown are represented as values with the mean scaled to one. GOF (goodness of fit) scores are determined for the Unknown relative to each reference as shown:

Section D. Stability of the method
We determined the stability of our pipeline by analyzing a subset of individuals multiple times. Twenty randomly selected individuals (self-reported Estonians) were analyzed with the SHAPEIT-ChromoPainter-MixFit pipeline 5 times. SHAPEIT is a stochastic process and introduces variation. The amount of variation was quantified. The main ancestry component assignment (considering 5 assignments for each individual) was used as the reference, and the discrepancies from it were found (among the 5 assignments).
Of the 20 individuals, 15 were assigned identically all 5 times. Three individuals had one discrepancy (one ancestry component was assigned differently) in one of the five assignments (misses=3*1), one individual had one discrepancy in 2 of the 5 assignments (both were the same; misses=1*2), and one individual had one discrepancy in 3 of the 5 assignments (two different types; misses=1*3), for a total of 8 misses. The overall constancy is therefore estimated as the fraction of all assignments that were not misses: ((20*5) - 8) / (20*5) = 92%. In all cases (within each set of 5 assignments) two of the three ancestry components were always the same. The average variance of assignment of the major ancestry component across all 20 individuals was 0.0072.
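The constancy arithmetic can be checked with a few lines of Python. The variable names are illustrative only; the counts are taken directly from the paragraph above.

```python
individuals, repeats = 20, 5

# (number of individuals, discrepant assignments per individual)
miss_groups = [(3, 1), (1, 2), (1, 3)]

misses = sum(n * k for n, k in miss_groups)   # 3*1 + 1*2 + 1*3
total = individuals * repeats                 # all assignments made
constancy = (total - misses) / total          # fraction of assignments without a miss

print(f"misses={misses}, constancy={constancy:.0%}")
```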

Section E. Sensitivity to replication
Some individuals (92)
The data from the Health 2000 (Finland) used in this study contains self-reported nationality information for 2000 individuals. Among them 10 individuals have two (self-defined) foreign parents. We analyzed these individuals to determine the fit with the computed ancestry (

Section H. Estonian cohort description
The Estonian Biobank is the population-based biobank of the Estonian Genome Center of the University of Tartu (EGCUT

Section J. Chromosome selection
An ancestry assignment experiment was carried out with data from the 22 autosomal chromosomes. We compared the results from the individual chromosome experiments with those of the whole-genome analysis and found that chromosome 1 alone represented the whole genome best. We concluded that, in the interest of computational feasibility, ancestry assignments can be based on chromosome 1 alone.