Archetypal Analysis for population genetics

doi:10.1371/journal.pcbi.1010301

Fig 1.

Archetypal Analysis pipeline.

The allele counts from both haplotypes of each of N individuals are averaged, and the resulting genotypes are reduced from M SNPs to singular vectors with N − 1 elements via the SVD. Archetypal Analysis then runs an alternating non-negative matrix factorization algorithm that minimizes a constrained sum of squares to find ancestry proportions (α) and cluster centroids (Z′; archetypes, Z′ = ZV^T). Archetypal Analysis models individual genotypes as arising from the admixture of K parental populations, where K is an input parameter. For visualization, we create bar plots of the archetype proportions given by the matrix α and project the archetypes Z into a 3D subspace spanned by the first three principal components of the individual genotype sequences.
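The pipeline in this caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`project_simplex`, `reduce_genotypes`, `archetypal_analysis`), the column centering before the SVD, the parameterization of archetypes as convex combinations of samples (Z = βX), and the use of projected gradient steps for the alternating updates are all assumptions made here for illustration.

```python
import numpy as np

def project_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]            # rows sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    cond = U - css / np.arange(1, k + 1) > 0   # True on a prefix of each row
    rho = cond.sum(axis=1)                     # size of the support (always >= 1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

def reduce_genotypes(hap_a, hap_b):
    """Average two N x M haplotype matrices and reduce to N - 1 SVD coordinates."""
    G = (hap_a + hap_b) / 2.0
    mean = G.mean(axis=0)                      # assumption: center SNP columns
    U, s, Vt = np.linalg.svd(G - mean, full_matrices=False)
    r = G.shape[0] - 1
    return U[:, :r] * s[:r], Vt[:r], mean

def archetypal_analysis(X, k, n_iter=200, seed=0):
    """Alternate projected gradient steps on alpha (N x K) and beta (K x N)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = project_simplex(rng.random((n, k)))
    beta = project_simplex(rng.random((k, n)))
    x_norm = np.linalg.norm(X @ X.T, 2)        # spectral norm, reused below
    for _ in range(n_iter):
        Z = beta @ X
        # Update proportions alpha with archetypes held fixed.
        step = 1.0 / (np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        alpha = project_simplex(alpha - step * (alpha @ Z - X) @ Z.T)
        # Update archetype weights beta with proportions held fixed.
        step = 1.0 / (np.linalg.norm(alpha.T @ alpha, 2) * x_norm + 1e-12)
        grad = alpha.T @ (alpha @ beta @ X - X) @ X.T
        beta = project_simplex(beta - step * grad)
    return alpha, beta @ X                     # proportions and archetypes Z
```

Under these assumptions, `Z @ Vt + mean` would map the archetypes back to SNP space, which plays the role of Z′ = ZV^T in the caption. Each row of `alpha` is non-negative and sums to one, so it can be read directly as a vector of ancestry proportions.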


Fig 2.

Principal component analysis and Archetypal Analysis compositional plots for human populations (K = 8).

a), two-dimensional PCA plot of human continental populations, where groups of individuals are colored by the unique regional genetic components they possess (see legend). b), compositional plot giving the proportional archetype assignment for each individual (points). Points are colored by the presence of regional genetic components (colored text), and a few example sub-populations are labeled in small black text. Clusters of individuals from the same population appear at the vertices of the polygon, while diagonals and edges between vertices indicate admixed individuals. For details on how to interpret compositional plots, see Fig G in S1 Text. c), similar compositional plot showing the results for ADMIXTURE. Note that several ADMIXTURE clusters (A4, A5, A7) are never attained by real samples. See Figs A and B in S1 Text for additional examples of Archetypal Analysis compositional plots for human continental populations.
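One common way to place individuals inside a polygon like the one described here is to put the K archetypes at the vertices of a regular K-gon and draw each individual at the weighted average of those vertices, using its row of α as weights. The sketch below assumes that construction; the exact layout used for the figure is described in Fig G in S1 Text and may differ, and the function names are hypothetical.

```python
import numpy as np

def polygon_vertices(k):
    """Vertices of a regular k-gon on the unit circle, first vertex at the top."""
    angles = 2.0 * np.pi * np.arange(k) / k + np.pi / 2.0
    return np.column_stack([np.cos(angles), np.sin(angles)])

def compositional_coords(alpha):
    """Map each row of proportions (N x K) to a 2D point inside the polygon."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha @ polygon_vertices(alpha.shape[1])
```

With this mapping, an individual assigned entirely to one archetype lands on that archetype's vertex, a two-way admixed individual lands on the segment (edge or diagonal) joining the two vertices, and an individual with equal proportions of all K archetypes lands at the center.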


Fig 3.

Comparison of ancestry estimates for human populations (K = 8).

a), three-dimensional PCA plot of individuals (small points) with projected archetypes (circles) and ADMIXTURE cluster centers (triangles). b), bar plot where individuals are represented along the horizontal axis as narrow bars ordered by population group. Within each bar, the height of each color shows the proportion of that individual assigned to the corresponding cluster. We compare the cluster assignments of ADMIXTURE (top) and Archetypal Analysis (bottom). Correspondence of numbers to labels can be found in Tables A and B in S1 Text.


Fig 4.

Principal component analysis and Archetypal Analysis compositional plots for domestic dog breeds.

a), two-dimensional PCA plot of domestic dog breeds, where groups of dogs are colored by clade. b) and c), proportional composition of each cluster for each individual in coordinate space for K = 5 and K = 15 archetypes, respectively. Data points are colored by clade, and archetype representatives are shown as drawings. Gradients between vertices indicate combinations between breeds. (We thank Ines de Vilallonga for her dog breed illustrations.)


Table 1.

Runtime (in minutes) for ADMIXTURE-AA comparison.


Fig 5.

Performance metrics analysis.

a), runtime analysis for FRAPPE, ADMIXTURE and Archetypal Analysis for K = 2 to K = 30. Time is expressed in accumulated hours. Note that for FRAPPE we only include results up to K = 5 due to computational limitations. b), explained variance comparison for ADMIXTURE and Archetypal Analysis for K = 2 to K = 22. Results are averaged over five distinct random seed values for each value of K, and the observed ranges are shown as vertical bars.
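As a rough illustration of panel b), explained variance for a factorization X ≈ αZ can be computed as one minus the ratio of the residual sum of squares to the total sum of squares around the column means. This definition is an assumption made for the sketch, and the metric used for the figure may be defined differently; `explained_variance` and `summarize_over_seeds` are hypothetical helper names.

```python
import numpy as np

def explained_variance(X, alpha, Z):
    """1 - (residual sum of squares) / (total sum of squares about column means)."""
    residual = X - alpha @ Z
    total = X - X.mean(axis=0)
    return 1.0 - np.sum(residual ** 2) / np.sum(total ** 2)

def summarize_over_seeds(values):
    """Mean and observed range of a metric across random seeds."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.min(), values.max()
```

Calling `summarize_over_seeds` on the five per-seed values for a given K would give the point and the vertical bar for that K under this definition.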


Table 2.

ADMIXTURE and Archetypal Analysis comparison.


Fig 6.

Comparison of cluster centroids from different methods.

Cluster centers learned by ADMIXTURE, ADMIXTURE with sparsity regularization, Archetypal Analysis, K-Means, and K-Medoids for K = 4 are plotted as solid circles while the underlying samples are plotted as small blue points. Regularization in ADMIXTURE is introduced with lambda = 500 and epsilon = 0.1.
