Table 1.
The capabilities of three PCA-based methods (PCA scatter plots with optional clustering, and association testing of SNPs against either cluster labels or PC coordinates) are summarized. We compare the methods on detecting, genotyping, and localizing inversions in terms of capability, ease of use, and potential for ambiguous results.
Fig 1.
Workflows for detecting, localizing, and genotyping inversions.
The three approaches (PCA with clustering, PC-SNP association testing, and Cluster-SNP association testing) all begin with performing PCA on a feature matrix generated from SNP data. K-Means clustering is performed using the PC coordinates to infer genotypes. The inferred genotypes and PC coordinates of the samples are represented using scatter plots. Association testing can be performed between the samples’ SNP genotypes and either the PC coordinates or cluster labels. The p-values from the association tests are plotted along the chromosome in a Manhattan plot to visualize the spatial distribution of the associations and detect and localize inversions.
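The workflow described above can be sketched end to end on synthetic data. This is a minimal illustration, not the authors' implementation. The function names, the plain genotype-count feature matrix, the hand-rolled k-means, and the use of a Pearson correlation with a Fisher z approximation as the association test are all assumptions made to keep the example self-contained.

```python
import math
import numpy as np

def pca_coordinates(features, n_components=2):
    """Project samples onto the top principal components via SVD."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def kmeans_labels(points, k, n_iter=100, seed=0):
    """Basic Lloyd's k-means; returns one cluster label per sample."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def association_pvalues(genotypes, target):
    """Approximate two-sided p-value per SNP for association with `target`.

    Assumption: Pearson correlation with a Fisher z normal approximation,
    used here as a stand-in for whatever association test the paper applies.
    """
    n = len(target)
    g = genotypes - genotypes.mean(axis=0)
    t = target - target.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (t ** 2).sum())
    r = np.divide(g.T @ t, denom, out=np.zeros(g.shape[1]), where=denom > 0)
    r = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r) * math.sqrt(n - 3)
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

# Synthetic example: 60 samples, 200 SNPs, SNPs 80..119 track a
# hypothetical inversion genotype (0, 1, or 2 inverted copies).
rng = np.random.default_rng(1)
inv = np.repeat([0, 1, 2], 20)
geno = rng.integers(0, 3, size=(60, 200)).astype(float)
geno[:, 80:120] = inv[:, None]

pcs = pca_coordinates(geno, n_components=2)       # scatter-plot coordinates
labels = kmeans_labels(pcs, k=3)                  # inferred genotype clusters
pvals = association_pvalues(geno, pcs[:, 0])      # PC-SNP association (PC 1)
manhattan_y = -np.log10(np.maximum(pvals, 1e-300))  # y-axis of a Manhattan plot
```

Plotting `manhattan_y` against SNP position would give the Manhattan plot; passing the cluster labels instead of `pcs[:, 0]` would approximate the Cluster-SNP variant, although a test suited to categorical labels would be more appropriate there.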
Table 2.
Characterization of SNP data sets.
A benchmark data set for evaluating methods for inversion detection using SNP data was formed from data for three insect species (D. melanogaster [37, 38], An. gambiae and coluzzii [17, 39]). The chromosome arms were organized into three test cases (negatives, positives drawn from a single population, and positives drawn from multiple populations) based on known inversion genotypes from previous papers. We analyzed SNPs from the 2R chromosome arms of An. gambiae and coluzzii but do not include these data in our benchmark data set since not all inversions were fully characterized. For each chromosome arm, the geographic locations in which the samples were collected, the species of the samples, the number of samples, the inversions identified in these data by the original authors and their frequencies, and the number of SNPs are provided.
Fig 2.
Analysis of chromosome arms without known major inversions (Drosophila 3L, 6 samples with inversion excluded (see Methods); 150 Anopheles 3L; and 34 Anopheles 3L). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–l, one Manhattan plot per PC).
Table 3.
Occurrences of 2La inversion genotypes by location for 34 Anopheles samples.
The 2La inversion genotypes for the 34 An. gambiae and coluzzii samples from [39] were analyzed for association with geographic location. The homozygous inverted genotype was observed primarily in samples from Cameroon, while the homozygous standard genotype was observed in samples only from Burkina Faso and Mali. Association of the inversion genotypes with geographic location prevents correction for potential confounding effects for this data set.
Table 4.
Occurrences of 2La inversion genotypes by Anopheles species and data set.
The 2La inversion genotypes for the 34 An. gambiae and coluzzii samples from [39] and 150 An. gambiae and coluzzii samples from [17] were analyzed for association with species. The two papers do not agree on the definitions of the standard and inverted orientations. The homozygous standard inversion genotype was not observed in the 150 Burkina Faso samples but was dominant in the Burkina Faso samples from [39] (see Table 3). Likewise, the homozygous inverted genotype was not observed in the Burkina Faso samples from [39] but was dominant among the 150 Burkina Faso samples.
Fig 3.
Positive cases with a single species.
Analysis of chromosome arms with known major inversions in samples drawn from a single species (Drosophila 2L, 2R, and 3R). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–l, one Manhattan plot per PC).
Fig 4.
Positive cases with multiple species and/or populations.
Analysis of the 2L Anopheles chromosome arm with known major inversions in samples drawn from multiple species and/or locations (150 Anopheles from Burkina Faso, 81 Anopheles gambiae samples of the 150 Anopheles samples, and 34 Anopheles gambiae and coluzzii samples from four geographic locations). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–k, one Manhattan plot per PC).
Table 5.
We evaluated a single method (PCA with clustering) on the genotype inference task (which inversion genotype does a sample have?) using two benchmark test cases (positive from a single population and positive from multiple populations). Note that the two association-testing methods are not able to infer genotypes. For each chromosome arm used, we indicated known inversions, how many genotypes are present in the data set, and a measure of balanced accuracy calculated from the cluster predictions. The D. melanogaster 3R chromosome arm has three mutually-exclusive inversions, which we list separately.
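One way to compute balanced accuracy from cluster predictions is sketched below. Because cluster IDs are arbitrary, each cluster is first mapped to the most common known genotype among its members. That majority-vote mapping, the function names, and the toy labels are assumptions for illustration; the source does not state how clusters were matched to genotypes.

```python
from collections import Counter

def map_clusters_to_genotypes(cluster_labels, known_genotypes):
    """Assign each cluster the majority known genotype of its members."""
    mapping = {}
    for cluster in set(cluster_labels):
        members = [g for c, g in zip(cluster_labels, known_genotypes) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return [mapping[c] for c in cluster_labels]

def balanced_accuracy(known, predicted):
    """Mean of per-genotype recall, so rare genotypes count equally."""
    recalls = []
    for genotype in sorted(set(known)):
        idx = [i for i, g in enumerate(known) if g == genotype]
        correct = sum(predicted[i] == genotype for i in idx)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy example with hypothetical labels for eight samples.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
known = ["std/std", "std/std", "std/inv", "std/inv", "std/inv",
         "inv/inv", "inv/inv", "inv/inv"]
predicted = map_clusters_to_genotypes(clusters, known)
score = balanced_accuracy(known, predicted)
```

Here one heterozygote falls into the homozygous standard cluster, so the per-genotype recalls are 1, 2/3, and 1, and `score` is their mean, 8/9.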
Table 6.
We evaluated the two association-testing methods (PC-SNP and Cluster-SNP association tests) on the inversion localization task (what region is spanned by an inversion?) using two benchmark test cases (positive from a single population and positive from multiple populations). Note that the PCA scatter plot method is not able to localize inversions. For each chromosome arm used, we indicated known inversions, the expected ranges, and the ranges identified by each method. The D. melanogaster 3R chromosome arm has three mutually-exclusive inversions, which we list separately.
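A simple way to turn per-SNP p-values into a candidate range is sketched below. It reports the span between the first and last SNP passing a Bonferroni-corrected threshold. The threshold rule, the function name, and the toy values are assumptions; the ranges reported in the table may instead have been read from the Manhattan plots.

```python
def localize_region(positions, pvalues, alpha=0.05):
    """Return (start, end) spanning all SNPs significant after Bonferroni
    correction, or None if no SNP passes the threshold."""
    threshold = alpha / len(pvalues)
    hits = [pos for pos, p in zip(positions, pvalues) if p < threshold]
    if not hits:
        return None
    return min(hits), max(hits)

# Toy example: hypothetical SNP positions (bp) and association p-values.
positions = [100, 200, 300, 400, 500, 600]
pvalues = [0.4, 1e-9, 1e-12, 1e-8, 0.2, 0.7]
region = localize_region(positions, pvalues)
```

A single minimum and maximum merges everything on the arm into one range, so an arm with several inversions (such as the three on D. melanogaster 3R) would need the significant SNPs split into separate groups first.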
Fig 5.
Analysis of the 2R chromosome arm of the 150 Anopheles samples from Burkina Faso (all samples, 81 Anopheles gambiae samples, and 69 Anopheles coluzzii samples). (a–c) PCA of samples, clustered with k-means, and colored by cluster. Manhattan plots visualizing p-values from association tests against sample cluster IDs (d–f) and PC coordinates (g–k, one Manhattan plot per PC).
Table 7.
Summaries of inversion analysis tools.
Details of existing software tools that were either designed or can be applied to inversion analysis using SNP data are summarized.
Table 8.
We evaluated three methods (PCA with clustering, PC-SNP association testing, and Cluster-SNP association testing) on the inversion detection task (is an inversion present?) using our three benchmark test cases (negative, positive from a single population, and positive from multiple populations). For each chromosome arm used, we indicated known inversions and whether the inversion was detected by a given method. The D. melanogaster 3R chromosome arm has three mutually-exclusive inversions, which we list separately.