Table 1.
A brief history of phasing and imputation tools.
Fig 1.
Pre-processing of the HD genotype chips, reference panels and WGS EBB data.
Pre-processing of the HD genotype chips, reference panels and WGS EBB data downloaded from the International Genome Sample Resource (IGSR) and the Estonian Biobank (Estonian Genome Center), respectively. Steps highlighted in orange are specific to the 1000GPphase3 reference panel; steps highlighted in red are specific to the EBB data; and steps highlighted in cyan (light blue) are specific to the Affymetrix and Omni chips and isolate only the portion of the dataset used for analysis. All other steps were performed for both reference panels and all datasets.
Fig 2.
Workflow of the analysis, combinations tested.
The Affymetrix, Omni, Customized, and EBB input chip datasets were analyzed using the 36 combinations of 3 phasing software, 2 phasing approaches, 3 imputation software, and 2 imputation reference panels. The EBB input chip dataset was analyzed using the 36 combinations of 3 phasing software, 2 reference panels, 2 phasing approaches, and 3 imputation software.
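The count of 36 follows from the full cross product 3 × 2 × 3 × 2. A minimal sketch of that enumeration is given here; the tool and panel names are taken from the other captions in this list (Fig 1, Fig 4, Fig 9, Fig 12), and pairing them in this way is an assumption for illustration, not the authors' pipeline code.

```python
from itertools import product

# Names collected from the captions; their use as the exact test grid is an assumption.
phasing_tools = ["Beagle5.4", "ShapeIT4", "Eagle2.4.1"]
phasing_approaches = ["reference-free", "reference-based"]
imputation_tools = ["Beagle5.4", "Impute5", "Minimac4"]
reference_panels = ["1000GPphase3", "1000GP-30x"]

# Every (phasing tool, phasing approach, imputation tool, reference panel) tuple.
combinations = list(
    product(phasing_tools, phasing_approaches, imputation_tools, reference_panels)
)

print(len(combinations))  # 3 * 2 * 3 * 2 = 36
```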
Fig 3.
Number of shared variants between datasets.
Variants in common between the different chips on chromosome 20.
Fig 4.
Precision and recall evaluation of the phasing software Beagle5.4, ShapeIT4, and Eagle2.4.1.
Precision and recall were evaluated using 540 trio children in the 1000GP-30x reference panel. Trios were selected and phased with the trioPhaser software to ensure the highest accuracy, and the children were then used as ground truth for the comparison. ShapeIT4 (pink dot) achieved the highest scores, followed by Eagle2.4.1 and then Beagle5.4.
Table 2.
Reference-free and reference-based phasing accuracy based on 502,377 variants.
Fig 5.
CPU run time and memory usage of phasing software in the trios dataset.
Average run time (5A) and average memory usage (5B) for phasing in the trios data.
Fig 6.
CPU run time and memory usage of phasing software using the Omni, Affymetrix, and Customized chips.
Average run time (6A) and average memory usage (6B) for phasing in the chip data.
Fig 7.
CPU run time and memory usage of phasing software in the EBB chip dataset.
Average run time (7A) and average memory usage (7B) for phasing in the EBB chip data.
Fig 8.
CPU run time and memory usage of phasing software in the EBB WGS dataset.
ShapeIT CPU time and memory usage increase with the number of variants and individuals in the input data. Panels 8A-8B show the reference-free approach, while panels 8C-8D show the reference-based approach.
Table 3.
MAF-stratified comparison of imputation software for EBB data.
Fig 9.
Imputation performance for chromosome 20 using EBB data from 2280 individuals, with 2 reference panels and 2 phasing approaches.
Blue shades indicate Beagle5.4, violet shades indicate Impute5, and orange shades indicate Minimac4.
Fig 10.
Evaluation of rare variants imputation.
Violin plot of IQS against minor allele frequency (MAF) in the EBB dataset.
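For readers unfamiliar with IQS, the sketch below assumes the commonly used definition of the imputation quality score, a chance-corrected concordance of the form IQS = (Po − Pc) / (1 − Pc), where Po is the observed concordance and Pc the agreement expected by chance. Whether the study computed it in exactly this way is an assumption; the function name and the example table are hypothetical.

```python
def concordance_and_iqs(table):
    """Return (Po, IQS) for a square table of genotype counts.

    table[i][j] is the number of samples whose true genotype is i and whose
    imputed genotype is j (for example 0, 1 or 2 copies of the minor allele).
    Assumes the chance-corrected definition IQS = (Po - Pc) / (1 - Pc).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed concordance: fraction of samples on the diagonal.
    po = sum(table[i][i] for i in range(k)) / n
    # Chance agreement from the row (true) and column (imputed) totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    pc = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    iqs = (po - pc) / (1 - pc) if pc < 1 else float("nan")
    return po, iqs


# Hypothetical rare variant: 2 true heterozygotes, both imputed as homozygous reference.
rare = [[98, 0, 0],
        [2, 0, 0],
        [0, 0, 0]]
po, iqs = concordance_and_iqs(rare)
print(po, iqs)
```

In this made-up example the concordance is 0.98 while the IQS is 0, which illustrates why a chance-corrected score is informative at low MAF: calling every sample homozygous for the major allele already yields high raw concordance.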
Fig 11.
Minor allele frequency (MAF) stratification of imputed variants.
Dots are clustered by minor allele frequency stratum. Dots clustered in the lower-right corner of the figure have low IQS and a high error rate, while dots in the upper-left corner have high IQS and a low error rate. Each dot represents the average IQS and error rate for a specific marker imputed with one phasing tool-imputation tool combination.
Fig 12.
Imputation concordance rate over four different features.
Stacked density plot of accuracy stratified by (A) sex; (B) superpopulation; (C) chip data; (D) phasing type (reference-free and reference-based).
Table 4.
Accuracy for different superpopulations for the Affymetrix, Omni, and Customized chips.
Accuracy as measured by concordance (Po) of the imputation results for each of the five main super populations.
Fig 13.
Cluster map of target population against 54 software-reference panel-dataset combinations.
This figure depicts the concordance results for the reference-free and reference-based phasing approaches for each of these combinations. Higher-density chips, a reference-based phasing approach, and populations without African ancestry obtained better imputation accuracy as measured by concordance.
Fig 14.
CPU run time and memory usage of imputation software for the Affymetrix, Omni, and Customized datasets.
Average run time (A) and average memory usage (B) of the imputation tools in the chip datasets.
Fig 15.
CPU run time and memory usage of imputation software for the EBB chip data.
Average run time (A) and average memory usage (B) of the imputation tools in the EBB chip data.
Fig 16.
CPU run time of imputation and phasing combinations tested.
Average run time for each of the 9 phasing and imputation software combinations. (A) Run time comparison of each combination in the Affymetrix, Omni, and Customized datasets. (B) Run time comparison of each combination in the EBB dataset.