Fig 1.
Overall study design and workflow.
A) The reference data sets and the genes sequenced and typed from each set; B) The reference data sets and their combinations for each imputation model (I–VII). The Venn diagram indicates the SNP content used by the models: model I was based only on the SNPs in the Finnish reference set, while model VII was based only on the 1000 Genomes SNPs. Models II–VI were based on the intersection of SNPs present in both reference sets. C) Evaluation of model performance in an independent test set and in out-of-bag (OOB) sets within the training and full data sets. D) Cross-validation of short-read whole exome sequencing (WES)-based allele calling in the 1000 Genomes reference against the clinical-grade typed Finnish reference. Created with BioRender.com.
Table 1.
Imputation models evaluated in the present study.
Table 2.
Numbers of individual samples with allele typing results per gene and reference data set.
Fig 2.
Overall imputation accuracies of different models in each population and gene.
The horizontal axis shows the Finnish and the 1000 Genomes superpopulations (FIN, Finnish; EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, Mixed American). The models trained on different reference compositions (I–VII) are shown on the vertical axis. HLA-G is excluded because its data are limited to the Finnish population.
Fig 3.
Confusion matrices summarizing the allelic accuracies of the best gene-specific models.
A) The combined 1000G and Finnish reference (model VI) for MICA, MICB, HLA-E and HLA-F. B) The Finnish reference (model I) for HLA-G, HLA-G 3’UTR and HLA-G 5’UTR.
Fig 4.
The relationship between allele frequency and accuracy, sensitivity, and specificity.
The relationship is shown for the Finnish (FIN) and 1000 Genomes superpopulations (EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, Mixed American) when the models were trained using the Finnish (model II), 1000 Genomes (model V) and combined 1000G and Finnish (model VI) references. Note that the y-axis scales differ between panels.
Table 3.
Overall imputation accuracy and model parameters for the HLA-G gene, HLA-G 3’UTR and HLA-G 5’UTR models.
Fig 5.
Evaluation of the quality of the 1000 Genomes whole exome short-read sequencing-based allele calling.
Model VII was trained using the 1000 Genomes reference and applied to the Finnish reference with clinical-grade typing quality. Confusion matrices for MICA, MICB, HLA-E and HLA-F show the alleles that are common to both references and the numbers of correctly and incorrectly predicted alleles. Empty lines represent alleles that are present in the model but absent from the Finnish reference and therefore could not be validated. Overall imputation accuracies were 99.6%, 99.8%, 100.0% and 99.8% for MICA, MICB, HLA-E and HLA-F, respectively.