Fig 1.
Overview of genotype vs. summary statistics imputation.
From genotype data (top-left, G) we can calculate summary statistics (top-right, SS). Summary statistics for an unmeasured/masked SNV can be obtained in two ways: we can impute genotype data (bottom-left, G-GTimp) using genotype imputation and then calculate summary statistics via linear regression (bottom-middle, SS-GTimp), or we can apply summary statistics imputation to the summary statistics calculated from genotype data (bottom-right, SS-SSimp). For the purpose of our analysis, we only look at genotyped (and genotype imputed) SNVs, masking one focal SNV at a time and imputing it using summary statistics from neighbouring SNVs. We can then compare the three summary statistics calculated for a particular focal SNV in Figs 4, 5 and S11–S14.
Fig 2.
Overview of imputation and replication scheme.
This illustration gives an overview of how we used > 2M GIANT HapMap summary statistics (black rectangle) as tag SNVs to impute > 10M variants with MAF ≥ 0.1% in UK10K. After adjusting the summary statistics for conditional analysis, we applied a selection process that resulted in 35 candidate loci. To confirm these 35 loci, we used summary statistics from UK Biobank (blue) as replication, as well as summary statistics from the exome chip study [13], if available (red). Loci that had not been discovered by the exome chip study were termed novel.
Fig 3.
Accounting for variable sample size.
Effect of missingness on the accuracy of imputation of standardised effects, evaluated via simulations where the true effect is known. The y-axis is the MSE (on log-scale) between the true standardised effect and the conventional estimate which ignores missingness (Eq (1), grey), our estimate D(dep) (Eq (10), green), and our estimate D(ind) (Eq (11), blue). The x-axis is the ‘missingness-correlation’ (θmiss), where a value of 1 means the number of individuals in the samples had maximum overlap with each other, and 0 means they were simulated independently, leading to smaller overlap. Each boxplot shows the MSEs across the 40 regions simulated. In the top row, the N’s (simulated sample sizes) are selected randomly from a study of T2D [31], with sample sizes varying between 13 and 110′219 individuals. The bottom row is based on HDL [30], with sample sizes ranging between 50′000 and 187′167 individuals. All sample sizes are scaled to the range 0 to 12′500, as this is the size of the simulated GWAS.
Fig 4.
Summary statistics imputation versus genotype imputation for associated variants.
The x-axis shows the Z-statistics of the genotype data (ground truth), while the y-axis shows the Z-statistics from summary statistics imputation (green) or genotype imputation (blue). Results are grouped according to MAF (columns) and imputation quality (rows) categories, and the number in the top-right of each window refers to the number of SNVs represented. The identity line is indicated with a dotted line. The estimates of correlation and slope are noted in the bottom-right corner for summary statistics imputation and in the top-left corner for genotype imputation. Blue dots are plotted over the green ones. S11 and S13 Figs provide scatterplots with the imputation quality of summary statistics imputation and genotype imputation as colours.
Fig 5.
Summary statistics imputation versus genotype imputation for null variants.
The x-axis shows the Z-statistics of the genotype data (ground truth), while the y-axis shows the Z-statistics from summary statistics imputation (green) or genotype imputation (blue). Results are grouped according to MAF (columns) and imputation quality (rows) categories, and the number in the top-right of each window refers to the number of SNVs represented. The identity line is indicated with a dotted line. The estimates of correlation and slope are noted in the bottom-right corner for summary statistics imputation and in the top-left corner for genotype imputation. Blue dots are plotted over the green ones. S12 and S14 Figs provide scatterplots with the imputation quality of summary statistics imputation and genotype imputation as colours.
Fig 6.
Visualising RMSE of summary statistics imputation and genotype imputation.
This figure uses boxplots to compare, for each variant, the absolute difference |d| (used for the calculation of RMSE) between the Z-statistics from the genotype data and those from summary statistics imputation (SSimp, green) or genotype imputation (GTimp, blue), for associated SNVs (left column) and null SNVs (right column). Results are grouped according to MAF (x-axis) and imputation quality (rows) categories. The number printed above each boxplot represents the number of SNVs used for the |d| calculation in that MAF and imputation quality subgroup. The corresponding RMSE is shown in Table 1.
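The relation between the per-variant |d| and the RMSE can be sketched as follows. This is a minimal illustration, assuming that d is the difference between an imputed Z-statistic and the ground-truth Z-statistic of the same variant; the function names and the toy values are hypothetical and are not taken from the study.

```python
import math


def abs_differences(z_imputed, z_truth):
    """Per-variant absolute difference |d| between imputed and ground-truth Z-statistics."""
    if len(z_imputed) != len(z_truth):
        raise ValueError("both inputs must contain one Z-statistic per variant")
    return [abs(zi - zt) for zi, zt in zip(z_imputed, z_truth)]


def rmse(z_imputed, z_truth):
    """Root mean squared error: square root of the mean of the squared differences."""
    d = abs_differences(z_imputed, z_truth)
    return math.sqrt(sum(x * x for x in d) / len(d))


# Hypothetical toy values for four variants in one MAF / imputation quality subgroup.
z_truth = [1.0, -2.0, 0.5, 3.0]
z_ssimp = [0.8, -1.5, 0.5, 2.0]

print(abs_differences(z_ssimp, z_truth))  # one |d| per variant, as summarised by a boxplot
print(rmse(z_ssimp, z_truth))             # one RMSE per subgroup, as reported in a table
```

In such a sketch the boxplot summarises the distribution of |d| within a subgroup, whereas the RMSE condenses the same differences into a single number per subgroup.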
Table 1.
RMSE for summary statistics imputation and genotype imputation.
Fig 7.
This figure compares the false positive rate (FPR) (x-axis) versus the power (y-axis) for genotype imputation (blue) and summary statistics imputation (green) for different significance thresholds (α), including a 95%-confidence interval in both directions (vertically as a ribbon and horizontally as lines). The vertical, dashed line represents FPR = 0.05. Results are grouped according to MAF (columns) and imputation quality (rows) categories. A zoom into the area of FPR between 0 and 0.1 can be found in S5 Fig.
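A curve of this kind can be traced by varying α and recording the fraction of variants declared significant in each group. The sketch below assumes that each variant is already labelled as null or associated and that significance is assessed with a two-sided P-value derived from the imputed Z-statistic; the function names and toy values are hypothetical, and the confidence intervals are omitted.

```python
import math


def two_sided_p(z):
    """Two-sided P-value of a Z-statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(z_values, alpha):
    """Fraction of variants whose P-value falls below the threshold alpha."""
    return sum(two_sided_p(z) < alpha for z in z_values) / len(z_values)


def fpr_power_curve(z_null, z_assoc, alphas):
    """Return (alpha, FPR, power) for each threshold.

    FPR is the rejection rate among null variants,
    power is the rejection rate among associated variants.
    """
    return [(a, rejection_rate(z_null, a), rejection_rate(z_assoc, a)) for a in alphas]


# Hypothetical imputed Z-statistics for null and associated variants.
z_null = [0.1, -0.5, 2.2, 1.0]
z_assoc = [5.0, 3.0, 1.5, -4.2]

for alpha, fpr, power in fpr_power_curve(z_null, z_assoc, [0.05, 1e-3]):
    print(alpha, fpr, power)
```

Plotting FPR against power over a grid of α values, separately for each imputation method, gives one curve per method.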
Table 2.
Twenty replicating candidate loci for height.
Table 3.
GTEx annotation results for variants in eQTLs.
Table 4.
Known trait association results for variants in Table 2.
Fig 8.
rs28929474 is a low-frequency (MAF = 2.3%) missense variant on chromosome 14 in the gene SERPINA1, with imputed summary statistics PSSimp = 1.06 × 10−13 and replication in the UK Biobank with PUKBB = 6.49 × 10−78. rs112635299 has the strongest signal in this region (P = 4.21 × 10−14), but is highly correlated with rs28929474 (LD = 0.95). This figure shows three datasets: results from the HapMap study, results from the exome chip study, and imputed summary statistics. The top window shows HapMap P-values as orange circles and the imputed P-values (using summary statistics imputation) as solid circles, with the colour representing the imputation quality (only shown). The bottom window shows exome chip study results as solid, grey dots. Each dot represents the summary statistics of one variant. The x-axis shows the position (in Mb) on a ≥ 2 Mb range and the y-axis the −log10(P)-value. The horizontal lines show the P-value thresholds of 10−6 (dotted) and 10−8 (dashed). The top and bottom windows have annotated summary statistics. In the bottom window, we mark dots as black if they are part of the 122 reported hits of [13]. In the top window, we mark the rs-id of variants in bold black if they are part of the 122 reported variants of [13], and in bold orange if they are part of the 697 variants of [12]. Variants in plain black are imputed variants (that had the lowest conditional P-value). Variants in plain orange are HapMap variants that were not among the 697 reported hits. Each of the annotated variants is marked for clarity with a bold circle in the respective colour. The genes annotated in the middle window are printed in grey if the gene has a length < 5′000 bp or is an unrecognised gene (RP-).
Table 5.
111 variants: Fraction of top variants in exome chip study retrieved with imputation of HapMap study.