Fig 1.
An illustration of the division of a chromosome into regions according to the numbers of observed and unobserved variants for imputation.
Fig 2.
The overall model structure of the proposed method.
The line at the bottom of the figure represents a genome sequence in which observed variants are shown as green squares and unobserved variants as white squares. Forward and backward RNNs are built on the observed variants. Each observed variant v_i has an input feature vector that is fed to both the forward and backward RNNs. f_i is the vector obtained by concatenating the output of the forward RNN for observed variant v_i and the output of the backward RNN for observed variant v_{i+1}.
The allele of each unobserved variant u_i is represented by a binary variable.
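The concatenation described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the forward and backward RNN outputs are already available as plain lists of floats, one per observed variant, and the names `forward_out`, `backward_out`, and `concat_features` are hypothetical.

```python
def concat_features(forward_out, backward_out):
    """Build f_i by concatenating the forward output at v_i with the
    backward output at v_{i+1}, giving one f_i per adjacent pair."""
    if len(forward_out) != len(backward_out):
        raise ValueError("need one forward and one backward output per observed variant")
    # List addition concatenates the two output vectors.
    return [forward_out[i] + backward_out[i + 1] for i in range(len(forward_out) - 1)]


# Toy outputs for three observed variants, two hidden units each (made-up numbers).
forward_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward_out = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
features = concat_features(forward_out, backward_out)
```

With three observed variants the sketch yields two feature vectors, one for each gap between adjacent observed variants, which is where the unobserved variants lie.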
Fig 3.
The structure of the forward RNN for each observed variant for the case of four stacked RNN cells.
The forward RNN receives an input feature vector for each of the observed variants v_i and v_{i+1}. The state of the RNN cell of the jth layer for observed variant v_i is used as the input state for the RNN cell of the jth layer for observed variant v_{i+1}. The output of the RNN cell of the top layer, o_{i,4}, is taken as the output of the RNN for each observed variant.
Fig 4.
The structure of the forward RNN for each observed variant with residual connections for the case of four stacked RNN cells.
The forward RNN receives an input feature vector for each of the observed variants v_i and v_{i+1}. The state of the RNN cell of the jth layer for observed variant v_i is used as the input state for the RNN cell of the jth layer for observed variant v_{i+1}. Circles with + represent the addition of tensors for residual connections. The output of the RNN cell of the top layer, o_{i,4}, is taken as the output of the RNN for each observed variant.
Fig 5.
An illustration of the proposed data augmentation process, in which new haplotypes are generated by applying mutations and recombinations to the haplotypes in the reference panel.
Alleles outlined in bold are those of the observed variants, and mutations are applied only to the observed variants. Alleles in red are those mutated by the data augmentation process.
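The augmentation process in this caption can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes biallelic haplotypes encoded as lists of 0/1, a single crossover point per new haplotype, and recombination applied before mutation. The function names, the mutation rate, and the toy panel are all hypothetical.

```python
import random


def mutate(haplotype, observed_idx, rate, rng):
    """Flip alleles (0 <-> 1) at observed positions only, each with probability `rate`."""
    out = list(haplotype)
    for i in observed_idx:
        if rng.random() < rate:
            out[i] = 1 - out[i]
    return out


def recombine(hap_a, hap_b, cut):
    """Single-crossover recombination: alleles before `cut` from hap_a, the rest from hap_b."""
    return list(hap_a[:cut]) + list(hap_b[cut:])


def augment(panel, observed_idx, rate, rng):
    """Generate one new haplotype from two distinct reference haplotypes (assumed order:
    recombine first, then mutate the observed variants)."""
    hap_a, hap_b = rng.sample(panel, 2)
    cut = rng.randrange(1, len(hap_a))
    return mutate(recombine(hap_a, hap_b, cut), observed_idx, rate, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1]]  # toy reference panel
observed_idx = [0, 3, 5]  # positions of observed variants (made up)
new_hap = augment(panel, observed_idx, rate=0.1, rng=rng)
```

Restricting `mutate` to `observed_idx` mirrors the caption's statement that only observed variants are mutated, so the alleles at unobserved positions are always copied unchanged from a reference haplotype.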
Table 1.
Averaged R^2 values in the validation data for input feature vectors of size 5, 10, and 20.
Table 2.
Averaged R^2 values in the validation data for several settings.
Fig 6.
(a) Comparison of R^2 values for the proposed method with several settings. (b) Comparison of R^2 values for the proposed method with the hybrid model, the higher MAF model, and the lower MAF model, using GRU, 4 layers, and 40 hidden units. (c) Comparison of R^2 values for the proposed method with and without residual connections (RC), data augmentation (DA), and self-attention (SA), where “Basic hybrid model” is the hybrid model without RC, DA, and SA.
Fig 7.
(a) Comparison of R^2 values for the proposed method, ADDIT-M, Impute2, and Minimac3 for the 1KGP dataset. (b) The same comparison on a linear MAF scale, zoomed in on the higher R^2 range, for the 1KGP dataset.
Table 3.
Running time of the proposed method, ADDIT-M, Impute2, and Minimac3 for imputation using the 1KGP and HRC datasets.
Fig 8.
Comparison of R^2 values for the proposed method, ADDIT-M, Impute2, and Minimac3 for EAS individuals.
Fig 9.
(a) Comparison of R^2 values for the proposed method, ADDIT-M, Impute2, and Minimac3 for the HRC dataset. (b) The same comparison on a linear MAF scale, zoomed in on the higher R^2 range, for the HRC dataset.