A genotype imputation method for de-identified haplotype reference information by using recurrent neural network

doi:10.1371/journal.pcbi.1008207

Fig 1.

An illustration of the division of a chromosome into regions according to the numbers of observed and unobserved variants for imputation.

Fig 2.

The overall model structure of the proposed method.

The line at the bottom of the figure indicates a genome sequence in which observed variants are shown as green squares and unobserved variants as white squares. Forward and backward RNNs are built on the observed variants. Each observed variant v_i has an input feature vector that is fed to both the forward and backward RNNs. f_i is the vector obtained by concatenating the output of the forward RNN for observed variant v_i with the output of the backward RNN for observed variant v_{i+1}. A binary variable indicates the allele for each unobserved variant u_i.
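
The concatenation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single-layer tanh cell, the random weights, the sizes, and the helper names `rnn_pass` and `random_matrix` are all assumptions made for the example.

```python
import math
import random

def rnn_pass(inputs, w_in, w_rec, reverse=False):
    """Run a single-layer tanh RNN (a stand-in cell) and return one output per position."""
    hidden = len(w_rec)
    state = [0.0] * hidden
    order = range(len(inputs) - 1, -1, -1) if reverse else range(len(inputs))
    outputs = [None] * len(inputs)
    for pos in order:
        x = inputs[pos]
        state = [
            math.tanh(
                sum(w_in[h][k] * x[k] for k in range(len(x)))
                + sum(w_rec[h][k] * state[k] for k in range(hidden))
            )
            for h in range(hidden)
        ]
        outputs[pos] = state
    return outputs

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the toy cell."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_observed, feature_size, hidden = 5, 3, 4  # illustrative sizes only
features = [[rng.random() for _ in range(feature_size)] for _ in range(n_observed)]

# Forward RNN reads v_1 .. v_n; backward RNN reads v_n .. v_1.
fwd = rnn_pass(features, random_matrix(hidden, feature_size, rng),
               random_matrix(hidden, hidden, rng))
bwd = rnn_pass(features, random_matrix(hidden, feature_size, rng),
               random_matrix(hidden, hidden, rng), reverse=True)

# f_i joins the forward output at v_i with the backward output at v_{i+1}.
f = [fwd[i] + bwd[i + 1] for i in range(n_observed - 1)]
```

Each f_i therefore summarizes the observed variants on both sides of the gap between v_i and v_{i+1}, which is where the unobserved variants to be imputed lie.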

Fig 3.

The structure of the forward RNN for each observed variant for the case of four stacked RNN cells.

The inputs are the feature vectors for observed variants v_i and v_{i+1}. The state of the RNN cell in the jth layer for observed variant v_i is passed as the input state to the jth-layer RNN cell for observed variant v_{i+1}. The output of the top-layer RNN cell, o_i,4, is taken as the output of the RNN for each observed variant.

Fig 4.

The structure of the forward RNN for each observed variant with residual connections for the case of four stacked RNN cells.

The inputs are the feature vectors for observed variants v_i and v_{i+1}. The state of the RNN cell in the jth layer for observed variant v_i is passed as the input state to the jth-layer RNN cell for observed variant v_{i+1}. Circles with + represent the addition of tensors for residual connections. The output of the top-layer RNN cell, o_i,4, is taken as the output of the RNN for each observed variant.
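
One way to picture a stack of four cells with residual connections is sketched below. The caption does not specify exactly where the additions are placed, so this example assumes that each layer's output is added to that layer's input, and that the input feature size equals the hidden size so the addition is well defined. The element-wise `cell` and the scalar weights are toy stand-ins, not the cells used in the paper.

```python
import math

def cell(x, state, w):
    """Toy recurrent cell (assumption): new state = tanh(w * (x + state)), element-wise."""
    new_state = [math.tanh(w * (a + b)) for a, b in zip(x, state)]
    return new_state, new_state  # (output, next state)

def stacked_step(x, states, weights, residual=True):
    """Advance every layer by one observed variant and return the top-layer output."""
    layer_input = x
    next_states = []
    for state, w in zip(states, weights):
        out, new_state = cell(layer_input, state, w)
        next_states.append(new_state)
        if residual:
            # Residual connection: add the layer's input to its output.
            out = [o + i for o, i in zip(out, layer_input)]
        layer_input = out
    return layer_input, next_states

hidden, n_layers = 3, 4
weights = [0.5, 0.4, 0.3, 0.2]                      # illustrative values
states = [[0.0] * hidden for _ in range(n_layers)]  # one state per layer
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]      # features for v_i and v_{i+1}

outputs = []
for x in sequence:
    # The state of layer j at v_i becomes the input state of layer j at v_{i+1}.
    top, states = stacked_step(x, states, weights)
    outputs.append(top)
```

With the additions in place, the input can reach the top layer unchanged even when a layer contributes nothing, which is the usual motivation for residual connections in deep stacks.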

Fig 5.

An illustration of the proposed data augmentation process, in which new haplotypes are generated by applying mutations and recombinations to the haplotypes in the reference panel.

Alleles surrounded by bold lines are those for the observed variants; mutations are applied only to the observed variants. Alleles in red are those mutated by the data augmentation process.
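
The augmentation idea can be sketched as below. This is an illustration under stated assumptions rather than the procedure used in the paper: alleles are coded as 0/1, recombination uses a single random crossover between two reference haplotypes, and `mutation_rate` is a hypothetical parameter.

```python
import random

def augment(panel, observed_idx, mutation_rate, rng):
    """Return one synthetic haplotype built from two reference haplotypes."""
    parent_a, parent_b = rng.sample(panel, 2)
    # Recombination: copy parent_a up to a random crossover point, then parent_b.
    crossover = rng.randrange(1, len(parent_a))
    child = parent_a[:crossover] + parent_b[crossover:]
    # Mutation: flip alleles at observed variants only; unobserved ones stay untouched.
    for idx in observed_idx:
        if rng.random() < mutation_rate:
            child[idx] = 1 - child[idx]
    return child

rng = random.Random(1)
panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
]
observed_idx = [0, 2, 5]  # positions of observed variants (illustrative)
new_haplotype = augment(panel, observed_idx, mutation_rate=0.1, rng=rng)
```

Restricting mutations to the observed positions perturbs only the model inputs, while the alleles at unobserved variants, which serve as prediction targets, still come directly from reference haplotypes.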

Table 1.

Averaged R² values on the validation data for input feature vectors of size 5, 10, and 20.

Table 2.

Averaged R² values on the validation data for several settings.

Fig 6.

(a) Comparison of R² values for the proposed method with several settings. (b) Comparison of R² values for the proposed method with the hybrid model, the higher MAF model, and the lower MAF model, using the setting of GRU, 4 layers, and 40 hidden units. (c) Comparison of R² values for the proposed method with and without residual connections (RC), data augmentation (DA), and self-attention (SA), where “Basic hybrid model” is the hybrid model without RC, DA, and SA.

Fig 7.

(a) Comparison of R² values for the proposed method, ADDIT-M, Impute2, and Minimac3 for the 1KGP dataset. (b) The same comparison on a linear MAF scale, zoomed in on the higher R² values, for the 1KGP dataset.

Table 3.

Running time of the proposed method, ADDIT-M, Impute2, and Minimac3 for imputation using the 1KGP and HRC datasets.

Fig 8.

Comparison of R² values for the proposed method, ADDIT-M, Impute2, and Minimac3 for EAS individuals.

Fig 9.

(a) Comparison of R² values for the proposed method, ADDIT-M, Impute2, and Minimac3 for the HRC dataset. (b) The same comparison on a linear MAF scale, zoomed in on the higher R² values, for the HRC dataset.
