Fig 1.
Wasserstein GAN (WGAN) model for the 65,535-SNP dataset.
a) Representation of the generator, critic and residual blocks. Channel dimensions are not proportional and do not reflect the real implementation. The generator block has one trainable location-specific variable (blue) and two latent space vectors (red) concatenated to the input as additional channels. The critic block has one trainable location-specific variable (blue) concatenated to the input as an additional channel. b) Architecture of the WGAN model. White rectangles correspond to generic generator and critic blocks, whereas grey rectangles correspond to generic residual blocks. Numbers in parentheses above blocks show channels and length, respectively (C, L). Dotted connections are residual connections, where the input value is added to the output value of the block before it is passed to the next block. Yellow input and output blocks differ from the generic ones to allow proper dimension adjustment.
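The channel-concatenation and residual mechanics described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual implementation: the dimensions, the plain 1D convolution and all variable names are toy stand-ins chosen only to show how the location-specific channel, the latent channels and the dotted residual connection combine.

```python
import numpy as np

rng = np.random.default_rng(5)

C, L, Z = 16, 64, 2   # channels, sequence length, latent channels (toy values)

def conv1d(x, w):
    """Minimal 'same'-padded 1D convolution over a (channels, length) input."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    return np.stack([
        sum(np.convolve(xp[i], w[o, i][::-1], mode="valid") for i in range(c_in))
        for o in range(c_out)
    ])

# Generator-style block: concatenate one trainable location-specific channel and
# latent-space channels to the input, convolve, then add the residual connection
x = rng.normal(size=(C, L))
loc = rng.normal(size=(1, L))            # location-specific variable (one channel)
z = rng.normal(size=(Z, L))              # latent space vectors as extra channels
w = rng.normal(0, 0.1, (C, C + 1 + Z, 3))

out = conv1d(np.concatenate([x, loc, z]), w)
out = out + x                            # dotted connection: input added to output
```

The residual addition requires the block to preserve the (C, L) shape of its input, which is why the concatenated channels are consumed by the convolution rather than carried forward.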
Fig 2.
Illustration of the learning and sampling of a large sequence using a “classical” and a conditional RBM (CRBM).
Initially, we train RBM1 (left) and RBM2 (right) in parallel. Both RBMs are trained in essentially the same manner: random inputs are drawn and k MC steps are performed before computing the gradient and updating the weights via gradient descent. The difference for the CRBM (RBM2) is that half of the variables in the visible layer are pinned (crossed squares) to the real data during training, while the rest are generated conditionally on these pinned variables. After training both machines, we can sample a complete new sequence. To do so, we start from random input and perform k MC steps with RBM1 to generate the first part of the sequence (light yellow-red). Then, we use half of this generated sequence (light red) as the pinned visible variables of RBM2 (crossed squares) and initialise the rest as random input. We perform k MC steps on RBM2, keeping the pinned variables fixed, to generate the rest of the sequence (light blue). The letters next to the arrows show the order of this sampling procedure.
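The sampling procedure can be sketched as follows. This is a toy illustration, not the trained models from the paper: the weights are random stand-ins, units are binary with sigmoid conditionals, and all dimensions and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One MC (block Gibbs) step: sample hidden given visible, then visible given hidden."""
    h = (rng.random(c.shape) < sigmoid(v @ W + c)).astype(float)
    return (rng.random(b.shape) < sigmoid(h @ W.T + b)).astype(float)

def sample(W, b, c, k, rng, pinned=None, pinned_vals=None):
    """Run k MC steps from random input; re-impose pinned visible units at every step."""
    v = (rng.random(b.shape) < 0.5).astype(float)
    for _ in range(k):
        if pinned is not None:
            v[pinned] = pinned_vals
        v = gibbs_step(v, W, b, c, rng)
    if pinned is not None:
        v[pinned] = pinned_vals
    return v

# Toy dimensions: 8 visible units per RBM, 4 hidden; random weights as stand-ins
n_vis, n_hid, k = 8, 4, 50
W1, W2 = rng.normal(0, 0.1, (n_vis, n_hid)), rng.normal(0, 0.1, (n_vis, n_hid))
b1, b2 = np.zeros(n_vis), np.zeros(n_vis)
c1, c2 = np.zeros(n_hid), np.zeros(n_hid)

# (A) sample the first part of the sequence with RBM1
part1 = sample(W1, b1, c1, k, rng)
# (B) pin half of RBM2's visible layer to the overlapping half of part1
pinned = np.arange(n_vis // 2)
v2 = sample(W2, b2, c2, k, rng, pinned=pinned, pinned_vals=part1[n_vis // 2:])
# (C) the full sequence is part1 followed by the newly generated half from RBM2
full_seq = np.concatenate([part1, v2[n_vis // 2:]])
```

The key point is that the pinned visible units are clamped before every MC step, so RBM2 only resamples its free half conditionally on the overlap generated by RBM1.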
Fig 3.
Principal component and allele frequency analyses of artificial genomes with 10,000-SNP size.
a) Density plot of the PCA of the combined real and artificial genome datasets. Density increases from red to blue. b) Allele frequency correlation between the real (x-axis) and artificial (y-axis) genome datasets. The bottom panels zoom in on low-frequency alleles (from 0 to 0.2 overall frequency in the real dataset). Values presented inside the figures are Pearson's r, the ordinary least squares regression slope and the intercept.
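The allele frequency comparison in panel b can be reproduced along these lines. The sketch below uses simulated haplotype matrices as stand-ins for the real and artificial datasets (the "synthetic" set here is just a bootstrap of the real rows), and computes the same summary statistics reported in the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy haplotype matrices (individuals x SNPs) with 0/1 alleles; stand-ins for data
p = rng.beta(0.2, 0.2, 500)                       # per-SNP allele probabilities
real = (rng.random((200, 500)) < p).astype(int)
synthetic = real[rng.integers(0, 200, 200)]       # placeholder "generated" set

# Per-SNP alternate allele frequencies
f_real = real.mean(axis=0)
f_syn = synthetic.mean(axis=0)

# Pearson's r and ordinary least squares fit (slope, intercept), as in the figure
r = np.corrcoef(f_real, f_syn)[0, 1]
slope, intercept = np.polyfit(f_real, f_syn, 1)

# Zoomed view: restrict to low-frequency alleles (0 to 0.2 in the real dataset)
mask = f_real <= 0.2
r_low = np.corrcoef(f_real[mask], f_syn[mask])[0, 1]
```

A perfect generator would put all points on the identity line, i.e. r, slope and intercept near 1, 1 and 0 respectively; restricting to the low-frequency tail probes whether rare variation is reproduced.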
Fig 4.
Schematic representation of different problematic training outcomes for generative models.
Distances to the nearest neighbours are denoted by d_XY, where X and Y can each be T (truth/real) or S (synthetic/generated). a) An extreme case of underfitting (or an optimization issue) in which the nearest neighbours of real data points are real and the nearest neighbours of synthetic data points are synthetic, revealing two distinct clusters (AATS >> 0.5). b) An extreme case of overfitting in which the nearest neighbours of real data points are systematically synthetic and vice versa (AATS << 0.5). c) An extreme case of a specific type of generative aberration in which the nearest neighbours of both real and synthetic data points are synthetic (AAsyn >> 0.5 and AAtruth << 0.5). Hypothetically, this might occur when the generator produces new instances based on average information from a small collection of samples, causing low local variation. d) An extreme case of a specific type of generative aberration in which the nearest neighbours of both real and synthetic data points are real (AAsyn << 0.5 and AAtruth >> 0.5). This might be observed when the generative model learns the main modes in the real data but fails to mimic their densities, generating instances along the axes of the main modes with high variance.
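A minimal sketch of the nearest-neighbour adversarial accuracy computation, assuming Euclidean distances and equal-sized real and synthetic sets (the function name and toy data are illustrative, not the paper's code):

```python
import numpy as np

def adversarial_accuracy(truth, synth):
    """Nearest-neighbour adversarial accuracy (AATS) sketch.
    AA_truth: fraction of real samples whose nearest neighbour (self excluded)
    is another real sample; AA_syn: same for synthetic samples; AATS is their
    mean. ~0.5 is ideal; >>0.5 suggests underfitting, <<0.5 overfitting."""
    X = np.vstack([truth, synth])
    labels = np.array([0] * len(truth) + [1] * len(synth))  # 0 = T, 1 = S
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nn = d.argmin(axis=1)
    aa_truth = (labels[nn[labels == 0]] == 0).mean()
    aa_syn = (labels[nn[labels == 1]] == 1).mean()
    return aa_truth, aa_syn, (aa_truth + aa_syn) / 2

rng = np.random.default_rng(2)
truth = rng.random((100, 20))
synth = rng.random((100, 20))    # same distribution, so AATS should sit near 0.5
aa_t, aa_s, aats = adversarial_accuracy(truth, synth)

# Extreme overfitting from panel b: exact copies of the real data give AATS = 0
aa_t2, aa_s2, aats2 = adversarial_accuracy(truth, truth.copy())
```

The two calls bracket the behaviour described in the caption: indistinguishable distributions give AATS near 0.5, while memorised copies drive it to 0.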
Fig 5.
Principal component, allele frequency and linkage disequilibrium (LD) analyses of artificial genomes with 65,535-SNP size.
a) Density plot of the PCA of combined real genomes and artificial genomes generated by the WGAN and the CRBM. Density increases from red to blue. b) Allele frequency correlation between the real and artificial genome datasets. The bottom panels zoom in on low-frequency alleles (from 0 to 0.2 overall frequency in the real dataset). Values presented inside the figures are Pearson's r, the ordinary least squares regression slope and the intercept. The dashed black line is the identity line. c) LD decay approximation for real (grey), WGAN-generated (blue) and CRBM-generated (red) genomes (see Materials and methods for details). d) Nearest neighbour adversarial accuracy (AATS) of artificial genomes generated by different models for the 65,535-SNP dataset. Values below 0.5 (black line) indicate overfitting and values above indicate underfitting.
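An LD decay curve like the one in panel c can be approximated by binning pairwise r² values by physical distance. The sketch below is a simplified stand-in for the procedure referenced in Materials and methods: the haplotypes are simulated with artificial local correlation, the positions are hypothetical, and the binning scheme is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy haplotypes (individuals x SNPs) with local correlation to mimic LD
n_ind, n_snp = 300, 100
base = rng.random((n_ind, n_snp))
hap = (np.cumsum(base - 0.5, axis=1) > 0).astype(int)  # neighbours correlated

positions = np.arange(n_snp) * 1000   # hypothetical physical positions (bp)

# Pairwise LD as the squared Pearson correlation (r^2) between SNP pairs
dists, r2s = [], []
for i in range(n_snp):
    for j in range(i + 1, n_snp):
        if hap[:, i].std() == 0 or hap[:, j].std() == 0:
            continue                  # skip monomorphic sites
        r = np.corrcoef(hap[:, i], hap[:, j])[0, 1]
        dists.append(positions[j] - positions[i])
        r2s.append(r * r)

# Approximate the decay curve by averaging r^2 within distance bins
bins = np.digitize(dists, np.linspace(0, n_snp * 1000, 10))
decay = [np.mean([r2 for r2, b in zip(r2s, bins) if b == k]) for k in np.unique(bins)]
```

Plotting `decay` against bin midpoints gives the characteristic downward curve; comparing the real and generated curves tests whether short- and long-range correlations are reproduced.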
Table 1.
Nearest neighbour chain analysis.
Frequencies of series of generated (synthetic, S) and real (truth, T) samples in chains of nearest neighbours (from size 2 to 5). To avoid loops, a sample was removed once it was reached in the chain. The expected frequency of a given series is 0.25 for chains of size 2, 0.125 for size 3, 0.0625 for size 4 and 0.03125 for size 5.
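The chain construction can be sketched as follows. This is an illustrative reimplementation on toy data, not the analysis code: under the null hypothesis that labels are exchangeable, each specific size-k label series has expected frequency 0.5**k, matching the values quoted above.

```python
import numpy as np

def label_chain(d, labels, start, length):
    """Follow nearest neighbours from `start`; a sample is removed (distance set
    to inf) once reached, to avoid loops. Returns the label series, e.g. 'TS'."""
    d = d.copy()
    visited, cur = [start], start
    for _ in range(length - 1):
        d[:, cur] = np.inf            # remove the current sample from candidates
        cur = int(d[cur].argmin())
        visited.append(cur)
    return "".join(labels[i] for i in visited)

rng = np.random.default_rng(4)
X = rng.random((200, 10))                 # toy combined dataset
labels = ["T"] * 100 + ["S"] * 100        # first half truth, second half synthetic

d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

# Frequency of each size-2 series over chains started from every sample
chains = [label_chain(d, labels, s, 2) for s in range(len(X))]
freqs = {p: chains.count(p) / len(chains) for p in ("TT", "TS", "ST", "SS")}
```

Deviations of the observed frequencies from 0.5**k (e.g. an excess of all-S chains) flag the kinds of aberration sketched in Fig 4.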