Deep convolutional and conditional neural networks for large-scale genomic data generation
Fig 5
Principal component, allele frequency and linkage disequilibrium (LD) analyses of 65,535-SNP artificial genomes.
a) Density plot of the PCA of real genomes combined with artificial genomes generated by WGAN and CRBM. Density increases from red to blue. b) Allele frequency correlation between real and artificial genome datasets. The bottom panels zoom in on low-frequency alleles (overall frequency from 0 to 0.2 in the real dataset). Values shown inside the panels are Pearson’s r and the slope and intercept of an ordinary least squares regression. The dashed black line is the identity line. c) LD decay approximation for real (grey), WGAN-generated (blue) and CRBM-generated (red) genomes (see Materials and methods for details). d) Nearest neighbour adversarial accuracy (AATS) of artificial genomes generated by different models for the 65,535-SNP dataset. Values below 0.5 (black line) indicate overfitting and values above 0.5 indicate underfitting.
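The summary statistics in panels b and d can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes haplotypes are coded as 0/1 matrices (rows are genomes, columns are SNPs), uses Hamming distance for the nearest-neighbour search, and follows the commonly used AATS definition (the average of the "truth" and "synthetic" nearest-neighbour accuracies). The function names `allele_freq_stats`, `pairwise_hamming` and `aats` are hypothetical.

```python
import numpy as np


def allele_freq_stats(real, synth):
    """Pearson's r, OLS slope and OLS intercept of synthetic vs real allele frequencies.

    Assumes both inputs are 0/1 matrices with the same SNP columns.
    """
    f_real = np.mean(real, axis=0)   # per-SNP frequency in the real dataset
    f_synth = np.mean(synth, axis=0)  # per-SNP frequency in the artificial dataset
    r = np.corrcoef(f_real, f_synth)[0, 1]
    slope, intercept = np.polyfit(f_real, f_synth, 1)  # degree-1 least squares fit
    return r, slope, intercept


def pairwise_hamming(a, b):
    """Number of mismatching sites between every row of a and every row of b."""
    a = np.asarray(a, dtype=np.int32)
    b = np.asarray(b, dtype=np.int32)
    # For 0/1 data a mismatch is (a=1, b=0) or (a=0, b=1).
    return a @ (1 - b).T + (1 - a) @ b.T


def aats(real, synth):
    """Nearest neighbour adversarial accuracy (assumed definition, see lead-in).

    Returns (AATS, AA_truth, AA_syn).
    """
    d_tt = pairwise_hamming(real, real).astype(float)
    d_ss = pairwise_hamming(synth, synth).astype(float)
    d_ts = pairwise_hamming(real, synth).astype(float)
    # A genome must not count as its own nearest neighbour.
    np.fill_diagonal(d_tt, np.inf)
    np.fill_diagonal(d_ss, np.inf)
    # Real genome i is "correctly classified" if its nearest real neighbour
    # is closer than its nearest artificial neighbour, and vice versa.
    aa_truth = np.mean(d_ts.min(axis=1) > d_tt.min(axis=1))
    aa_syn = np.mean(d_ts.min(axis=0) > d_ss.min(axis=1))
    return 0.5 * (aa_truth + aa_syn), aa_truth, aa_syn


if __name__ == "__main__":
    # Toy data, not from the paper: four haplotypes with four SNPs each.
    real = np.array([[0, 0, 1, 1],
                     [0, 1, 1, 1],
                     [0, 0, 0, 1],
                     [0, 1, 1, 1]])
    # An exact copy of the real data is the extreme overfitting case.
    score, aa_truth, aa_syn = aats(real, real.copy())
    print(score)  # 0.0, well below the 0.5 reference line
```

For 0/1 data the Euclidean distance is the square root of the Hamming distance, so either metric selects the same nearest neighbours. The dense distance matrices used here are only practical for small sample sizes; a larger dataset would need a chunked or tree-based neighbour search.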