Creating Artificial Human Genomes Using Generative Models

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation of this field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although these databases would provide a rich resource for data mining and integration towards advancing genetic studies. Here we demonstrate that we can train deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) to learn the high dimensional distributions of real genomic datasets and create artificial genomes (AGs). Additionally, we ensure little to no privacy loss while generating high quality AGs. To illustrate the promising outcomes of our method, we show that augmenting reference panels with AGs improves imputation quality for low frequency alleles. In summary, AGs have the potential to become valuable assets in genetic studies by providing high quality anonymous substitutes for private databases.

Introduction
Availability of genetic data has increased tremendously due to advances in sequencing technologies and reduced costs (Mardis 2017). The vast amount of human genetic data is used in a wide range of fields, from medicine to evolution. Despite the advances, cost is still a limiting factor and more data is always welcomed, especially in population genetics and genome-wide association studies (GWAS) which usually require substantial amounts of samples. Partially related to the costs but also to the research bias toward studying populations of European ancestry, many autochthonous

consists in learning the data distribution in a way such that the new instances created by the generator cannot be distinguished from true data by the discriminator. Since their first introduction, there have been many successful applications of GANs, ranging from generating high quality, realistic imagery to gap filling in texts (Ledig et

Reconstructing genome wide population structure:
Initially we created AGs with GAN, RBM, and two simple generative models for

After demonstrating that our models generated realistic AGs according to the described summary statistics, we investigated further whether they respected privacy by measuring the extent of overfitting. We calculated two metrics of resemblance and privacy, the nearest neighbour adversarial accuracy (AATS) and privacy loss, both presented in a recent study (Yale et al. 2019). The AATS score measures whether two datasets were generated by the same distribution, based on the distances between all data points and their nearest neighbours in each set. When applied to artificial and real datasets, a score between 0.5 and 1 indicates underfitting, between 0 and 0.5 overfitting (and likely privacy loss), and exactly 0.5 indicates that the datasets are indistinguishable. By using an additional real test set, it is also possible to calculate a privacy loss score that is
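As an illustration of the AATS score described above, the following is a minimal sketch rather than the implementation used in the study. It assumes binary haplotypes compared with the Hamming distance and follows our reading of the definition in Yale et al. (2019): for each point, check whether its nearest neighbour in the other set is farther away than its nearest neighbour in its own set (leave-one-out), then average the two resulting proportions. The function names and the toy data are hypothetical.

```python
def hamming(a, b):
    """Number of sites at which two equal-length 0/1 sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(row, others):
    """Distance from `row` to its nearest neighbour in `others`."""
    return min(hamming(row, o) for o in others)


def aats(real, synthetic):
    """Nearest neighbour adversarial accuracy between two lists of haplotypes."""
    # Leave-one-out nearest neighbour distance inside each set.
    d_rr = [nearest(r, real[:i] + real[i + 1:]) for i, r in enumerate(real)]
    d_ss = [nearest(s, synthetic[:i] + synthetic[i + 1:])
            for i, s in enumerate(synthetic)]
    # Nearest neighbour distance to the other set.
    d_rs = [nearest(r, synthetic) for r in real]
    d_sr = [nearest(s, real) for s in synthetic]
    # Proportion of points that are closer to their own set than to the other.
    aa_truth = sum(cross > within for cross, within in zip(d_rs, d_rr)) / len(real)
    aa_syn = sum(cross > within for cross, within in zip(d_sr, d_ss)) / len(synthetic)
    return 0.5 * (aa_truth + aa_syn)


# Toy data (hypothetical): 4 haplotypes of 20 sites, one derived allele each.
real = [[1 if j == i else 0 for j in range(20)] for i in range(4)]
copied = [row[:] for row in real]                  # exact copies of the real set
distant = [[1 - x for x in row] for row in real]   # far from every real haplotype

print(aats(real, copied))    # 0.0, the overfitting extreme
print(aats(real, distant))   # 1.0, the underfitting extreme
```

The two toy cases reproduce the extremes of the scale given in the text: exact copies of the training data score 0, and artificial samples far from every real sample score 1.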

Linking genotypes with phenotypes:
We then explored the possibility of creating AGs with unphased genotype data and recreating phenotype-genotype associations using generative models. As a proof of concept, we created GAN AGs via training on 1925 Estonian individuals with 5000 SNPs using unphased genotypes instead of haplotypes. There was an additional column in this dataset representing eye color (blue or brown). This region encompasses the rs12913832 SNP, which is highly associated with eye color (Han et al.  Table 1). We were not able to create RBM AGs for this dataset.
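One simple way to inspect whether a dataset with a phenotype column carries a genotype-phenotype signal is to compare the allele frequency at a candidate SNP between the phenotype classes. The sketch below is only an illustration under stated assumptions and is not the evaluation used in the study: it assumes genotypes are coded as dosages (0, 1 or 2), that the last column holds the phenotype with a hypothetical coding of 0 for brown and 1 for blue, and it uses made-up toy rows.

```python
def allele_freq_by_phenotype(rows, snp_index):
    """Allele frequency at one SNP, computed separately per phenotype class.

    Each row is assumed to hold genotype dosages (0, 1 or 2) followed by a
    final phenotype label (hypothetical coding: 0 = brown, 1 = blue).
    """
    totals = {}
    for row in rows:
        phenotype = row[-1]
        dosage_sum, n_individuals = totals.get(phenotype, (0, 0))
        totals[phenotype] = (dosage_sum + row[snp_index], n_individuals + 1)
    # Each diploid individual carries two alleles per SNP.
    return {p: s / (2 * n) for p, (s, n) in totals.items()}


# Toy rows (hypothetical): two SNP columns followed by the phenotype column.
rows = [
    [0, 2, 1],
    [1, 2, 1],
    [0, 1, 0],
    [2, 0, 0],
]

print(allele_freq_by_phenotype(rows, snp_index=1))  # {1: 1.0, 0: 0.25}
```

Running the same function on real and on artificial rows would allow the two frequency contrasts to be compared side by side.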

In this study, we applied generative models to produce artificial genomes and evaluated their characteristics. To the best of our knowledge, this is the first application of GAN and RBM models in this context, displaying overall promising applicability. We showed that population structure and frequency-based features of real populations can successfully be preserved in AGs created using GAN and RBM models. Furthermore, both models can be applied to sparse or dense SNP data given a large enough number of training individuals. Our different trials showed that the minimum required number of individuals for training is highly variable, possibly correlated with the diversity among individuals (data not shown). Since haplotype data is more informative, we created haplotypes for the analyses, but we also demonstrated that the models can be applied to genotype data, by simply combining two haplotypes if the training data is not phased (see Materials & Methods). In addition, we showed that it is possible to generate AGs with simple phenotypic traits through genotype data (see Results). Even though there were only two simple classes, blue and brown eye color phenotypes, generative models could be improved in the future to produce artificial datasets combining AGs with multiple phenotypes.
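The step of combining two haplotypes into an unphased genotype can be pictured with a short sketch. The exact encoding is described in Materials & Methods, which is not reproduced here, so the version below rests on an assumption: each haplotype is a 0/1 vector and the genotype is the per-site sum, giving dosages of 0, 1 or 2. The function name is hypothetical.

```python
def haplotypes_to_genotype(hap_a, hap_b):
    """Combine two 0/1 haplotypes into per-site dosages (0, 1 or 2).

    Assumption: 'combining' means summing the two alleles at each site,
    which discards the phase information.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return [a + b for a, b in zip(hap_a, hap_b)]


print(haplotypes_to_genotype([0, 1, 1, 0], [0, 1, 0, 1]))  # [0, 2, 1, 1]
```

Note that the two heterozygous sites in the example both map to a dosage of 1, which is exactly the phase information lost when moving from haplotypes to genotypes.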

One major drawback of the proposed models is that, due to computational limitations, they cannot yet be used to create whole artificial genomes but rather snippets or sequential dense chunks. Although parallel computing might be a solution, this might further disrupt the haplotype structure in AGs. Instead, adapting convolutional GANs for AG generation might be another possible solution in the future. Another problem arose due to rare alleles, especially for the GAN model. We showed that nearly half of the alleles become fixed in the GAN AGs in the 10K SNP dataset, whereas RBM AGs seem to capture more of the rare alleles present in real genomes

Figure 10) and AATS scores (Figure 3a), although this can be investigated further by integrating AATS scores within our models as a criterion for early stopping in training (before the networks start overfitting). In the context of the privacy issue, GAN AGs have a slight advantage since underfitting is preferable. More distant AGs would hypothetically be harder to trace back to the original genomes. We also tested the sensitivity of the AATS score and privacy loss (Supplementary Figure 14). It appears that both scores are affected very slightly when we add only a few real genomes from the training set to the AG dataset. Therefore, more sensitive measurement techniques should be developed in the future for better assessment of generated AGs. Additionally, even though we did not detect exact copies of real genomes in AG sets created by either RBM or GAN models, it is a very complicated task to determine if the generated samples can be traced back to the originals. Reliable measurements need to be developed in the future to assure complete anonymity of AGs with respect to their source.

Imputation results demonstrated promising outcomes, especially for population specific low frequency alleles.
However, imputation with reference panels integrating either RBM or GAN AGs showed a slight decrease in the info metric for higher frequency alleles compared to the 1000 Genomes panel alone (Figure 3c). We initially speculated that this might be related to the disturbance in haplotypic structure and therefore tried to filter AGs based on chunk counts from ChromoPainter results, preserving only AGs which are below the average chunk count of real genomes. The reason behind this was to preserve the most realistic AGs with undisturbed chunks. Even with this filtering, the slight decrease in higher MAF bins was still present (data not shown). Yet results of implementation with AGs for low frequency alleles and without AGs for high frequency ones could be combined to achieve best performance. In terms of imputation, future improved models can become practically very useful, particularly for GWAS, in which imputation is commonly applied to increase resolution. Different generative models such as MaskGAN (Fedus et al. 2018), which demonstrated good results in text gap filling, might also be adapted for genetic imputation. RBM is possibly another option to be used directly as an imputation tool, since once the weights have been learned, it is possible to fix a subset of the visible variables and to compute the average values of the unobserved ones by sampling the probability distribution (in fact, it is even easier than sampling entirely new configurations since the fixed subset of variables will accelerate the convergence of the sampling algorithm).

Although there are some current limitations, generative models will most likely become prominent for genetics in the near future with many promising applications. In this work, we demonstrated the first possible implementations and use of AGs in the forthcoming field which we would like to name artificial genomics.
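The idea of using an RBM directly for imputation, mentioned in the discussion of imputation above, can be sketched as clamped Gibbs sampling: the observed visible units stay fixed, the hidden units and the unobserved visible units are resampled in turn, and the sampled values of the unobserved units are averaged. This is a minimal sketch under stated assumptions, not the study's code: it assumes a binary RBM with an already learned weight matrix W and bias vectors a (visible) and b (hidden), and the tiny hand-set weights below are hypothetical and chosen only so that the three visible units tend to agree.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def impute(W, a, b, observed, n_steps=2000, burn_in=200, seed=0):
    """Average value of each unobserved visible unit under clamped Gibbs sampling.

    W[i][j] couples visible unit i and hidden unit j; a and b are the visible
    and hidden biases; `observed` maps visible indices to fixed 0/1 values.
    """
    rng = random.Random(seed)
    n_vis, n_hid = len(a), len(b)
    missing = [i for i in range(n_vis) if i not in observed]
    # Observed units are clamped; missing units start from a random state.
    v = [observed.get(i, rng.randint(0, 1)) for i in range(n_vis)]
    sums = {i: 0 for i in missing}
    for step in range(n_steps):
        # Sample every hidden unit given the current visible configuration.
        h = [int(rng.random() < sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(n_vis))))
             for j in range(n_hid)]
        # Resample only the unobserved visible units given the hidden units.
        for i in missing:
            p = sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(n_hid)))
            v[i] = int(rng.random() < p)
            if step >= burn_in:
                sums[i] += v[i]
    return {i: s / (n_steps - burn_in) for i, s in sums.items()}


# Hypothetical toy RBM: 3 visible units, 1 hidden unit that links them.
W = [[4.0], [4.0], [4.0]]
a = [-2.0, -2.0, -2.0]
b = [-6.0]

high = impute(W, a, b, observed={0: 1, 1: 1})[2]
low = impute(W, a, b, observed={0: 0, 1: 0})[2]
print(high > low)  # True: the missing site follows the clamped neighbours
```

In a genetic setting the visible units would correspond to alleles at SNP positions, with the typed positions clamped and the untyped ones estimated by their averages; how well this works on real data is not something this sketch can show.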