The impact of different negative training data on regulatory sequence predictions

Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. Training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, we compare both learners and two approaches for generating negative training data. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.

Introduction

Regulatory sequences play an important role in the control of transcription initiation. Variants in regulatory elements can lead to changes in gene expression patterns and are associated with various diseases [1–3]. Deciphering the encoding of regulatory activity in genomic sequences is an important goal, and an improved understanding will inevitably contribute to a better interpretation of personal genomes and phenotypes. While available approaches for measuring changes in regulatory sequence activity in a native genomic context are still very limited in their throughput [4], machine learning methods can be applied to predict regulatory activity directly from DNA sequence and to reveal enriched sequence patterns and arrangements [5].

There is a strong link between transcription factors (TFs) binding to regulatory elements and general DNA accessibility, i.e. open chromatin. While the screening of individual TFs is tedious and restricted by the availability of appropriate antibodies, chromatin accessibility can be measured genome-wide with multiple assays (e.g. DNase-seq, ATAC-seq or NOMe-seq). DNase I hypersensitive site sequencing (DNase-seq) provides a gold standard for the detection of chromatin accessibility [6] and is widely used by the ENCODE Consortium as a sensitive and precise reference measure for mapping regulatory elements [7,8]. It allows the detection of active regulatory elements, marked by DNase I hypersensitive sites (DHS), across the whole genome [9,10].

Machine learning approaches identify regulatory elements among other coding or non-coding DNA based on structured patterns in their sequences. Many of these patterns can be matched to known transcription factor binding sites (TFBSs) [11,12] and their relative orientation and positioning. TFs are known to have different binding affinities to DNA sequences and to bind preferentially to a specific set of short nucleotide sequences named binding motifs [11]. Further, TFs can have preferences for a three-dimensional structure of the DNA [12]. While DNA structure can be predicted from the local sequence context, the same DNA shape can be encoded by different nucleotide sequences. There are probably additional patterns, but GC-related sequence features are commonly identified as predictors of regulatory activity and can affect nucleosome occupancy due to the differential DNA binding affinity of histone molecules [13].

Gapped k-mer support vector machines (gkm-SVMs) [14–16] and convolutional neural networks (CNNs) [17–19] have recently been applied in multiple studies to either predict regulatory activity/function or to identify key elements of the activity-to-sequence encoding. While DHS datasets serve as positive training data for these machine learning algorithms, the ideal composition of the negative training dataset is still an unsolved question. There are two commonly used approaches for the generation of negative training data: the selection of sequences from genomic background […]

[…] Replicates were merged into one file per experiment, combining overlapping (minimum of 1 bp) or adjacent sequences into a single spanning sequence. For cell lines A549 and MCF-7, two pooled DHS datasets exist (S1 Table); we refer to those as experiments A and B. DHS regions were defined as 300 bp around the center of the narrow peaks, and reference genome sequences were used from GRCh38 (patch release 7, GRCh38.p7). Sequences located on alternative haplotypes, on unlocalized genomic contigs, or containing non-ATCG bases were excluded. An overview of the used DNase-seq datasets is presented in S1 Table.

[…] found for at least 80% of the samples in each dataset, the batch size and maximum number of trials were increased (batchsize=10000, nMaxTrials=100). The tolerance for differences in repeat ratio and relative sequence length was set to 0, but the tolerance for differences in GC content was varied for different training datasets (tGC = {0.02, 0.05, 0.1}).

To generate neutral DNA sequences for the negative training dataset, positive sequences were shuffled while preserving k-mer counts. Here, k-mer shuffled datasets were generated using fasta_ushuffle (https://github.com/agordon/fasta_ushuffle, accessed 02/26/2020), a FASTA file format wrapper around uShuffle [24].
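The defining property of such shuffled negatives is that the template's k-mer spectrum is preserved exactly. As a sanity check, this can be verified directly; the sketch below is a minimal illustration in plain Python (independent of fasta_ushuffle, with a naive random permutation standing in for a real k-let-preserving shuffler):

```python
from collections import Counter
import random

def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def preserves_kmers(template: str, shuffled: str, k: int) -> bool:
    """True if the shuffle kept the template's k-mer spectrum intact."""
    return kmer_counts(template, k) == kmer_counts(shuffled, k)

# A plain permutation is a mononucleotide (k=1) shuffle: it preserves
# base composition only...
random.seed(0)
template = "ACGTACGTGGCCAATT"
mono = "".join(random.sample(template, len(template)))
assert preserves_kmers(template, mono, 1)
# ...but will generally break the dinucleotide (k=2) spectrum, which is
# why dedicated tools such as uShuffle are needed for k >= 2.
```

Checking candidate negatives this way also makes the trade-off explicit: the larger the preserved k, the closer the shuffle stays to the natural sequence, including its TFBS content.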
The parameter k, which indicates the size of the preserved k-mers, was varied for different datasets (k = [1,7]). […]

[…] interface [27]. The Adam optimizer [28] was used with default parameters as previously suggested [29]. In addition to the default parameters for batch size (200) and learning rate (0.001), a different parameter set was examined (batch size = 2000, learning rate = 0.0002). For both architectures, the higher batch size and lower learning rate were chosen based on accuracy and standard deviation on the validation set (chromosome 21 hold-out, regulatory activity task). Models were trained over 20 epochs, showing convergence of the estimated loss on the validation sets and no signs of overfitting (see S1 Figure and S2 Figure). Network training was repeated 10 times using different seeds. For regulatory activity and tissue-specific activity prediction, one out of the 10 models was chosen for further analysis based on median model performance (chromosome 21 hold-out).

[…] experiments. To select these models, we compared their performance on a hold-out set of active DHS regions on chromosome 21 (validation set). Since we did not observe relevant effects for the parameters of the genomic background set (S3 Figure and S4 Figure), we chose the most stringent parameter (tGC=0.02). In contrast, when comparing models trained on shuffled sequences, model performance depended on the size k of the preserved k-mers (S5 Figure and S6 Figure), with small k resulting in better performance and high k falling behind the genomic background sets. We note that the value of k is anticorrelated with the number of known transcription factor binding site (TFBS) motifs remaining in the negative training sequences (S7 Figure), which suggests that models may identify positive samples based on TFBS frequency.
While models with k=1 show the best results, we chose k=2, as shuffled sequences preserving dinucleotide composition are widely used [20]. Selected models were then compared across classifiers on a second chromosome hold-out dataset (chromosome 8, test set). In accordance with previous studies, CNN and gkm-SVM classifiers are both able to predict active DHS regions from the hold-out sets with high recall and AUPRC values (S8 Figure). We do not see a clear difference between the two CNNs tested. However, models trained on highly shuffled data perform significantly better than models trained on genomic background data; potentially the result of an improper evaluation on varying compositions of the validation sets using different negative data.

To explore further how models were influenced by the negative sets, we analyzed 8-mers in the different test data set classes as well as 8-mers prioritized in the first convolutional layer of our CNN models. We compared these 8-mers based on genomic frequency across all human autosomes. We observe that 8-mers in the genomic background negative sets are on average more frequent than 8-mers from DHS sites (positive sets), and those are more frequent than 8-mers from shuffled negative sequences (S10A Figure). While effects are more subtle, similar effects propagate into 8-mers identified in the first convolutional layers (S10B Figure), with models trained on genomic background sequences learning to identify more common 8-mers (Wilcoxon rank tests, p < 2.2e-16). Consequently, rare motifs in shuffled negative sequences are learned by these models and may negatively impact model performance.

For the A549 and MCF-7 cell lines, with two available DHS sets from ENCODE, two separate models were trained and their performance on the test sets compared among all cell lines. We see that performance generalizes well across diverse cell lines (e.g.
breast, cervix, lung and liver cancer and leukemia), suggesting that organismal rather than tissue-specific active regulatory regions are predicted. As an example, Table 1 […]

To further assess the models' capability to predict tissue-specific regulatory activity, we used datasets containing tissue-specific DHS sequences for further testing. We selected DHS sequences only active in the training cell line (positive samples) and DHS regions not active in this cell line but active in at least one of the other cell lines (negative samples). Again, we first tested the parameter choice on a validation set (chromosome 21 hold-out). We notice that performance is considerably reduced compared to the first task and see big differences in model performance across the different training cell lines (S5 and S6 Tables). Since HeLa-S3 models performed best, we focused on this cell line. While models trained using genomic background showed similar performance independent of the GC content tolerance (S11 Figure), performance was dependent on k for shuffled sequences. Model performance tends to increase with larger preserved k-mers in shuffled sequences (S12 Figure). For the genomic background set, we again chose the most stringent parameter (tGC=0.02), and for shuffled sequences k=7, based on precision recall. This high value of k preserves a number of TFBS motifs (46±2 motifs per 300 bp) similar to the positive set (47±2 motifs per 300 bp, S7 Figure), suggesting that the presence of tissue-specific factors as well as their relative positioning may be most critical for model performance.

[…] 2conv2norm (Fig. 2A/2B), 4conv2pool4norm (Fig. 2C/2D) and gkm-SVM (Fig. 2E/2F) models. Predicting tissue-specific regulatory activity, the performance of the models is low, but models trained on genomic background data generally perform better than models trained on shuffled sequences (e.g. AUROC differences of 6.7/6.8% for the two different CNN architectures). […]

Since enhancer activity was measured in HepG2 cells, we first applied our models trained on HepG2 DHS data. In contrast to earlier results, model performance differs across models trained using different GC content matching of the genomic background datasets. Models trained on sequences that varied most from the positive sequences in their GC content performed best (S13 Figure). Therefore, this less stringent matching parameter was considered here. Next, the shuffling parameter k was evaluated on enhancer activity prediction for the HepG2 models. Here, the extremes, i.e. models trained on highly shuffled sequences (k=1) or models with low shuffling (k=7), performed worse across the different model types (S14 Figure). Best performance is achieved for k={3,4} for gkm-SVM, while for the CNN architectures k={5,3} perform best. Based on these results, the parameter k=3 was chosen.
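The tissue-specific sample selection described above is essentially set logic over per-cell-line DHS calls; a minimal sketch, with made-up region identifiers standing in for real DHS coordinates:

```python
def tissue_specific_split(dhs_by_cell, train_cell):
    """Positives: DHS active only in the training cell line.
    Negatives: DHS inactive there but active in >=1 other cell line."""
    others = set().union(*(v for c, v in dhs_by_cell.items() if c != train_cell))
    train = dhs_by_cell[train_cell]
    positives = train - others
    negatives = others - train
    return positives, negatives

# Toy example with hypothetical region IDs:
dhs = {
    "HeLa-S3": {"r1", "r2", "r3"},
    "K562":    {"r2", "r4"},
    "HepG2":   {"r4", "r5"},
}
pos, neg = tissue_specific_split(dhs, "HeLa-S3")
# pos == {"r1", "r3"}; neg == {"r4", "r5"}
```

Note that regions active in several cell lines (here "r2") fall into neither class, which is exactly why this task probes tissue-specific rather than organismal activity.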

Our HepG2 models did not achieve the Spearman's ρ of 0.28 reported before [23] (see Fig 3). Therefore, other cell-type models were also tested, and the A549, HeLa-S3 and K562 models achieved or exceeded the reference performance (Fig 3 includes HepG2 and K562; for further cell types see S15 Figure). Compared to the others, the HepG2 training set is smaller (123k sequences, compared to 281k for HeLa-S3, 222k for K562 and 192k for A549, S1 Table). To investigate whether the size of the training dataset influences model performance, new models were trained on datasets of varying size (50k to 600k) by sampling sequences from all cell lines (see Methods). We note that sampling across cell lines dilutes the tissue-specific signal and we expect that correlation with experimental readouts might be reduced.
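Building training sets of varying size by pooling across cell lines can be sketched as follows (a simplified illustration; deduplication of regions shared between cell lines and construction of the matched negative sets are omitted):

```python
import random

def pooled_training_set(seqs_by_cell, size, seed=0):
    """Draw `size` positive training sequences pooled across all cell
    lines; as noted in the text, this dilutes tissue-specific signal."""
    pool = [s for seqs in seqs_by_cell.values() for s in seqs]
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    return rng.sample(pool, min(size, len(pool)))

# Hypothetical toy data: two cell lines with five sequences each.
data = {"cellA": [f"a{i}" for i in range(5)],
        "cellB": [f"b{i}" for i in range(5)]}
subset = pooled_training_set(data, 6)
```

Repeating the draw with different seeds (as done for the 10 CNN training runs) gives an estimate of how much performance varies with the sampled training set.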

Again, we evaluated the correlation of prediction scores and activity readouts. Results are presented in Fig 4. Model performance of the gkm-SVM classifier seems very stable across training set sizes and repeated training runs, but due to runtime we did not test more than 350,000 positive training examples. Using genomic background sequences clearly outperformed shuffled sequences. For CNNs, the more complex architecture (4conv2pool4norm) outperformed 2conv2norm on both negative sets. To achieve or exceed the gkm-SVM performance, 4conv2pool4norm required larger training datasets (6–7x more data). Looking across 10 trained CNN models per data set, we see considerable variance in model performance, suggesting high stochasticity in training, likely originating from non-optimal parameters (e.g. batch size, learning rate, convergence). Gkm-SVM (0.29) and 4conv2pool4norm models (0.30) both exceeded the reference Spearman's ρ value (0.28, Fig 4).

We found that CNN models and gkm-SVM models are equally suited for active DHS prediction. While similar in performance, CNN models showed larger variance across training runs, and the smaller 2conv2norm network architecture reduced performance on genomic background sets. These results and those of the k-mer shuffled negative sets suggest that models primarily learn representation differences of short motifs. We note that we selected all shuffles to minimize the 8-mer overlap with the positive sequence template, i.e. sequences that disrupt the overall motif positioning. We could also show that k-mer size is correlated with the number of known TFBS motifs found in the negative training sequences and that shuffled sequences have a higher proportion of rare genomic 8-mers than DHS sequences and genomic background sequences. We suggest that learning rare motifs is the reason why model performance for active DHS prediction seems highest when using highly shuffled sequences (k={1..3}) as negative training data, but drops considerably when applying models to validation sets using genomic background negatives. Independent of that effect, genomic background sequences also outperformed shuffles with k higher than 4 for active DHS prediction.
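The rare-8-mer comparison referred to here reduces to looking up each 8-mer's genome-wide frequency and averaging per sequence set; a minimal sketch (the genome-wide frequency table and sequence sets are hypothetical stand-ins for the autosome-wide counts used in the analysis):

```python
from statistics import mean

def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mean_genomic_freq(seqs, genome_freq, k=8):
    """Average genome-wide frequency of the k-mers occurring in seqs.
    k-mers absent from the table count as frequency 0, i.e. 'rare'."""
    occurrences = [genome_freq.get(m, 0) for s in seqs for m in kmers(s, k)]
    return mean(occurrences)

# With a real autosome-wide 8-mer table, the ordering reported in the
# text (S10A Figure) would be:
#   mean_genomic_freq(background, freq) > mean_genomic_freq(positives, freq)
#                                       > mean_genomic_freq(shuffled, freq)
```

The same lookup applied to 8-mers prioritized by the first convolutional layer gives the model-side comparison (S10B Figure).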

Since shuffled sequences are artificial and lack biological constraints, models based on this kind of negative set may learn differential sequence motif representations that correspond to genuine TFBS motifs (both active or inactive in the specific cell-type) as well as differential motif representation due to other biological constraints (e.g. underrepresentation of CpG dinucleotides). While the density of binding sites was previously shown to be predictive of regulatory activity [23,38], quantitative and tissue-specific predictions require the models to learn motifs directly related to sequence activity (e.g. active TFBSs in a certain cell-type). Consequently, for the two tasks of tissue-specific activity and quantitative activity prediction, genomic background sequences always perform better than sequence shuffles. In line with these observations, models trained on shuffles with longer preserved k-mers perform better on these tasks, while still falling behind models using the genomic background. We conclude that with background genomic sequences as negative training data, model training tends to ignore patterns present in natural DNA sequences and is able to focus on more subtle differences in binding site representation.

These patterns are consistent across gkm-SVM and CNN models. On the "complex" tasks, gkm-SVM models outperformed the CNN models in our setup. While we do not see a clear difference between CNN architectures for tissue-specific DHS regions, in the quantitative enhancer activity predictions the more complex 4conv2pool4norm architecture performs considerably better. For biologically meaningful results, appropriate training datasets are always required, and we showed on this last task that training set sizes for CNNs need to be much larger to reach gkm-SVM model performance. The amount of training data is also just one parameter that influences CNN model performance, and there are many other network and training hyperparameters that can be tuned.

The quantitative predictions also revealed an issue with the commonly used software package for drawing background sequences from the genome. While in the first two tests the GC matching parameter did not seem to make a difference, a larger deviation in GC matching provided a performance increase in quantitative enhancer activity prediction. Concurrently, the HepG2 enhancer activity readouts show a positive correlation of GC content with enhancer activity (Spearman ρ of 0.24 with the MaxGC feature in the previous publication [23]). We therefore looked more rigorously at the GC matching and noticed that even for the most stringent setting, high GC-content DHS regions are not sufficiently matched with genomic background sequences (S16 Figure). This causes the models to learn sequence GC content as predictive of regulatory activity rather than specific sequence patterns. We need to highlight a necessary balance in sequence matching attempts, though. […] Matching should be improved to closely reproduce the continuous GC density distribution of the positive set rather than a mean and standard deviation. Further, for both types of negative sets, it is only assumed that the sequences are regulatorily inactive. For the shuffles this assumption is based on the artificial nature of the sequences; for the background it is based on the excluded overlap with active sequences. While this might generally argue for semi-supervised learning approaches, comprehensive positive sets may somewhat alleviate the issue for genomic background sets.
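A quick diagnostic for the GC-matching issue described above is to compare the GC-content distributions of the positive and background sets bin by bin rather than by mean and standard deviation; a minimal sketch (hypothetical sequence lists stand in for real DHS and background sets):

```python
def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def gc_histogram(seqs, bins=10):
    """Histogram of GC content over [0, 1] in equal-width bins."""
    hist = [0] * bins
    for seq in seqs:
        idx = min(int(gc_content(seq) * bins), bins - 1)
        hist[idx] += 1
    return hist

# A well-matched negative set should reproduce the positive set's full
# GC histogram; comparing the upper bins of the two histograms exposes
# the undermatching of high-GC DHS regions (cf. S16 Figure).
```

Matching on the full histogram (e.g. drawing background sequences per GC bin) would directly implement the distribution-level matching argued for in the text.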

Comparing two different machine learning approaches, we show that gkm-SVMs give very robust and good results, while CNN performance could be improved by larger training datasets. This is in line with gkm-SVMs being the simpler machine learning approach (despite being slower in their current implementation), and we see this as a cautionary reminder to keep models simple, especially if training data is limited. Apart from the negative training data analyzed here, network architecture and training parameters of CNNs should be explored and optimized in future work. The parameter space of CNNs is immense and remains largely underexplored. Further, multi-task CNN implementations show improved performance [18,40], potentially also due to the effective increase in training data. However, to focus our analysis on the effects of the negative set and to keep comparisons to gkm-SVMs possible, we did not include these here.

To conclude, this study provided relevant insights into how regulatory activity is encoded in DNA sequence, such as highlighting the importance of short sequence motifs, and yielded important lessons for training machine learning models. We show that negative training data is of high importance for model performance and that the best results are obtained when using sufficiently large and well-matched genomic background datasets. Comparing different learners, we see that gkm-SVMs are very robust and provide good overall performance. While CNNs have the potential to outperform these simpler models, they require careful attention to the selection of adequate architectures and hyperparameter optimization. While not a focus of this work, models may be further interpreted with respect to the sequence features they learned [41,42], in order to shed more light upon the sequence encoding of gene regulation.

Acknowledgements

We thank current and previous members of the Kircher group for helpful discussions and suggestions. Specifically, we would also like to acknowledge input from Giorgio Valentini and his lab at Università degli Studi di Milano, as well as Dirk Walther at the University of Potsdam. This work was supported by the Berlin Institute of Health and Charité – Universitätsmedizin Berlin. The funder had no involvement in study design; in the collection, analysis and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.