Deep learning reveals evolutionary conservation and divergence of sequence properties underlying gene regulatory enhancers across mammals

In mammals, genomic regions with enhancer activity turnover rapidly; in contrast, gene expression patterns and transcription factor binding preferences are largely conserved. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. We tested this hypothesis by quantifying the conservation of sequence patterns underlying histone-mark defined enhancers across six diverse mammals in two machine learning frameworks. We first trained support vector machine (SVM) classifiers based on the frequency spectrum of short DNA sequence patterns. These classifiers accurately identified many adult liver, developing limb, and developing brain enhancers in each species. Then, we applied these classifiers across species and found that classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. This indicates that the short sequence patterns predictive of enhancers are largely conserved. We also observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. To test the conservation of more complex sequences patterns, we trained convolutional neural networks (CNNs) on enhancer sequences in each species. The CNNs demonstrated better performance overall, but worse cross-species generalization than SVMs, suggesting the importance of combinatorial interactions between motifs, but less conservation of these more complex sequence patterns. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Furthermore, short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution, with evolutionary change in more complex sequence patterns.

sequence patterns underlying histone-mark defined enhancers across six diverse mammals in two machine 23 learning frameworks. We first trained support vector machine (SVM) classifiers based on the frequency 24 spectrum of short DNA sequence patterns. These classifiers accurately identified many adult liver, 25 developing limb, and developing brain enhancers in each species. Then, we applied these classifiers 26 across species and found that classifiers trained in one species and tested in another performed nearly as 27 well as classifiers trained and tested on the same species. This indicates that the short sequence patterns 28 predictive of enhancers are largely conserved. We also observed similar cross-species conservation when 29 applying the models to human and mouse enhancers validated in transgenic assays. The sequence patterns 30 most predictive of enhancers in each species matched the binding motifs for a common set of TFs 31 enriched for expression in relevant tissues, supporting the biological relevance of the learned features. To 32 test the conservation of more complex sequences patterns, we trained convolutional neural networks 33 (CNNs) on enhancer sequences in each species. The CNNs demonstrated better performance overall, but 34 worse cross-species generalization than SVMs, suggesting the importance of combinatorial interactions 35 between motifs, but less conservation of these more complex sequence patterns. Thus, despite the rapid 36 change of active enhancer locations between mammals, cross-species enhancer prediction is often 37 possible. Furthermore, short sequence patterns encoding enhancer activity have been maintained across 38 more than 180 million years of mammalian evolution, with evolutionary change in more complex 39 sequence patterns. 40 41 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  Author summary 42 Alterations in gene expression levels are a driving force of both speciation and complex 43 disease; therefore, it is of great importance to understand the mechanisms underlying the evolution and 44 function gene regulatory DNA sequences. Recent studies have revealed that while gene expression 45 patterns and transcription factor binding preferences are broadly conserved across diverse animals, there 46 is extensive turnover in distal gene regulatory regions, called enhancers, between closely related species.

47
We investigate this seeming incongruence by analyzing genome-wide enhancer datasets from six diverse 48 mammalian species. We trained two machine-learning classifiers-a k-mer spectrum support vector 49 machine (SVM) and convolutional neural network (CNN)-to distinguish enhancers from the genomic 50 background. The k-mer spectrum SVM models the occurrences of short sequence patterns while the CNN 51 models both the short sequences patterns and their combinatorial patterns. Both the SVM and CNN 52 enhancer prediction models trained in one species are able to predict enhancers in the same cellular 53 context in other species. However, CNNs performed better at predicting enhancers in each species, but 54 they generalize less well across species than the SVMs. This argues that the short sequence properties 55 encoding regulatory activity are remarkably conserved across more than 180 million years of mammalian 56 evolution with more evolutionary turnover in the more complex combinations of the conserved short 57 sequence motifs.

65
Enhancers are genomic regions distal to promoters that bind transcription factors (TFs) to regulate the 66 dynamic spatiotemporal patterns of gene expression required for proper differentiation and development 67 of multi-cellular organisms [1,2]. It is critical to understand the mechanisms underlying enhancer 68 evolution and function, as alterations in their activity influence both speciation and disease [3][4][5]. Recent 69 genome-wide profiling of TF occupancy and histone modifications associated with enhancer activity 70 revealed that the regulatory landscape changes dramatically between species-both enhancer activity and 71 TF occupancy at orthologous regions distal to promoters are extremely variable across closely related 72 mammals [6][7][8][9][10][11][12]. However, the gene regulatory circuits [13] and expression of orthologous genes in 73 similar tissues are largely conserved across mammals [14][15][16]. Much of the gene regulatory machinery is 74 also conserved; TFs and the short DNA motifs they bind are highly similar between human, mouse, and 75 fly [17][18][19][20]. In short, there is considerable change in the enhancer activity of orthologous regions across 76 mammals, despite the relative conservation of gene expression and TF binding preferences.

77
The rapid turnover in enhancer activity between orthologous regions in different species has 78 largely been attributed to differences in the DNA sequences of the elements involved, rather than 79 differences in the broader nuclear context [21][22][23][24][25]. Genome-wide profiles of TF binding have shown that 80 60-85% of binding differences in human, mouse, and dog for the TFs CEBPα and HNF4α can be 81 explained by genetic variation that disrupts their binding motifs [23]. Genetic differences are also often 82 responsible for differential enhancer activity between more closely related species; for example, variation 83 in TF motifs at orthologous enhancers was predictive of activity differences between human and chimp 84 neural crest enhancers [25]. This suggests that, while there is turnover at orthologous sequences, sequence 85 properties predictive of enhancer activity may still be conserved.

86
Until recently, investigation of the conservation of enhancer sequence properties across 87 mammalian evolution has been hampered by a lack of known enhancers across diverse species within the 88 same cellular context. The canonical definition of enhancer activity is the ability to drive expression in 89 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  5 transgenic reporter assays [1,26], which cannot currently be scaled to assess regulatory potential genome-90 wide. However, high-throughput assays such as ChIP-seq can assess histone modifications associated 91 with enhancer activity [27,28] to identify putative enhancers genome-wide in many tissues and species 92 [12,29]. Using known enhancers, machine learning approaches have learned their sequence properties and 93 successfully distinguished enhancers active in specific cellular contexts from both the genomic 94 background and enhancers active in other tissues [30][31][32][33][34][35][36][37][38][39]. Moreover, some of these studies suggested the 95 potential for cross-species enhancer prediction. For instance, the similarity of co-occurrence of sequence 96 patterns can be used to identify orthologous enhancers in distantly related Drosophila species [40]. 97 Furthermore, annotated cis-regulatory modules (CRMs) in Drosophila can predict CRMs in highly 98 diverged insect species based on binding site composition similarity [41]. However, TF binding sites 99 appear to evolve and turnover much more rapidly between closely related mammals than Drosophila 100 species [10,42]. In mammals, a machine learning model trained with mouse enhancers accurately predict 101 orthologous regions of the human genome [31]; however, the rapid turnover of enhancer activity between 102 human and mouse suggests that the majority of these orthologous regions are not human enhancers [12].

103
These previous studies suggest the potential for evolutionary conservation of sequence properties of 104 mammalian enhancers, but comprehensive genome-wide quantification of the degree and dynamics of 105 this conservation is needed.

106
In this study, we investigate the degree of regulatory sequence property conservation by applying 107 machine learning classifiers to genome-wide enhancer datasets across diverse mammals. We first confirm 108 that Support Vector Machine (SVM) classifiers trained using short DNA sequence patterns can accurately 109 identify many enhancers genome-wide in the adult liver [12] , developing limb and developing brain [29]. 110 Then, by using classifiers trained in one species to predict enhancers in the others, we demonstrate that 111 enhancer sequence properties are conserved across species, even though the enhancer activity of specific 112 loci is not. We establish the robustness of this conservation to different enhancer identification techniques 113 by showing that classifiers trained using high-confidence human and mouse enhancer sequences validated 114 in transgenic assays also generalize across species, and are similar to classifiers trained on histone-115 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  6 modification-defined enhancers. Furthermore, the short DNA patterns most predictive of enhancer 116 activity in each species matched a common set of binding motifs for TFs enriched for expression in 117 relevant tissues. This suggests the patterns learned by classifiers capture biologically relevant sequences 118 that influence TF binding. In addition to SVM classifiers, we also trained convolutional neural networks 119 (CNNs) on enhancers in each species. The multilayer structures of CNNs can learn predictive short DNA 120 motifs as well as combinations of motifs at different levels of complexity, and therefore are promising for 121 modeling the complex interactions between [43][44][45]. The CNNs predicted enhancers with 122 higher accuracy than k-mer SVM models, but the CNNs generalized less well across species, suggesting 123 less conservation of more complex sequence patterns. Together, our results argue that, though there is 124 rapid change of active gene regulatory sequences between mammalian species, the short sequence 125 patterns of the enhancer regions encoding regulatory activity have been conserved over 180 million years 126 of mammalian evolution. Furthermore, the combinatorial rules combining these short sequence patterns 127 may be more divergent between species. Our findings also suggest avenues for improved enhancer 128 identification within and between species and establish a framework for future exploration of the 129 conservation and divergence of regulatory sequence properties between species.

132
Enhancers can be predicted from short DNA sequence patterns in mammals 133 Genome-wide enhancer activity across many mammalian species was recently assayed by profiling 134 enhancer-associated histone modifications in the adult liver [12], developing limb [8] and developing 135 brain [46]. Certain chemical modifications to histones, such as acetylation of lysine 27 of histone H3 136 (H3K27ac) and lack of trimethylation of lysine 4 of H3 (H3K4me3), are significantly associated with 137 active enhancers. Determining the genomic locations of these modifications via ChIP-seq provides a 138 genome-wide proxy for the active enhancer landscape [27,28]. For brevity, we refer to genomic regions 139 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  7 with enhancer-associated histone modification combinations identified in these previous studies as 140 "enhancers." 141 For each species and tissue, we evaluated how well short DNA sequence patterns identified 142 enhancers. We quantified DNA sequence patterns present in each genomic region by computing its k-mer 143 spectrum-the observed frequencies of all possible nucleotide substrings of length k. We then trained 144 SVM classifiers on the k-mer spectra to distinguish enhancers from random genomic regions matched to 145 the enhancers on various attributes, such as length, GC-content, and repeat-content, as appropriate. We 146 trained and evaluated our classifiers on both unbalanced and balanced positive and negative sets (see 147 CNN results and Methods). The unbalanced set contains ten times as many negative non-enhancer regions 148 as enhancers to reflect the fact that most of the genome does not have enhancer activity; in this test, we 149 trained with different misclassification costs for positives and negatives (Methods). We report the 150 unbalanced results in this section and the balanced results in the comparisons with CNN models below.

151
We used ten-fold cross validation to evaluate classifier performance. We quantified performance by 152 computing the average area under receiver operating characteristic (auROC) and precision-recall (auPR) 153 curves over the ten cross-validation runs (Figure 1; Methods).

154
We first evaluated the ability of classifiers trained on 5-mer spectra to identify liver enhancers in 155 six representative mammals: human, macaque, mouse, cow, dog and opossum. These species were 156 selected as representatives, since they each come from a different clade and have high-quality genome  Figure S1a). Next, we trained 5-mer spectrum SVM classifiers to predict enhancers active in 160 limb and brain for human, macaque, and mouse. Again, classifiers accurately distinguished enhancers 161 from the background with even stronger performance than the liver classifiers. The limb classifiers 162 achieved auROCs of ~0.89 in each species (Figure 2b; PR curves in Figure S1b), and the brain classifiers 163 had auROCs from 0.90-0.93 (Figure 2c; PR curves in Figure S1c).

164
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  8 The choice of k did not significantly influence performance; the auROCs for human liver 165 classifiers are 0.81, 0.82, 0.82, 0.82, respectively across a k of 4, 5, 6, and 7. We also explored the 166 application of classifiers based on more flexible k-mer features, i.e., the gappy and mismatch k-mer 167 kernels [48], but they did not significantly improve performance ( Figure S2

187
Classifiers generalized better to more closely related species; generalization was significantly 188 inversely correlated with the species' evolutionary divergence, as quantified by substitutions per neutrally 189 evolving site ( Figure S5, Pearson's r = -0.585, P = 0.022). Furthermore, classifiers trained to identify 190 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online Feb. 21, 2017; 9 enhancers in developing limb and brain also accurately generalized across species. The average relative 191 auROC for the developing limb classifiers was 95.0% across all species pairs ( CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  human, 0.76 in mouse) competitively with the VISTA-trained limb classifier itself (auROC = 0.81 in 217 human, 0.78 in mouse), suggesting that sequence properties predictive of histone-modification defined 218 enhancers are also predictive of transgenic assay validated enhancers. Thus, in spite of the limited number 219 and biases present in the sequences tested for enhancer activity by VISTA, these analyses demonstrate 220 that our models capture conserved sequence attributes of functionally validated enhancers. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online Feb. 21, 2017; GC-content sequences no longer received consistently high scores ( Figure S13). Ultimately, the strong 242 cross-species generalization of the GC-controlled classifiers suggests that enhancers differ from the 243 genomic background in higher order sequence patterns beyond GC-content, and that those patterns are 244 conserved.

245
The generalization of each liver GC-controlled classifier across species had the same pattern as 246 the classifiers without GC-control: the human classifier had the best generalization (average relative 247 auROC = 96.1%), while the opossum had the worst (average relative auROC = 92.8%). In these GC-248 controlled analyses, we observed a stronger inverse correlation between the relative performance across

267
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  However, the repeat and GC-controlled classifiers still generalized across species (

283
For consistency, we used H3K27ac without H3K4me3 to identify enhancers in these tissues. Next, we 284 compared the relative auROC between the cross-tissue and cross-species prediction tasks ( Figure 5a). In 285 the non-GC-controlled analysis, the human liver enhancer classifier predicts enhancers in macaque, 286 mouse, cow, dog and opossum better than all non-liver Roadmap tissues. In the GC-controlled analysis, 287 we observed the same trend. The cross-species predictions are also more accurate than cross-tissue 288 predictions, with the exception of the Roadmap gastric tissue (dark green), which is also a digestive 289 tissue. When compared to the relative auROCs of all pairwise cross-species analysis in liver, limb and 290 brain, those of human liver to non-liver Roadmap tissues are significantly lower (Figure 5b). In addition 291 to the human cross-tissue analysis, we also examined the cross-tissue performance of the liver, limb and 292 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online Feb. 21, 2017; 13 brain classifiers over all three species with enhancers in three tissues: human, macaque and mouse. For 293 each species, we applied the classifiers trained in liver, limb and brain to that species' enhancers in other 294 two tissues. Again, cross-species performance (all pairwise relative auROCs) was significantly higher 295 than cross-tissue performance in both GC-controlled and non-GC-controlled analyses (Figure 5b). The 296 ability of enhancers to regulate gene expression is often contingent on both cell-type specific attributes, 297 such as expression patterns of TFs [55], and properties that are shared across active enhancers in general.

298
The stronger performance of the trained classifiers in the cross-species compared to cross-tissue 299 prediction tasks suggests that they capture cell-type-specific sequence attributes and that these features 300 are conserved across species.

302
The most predictive sequence patterns in different species match binding motifs for many of the 303 same transcription factors

304
To interpret the biological relevance of the sequence patterns learned by the trained SVM enhancer 305 prediction models in each species, we analyzed the similarity of the sequence properties in their 306 functional context: TF binding motifs. First, we matched the 5% (n = 52) most enhancer-associated 5-307 mers learned by the human GC-controlled liver classifier to a database of 205 known TF motifs [56] 308 using TOMTOM (Figure 6a). The enhancer-associated 5-mers were significantly more likely to match at 309 least one TF motif than expected at random (46.1% vs. 27.7%; one-tailed P = 0.0035, binomial test). The 310 5% (n=52) most background-associated 5-mers were not significantly different from random (21.6% 311 matched at least one TF, two-tailed P = 0.43, binomial test). This illustrates that the classifiers learned 312 sequence patterns with regulatory potential.

313
Next, we investigated whether the TF binding motifs matched by enhancer-associated 5-mers 314 were shared between species. The highly weighted 5-mers in the human-trained classifier matched 121 315 TF motifs. Of these, the binding motifs for 33 TF were also matched by enhancer-associated 5-mers in all 316 other species (Figure 6b, Table S1). This is significant enrichment for shared TF motifs among the 317 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  enhancer-associated 5-mers; only 0.59 TF motifs were shared across all species on average over 100 318 random sets of 5% of 5-mers from each species (Figure 6c). Similarly, only one TF motif (MZF1) was 319 shared among all the species' most background-associated 5-mers. The GC-controlled limb and brain 320 classifiers also shared more TFs among the top 5% of enhancer-associated 5-mers than expected from 321 random sets: 12 TFs were shared among the limb classifiers and 22 were shared among the brain 322 classifiers vs. 8.1 shared TFs expected by chance. However, it is likely that the smaller number of 323 available species for developing limb and brain enhancers, our limited knowledge of binding motifs for 324 TFs active in developing limb and brain, and the heterogeneity of developing limb and brain tissue 325 reduced power to detect sharing compared to liver. We obtained similar results when comparing the TFs 326 matched by 5-mers from non-GC-controlled SVM models ( Figure S17).

327
To evaluate the relevance of the shared TF motifs to liver function, we evaluated the expression 328 patterns of the TFs across 12 tissues [57]. Shared TFs among liver enhancer-associated 5-mers were 329 significantly enriched for liver expression (

342
343 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Although the k-mer SVM model accurately classified many enhancers, its performance was not 347 perfect. We hypothesized that using models with the potential to learn combinatorial interactions 348 between short sequence patterns in enhancers could further improve performance. To model 349 these more complex patterns, we trained convolutional neural networks (CNNs) to distinguish 350 liver enhancers from the genomic background in each species. Here, we used a balanced dataset  Next, we performed the cross-species enhancer prediction with the CNNs. The CNN 361 models generalize well across species (relative auROC from 0.79 to 0.97), but their 362 generalization is consistently worse than the k-mer SVM models (Figure 7c; Raw auROCs and 363 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  auPRs is in Supplementary Figure 18c,d). This suggests that the combinatorial sequence patterns 364 captured by the internal layers of CNN models is less conserved across species then the 365 individual short sequence motifs. 366 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

367
In this study, we trained SVM and CNN classifiers based on DNA sequence patterns to distinguish 368 enhancers from the genomic background in diverse mammalian species. We then showed that, in spite of 369 significant changes in the enhancer landscape between species, the k-mer SVM models trained using short 370 sequence patterns as features exhibited minimal decreases in performance when applied across species.

371
This indicates that short sequence patterns predictive of enhancer activity captured by these models are 372 conserved across mammals. Furthermore, the DNA patterns most predictive of activity across species 373 matched a common set of TF binding motifs with enrichment for expression in the relevant tissues. The 374 sequence properties predictive of histone-mark defined enhancers were also predictive of enhancers 375 confirmed in transgenic reporter assays. We then showed that CNN models performed better than the 376 SVMs at identifying enhancers, but they generalized less well across species. These results suggest that 377 conserved regulatory mechanisms have maintained constraints on short sequence motifs present in 378 enhancers for more than 180 million years, while more evolutionary change in regulatory mechanisms has 379 occurred at the level of combinations of motifs, likely representing cooperative interactions between TFs.

380
Confidently identifying and experimentally validating enhancers remains challenging [50]. We 381 showed that short sequence properties are conserved across species using enhancers identified via two 382 complementary techniques: histone modification profiling and transgenic assays. Each of these 383 approaches has strengths and weaknesses. The histone modification based enhancer predictions enable 384 genome-wide characterizations across many species, but this approach is prone to false positives. On the 385 other hand, the transgenic assays clearly demonstrate the competence of a sequence to drive gene 386 expression, but are restricted to a biased set of relatively few sequences from two species that are tested at 387 one developmental stage. By showing the cross-species conservation is maintained in both categories, and 388 that models trained on each set perform similarly, we argue the conservation of enhancer short sequence 389 properties is robustness to the methodology used to define enhancers.

390
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  18 The design of this study can serve as a framework for further examining the conservation and 391 divergence of regulatory sequence patterns across species. We trained sequence-based machine learning 392 models within a species, and then applied them to other species; this approach can be applied on a 393 genome-wide scale, is not dependent on knowledge of TF binding motifs, and allows some flexibility in 394 the weights assigned to each feature while directly testing the generalization of overall sequence patterns.

415
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

417
We demonstrated that short DNA sequence patterns predictive of enhancer activity learned in one species 418 generalize very well to other mammals. Furthermore, deep neutral network models that can learn complex 419 combinations of short sequence patterns identified enhancers even more accurately, but generalized less 420 well across species. This suggests evolutionary conservation short sequence motifs, but turnover of their

426
There is also the potential to combine sequence-based models with successful cross-species enhancer 427 prediction strategies based on functional genomics data [66]. Nonetheless, much work remains to 428 understand how regulatory programs are robust to sequence changes, yet receptive to functional 429 divergence, and to facilitate our interpretation of the effects of non-coding variants in diverse mammals. 430 431 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  20 Methods 432 Genomic data 433 All work presented in this paper is based on hg19, rheMac2, mm10 (mouse liver dataset), mm9 (mouse 434 limb and brain dataset), bosTau6, canFam3 and monDom5 DNA sequence data from the UCSC Genome

446
We analyzed three multi-species histone-modification-defined enhancer datasets in this study.

447
The first consisted of liver enhancers identified by genome-wide ChIP-seq profiling of histone 448 modifications (H3K27ac without H3K4me3) in 20 species from five mammalian orders [12]. We 449 selected a member of each order with a high-quality genome build for analysis when possible; however, 450 the most diverged order-marsupials-did not have a species with a high-quality genome build. We

468
The third enahncer dataset contained human (N=48853), macaque (N=57446), and mouse (N=51888) 469 enhancers identified from profiling the H3K27ac modification in developing brain tissue [69]. For limb 470 and brain enhancers, we excluded regions within 1 kb of a transcription start site. For each species, we 471 combined the enhancer regions from different development stages. The genomic background regions for 472 each species were defined following the same procedure as for the liver enhancers.

481
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  22 In addition to the histone-modification-defined enhancers, we also analyzed enhancers validated 482 in transgenic reporter assays in embryonic day 11.5 mouse embryos from VISTA

498
To evaluate classifier performance within-species, we performed ten-fold cross validation. In other 499 words, for each set of positives and negatives, the entire data set was randomly partitioned into ten 500 independent sets that maintained the ratio of positives and negatives. Positives and negatives from nine of 501 the ten sets were then used to train the classifier, the trained classifier was then applied to the remaining 502 partition, and these predictions were used to evaluate the classifier. This process was performed ten times, 503 testing each partition once. To summarize performance, we averaged the auROC and auPR over the ten 504 runs. For cross-species classification, we trained on the whole dataset in the training species and 505 evaluated the performance on a random half of the dataset in the test species due to computational 506 limitations of the kebabs package.

507
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  We also evaluated more flexible models, such as the mismatch [48,71] and gappy pair kernels 508 [48,72], These k-mer-based prediction models are similar to the spectrum kernel, but the mismatch kernel 509 allows a maximum mismatch of m nucleotides in the k-mer and the gappy pair kernel considers pairs of k-510 mers with maximum gap of length m between them. For comparison, we trained the gappy pair kernel 511 with k = 2, m = 1 and mismatch kernel with k = 5, m = 1 to compare with the 5-mer spectrum kernel. The 512 mismatch and gappy pair kernels did not significantly increase the performance (auROCs of 0.82 and 513 0.82, respectively for liver) and are less interpretable than the k-mer spectra ( Figure S2). It is possible that 514 other parameter settings could yield slightly improved performance, but the resulting models would be 515 more difficult to interpret, and optimizing performance was not the goal of our study.

532
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online  24 A typical convolutional neural network consists of convolutional layers, max-pooling layers, fully 533 connected layers, and an output layer. To define the CNN structure ( Figure S19), we first defined a 534 hyperparameter space, including a range of learning rates, number of layers, the window size of the filters 535 (neurons), and regularization strength. Next, we trained 100 human CNN models to identify liver 536 enhancers using keras [75]

548
For direct comparison to the performance of CNNs, we also trained k-mer spectrum SVM models 549 for each species on the same balanced dataset at the CNNs. We compared the performance for k from 4 to 550 8 on this balanced human dataset; we report results for k of 6 based on it giving the best average auROC 551 in ten-fold cross-validation. The performance of within-species prediction is reported based on the 552 average auROC of ten-fold cross validation and the performance of cross-species prediction is reported 553 based on the auROC of predicting all data in the testing species. 554 555 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

628
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online Feb. 21, 2017;  Starting with liver, limb and brain enhancers and genomic background regions from six mammals, the first step of the pipeline quantified each of these genomic regions by their 5-mer spectrum-the frequency of occurrence of all possible length five DNA sequence patterns. Using the spectra as features, we trained a spectrum kernel support vector machine (SVM) to distinguish enhancers from non-enhancers in each species and evaluated their performance with ten-fold cross validation. Then, we applied classifiers trained on one species to predict enhancer activity in all other species. Finally, we evaluated the performance of cross-species prediction compared to within species prediction and compared the most predictive features in classifiers from different species. Limb and brain enhancer data were only available for human, macaque, and mouse.   Figure S4, Figure S6 and Figure S7.  obtains better performance when applied to liver enhancers from other species (gray dots) than . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/110676 doi: bioRxiv preprint first posted online Feb. 21, 2017; Fig 7. CNNs identify enhancers more accurately than 6-mer-based SVM models, but generalize less well across species. (a) The auROCs of CNN models perform substantially better than the 6-mer SVM model in each species. The error bars give the standard error of tenfold cross-validation for the SVM models. (b) Neurons in the first layer of the CNN learned the motifs of important liver TFs, including HNF4A, HNF1A, and CEBPB. (c) The relative auROCs of the CNN models applied across species are consistently lower than for the 6-mer SVMs applied across the same species. This suggests that the CNN models do not generalize as well across species as the SVM models.
Supporting information S1 Appendix. Supplementary figures S1-S19 S1 Table. Liver expression of the shared TF motifs in the liver GC-controlled analysis S2 Table. The sharing of the TF motifs matched by the top 5% positive 5-mers from each classifier.
. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Limb Limb e f Brain Brain . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.