Functional conservation of sequence determinants at rapidly evolving regulatory regions across mammals

Recent advances in epigenomics have made it possible to map genome-wide regulatory regions using empirical methods. Subsequent comparative epigenomic studies have revealed that regulatory regions diverge rapidly between genome of different species, and that the divergence is more pronounced in enhancers than in promoters. To understand genomic changes underlying these patterns, we investigated if we can identify specific sequence fragments that are over-enriched in regulatory regions, thus potentially contributing to regulatory functions of such regions. Here we report numerous sequence fragments that are statistically over-enriched in enhancers and promoters of different mammals (which we refer to as ‘sequence determinants’). Interestingly, the degree of statistical enrichment, which presumably is associated with the degree of regulatory impacts of the specific sequence determinant, was significantly higher for promoter sequence determinants than enhancer sequence determinants. We further used a machine learning method to construct prediction models using sequence determinants. Remarkably, prediction models constructed from one species could be used to predict regulatory regions of other species with high accuracy. This observation indicates that even though the precise locations of regulatory regions diverge rapidly during evolution, the functional potential of sequence determinants underlying regulatory sequences may be conserved between species.


Introduction
Epigenomic modifications such as histone modifications and DNA methylation play critical roles in development, regulation, and diseases.The study of epigenetic modifications has made great strides in recent decades, and the specific combinations of different epigenome components in distinct biological conditions are rapidly being discovered [1].In particular, epigenomic profiling is widely used to empirically identify regulatory regions including enhancers and promoters using chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq).For example, genomic regions enriched for histone H3 lysine 27 acetylation (H3K27ac) are considered as active enhancers [2,3].On the other hand, enrichment of histone H3 lysine 4 trimethylation (H3K4me3), in particular together with H3K27ac, indicates active promoters [4][5][6].
Beyond identifying regulatory regions, the next challenge is deciphering what factors determine and affect epigenomes.Among potential factors, the importance of cis-regulatory sequences on the epigenome is well appreciated.Several cis-regulatory sequences based predictive models have been constructed to classify regulatory regions [7][8][9][10].For example, a recent study reported random forest classifier models from the human genome that could predict regulatory regions marked by H3K27ac and H3K4me3 modifications with relatively high accuracy [11].
Even though our understanding of the true nature of the relationship between specific histone modifications and regulatory regions is sure to undergo much more revisions, these technical advances in genome-wide epigenomic profiling brought new approaches to study evolution of regulatory regions.Instead of having to rely on experimentally characterized comparative transcription factor binding assays [12][13][14] and/or regions that retain sequence similarities [15][16][17][18], enhancers and promoters can be identified based on the distribution of specific epigenomic modifications such as H3K4me3 and H3K27Ac across different species [6,19].Interestingly, these studies show that at the genome-scale, chromosomal locations of enhancers are highly divergent between species [6,20,21].Promoters are also found in divergent locations, although their positions are more constrained than enhancers, since promoters are typically adjacent to transcription units (e.g.[6]).Thus, while regulatory regions can be reliably predicted from sequences within specific genomes [7][8][9][10][11], the precise locations of regulatory regions, in particular of enhancers, diverge rapidly during evolution [6,18,20,21].
It is not necessarily straightforward to reconcile these two aspects of regulatory regions.In the simplest scenario, functional regions such as enhancers and promoters should be evolutionarily conserved since they are subject to purifying selection.Indeed, this idea has been successfully used to identify non-coding sequences with regulatory functions [16,17,22,23].
However, at the genome-scale, regulatory regions harbor little sequence similarities and their locations are highly divergent.Rapid turnover of transcription binding sites [12,24,25] and transcription rewiring [26][27][28] can explain some aspects of regulatory sequence evolution, but many questions still remain [29,30].
Here, utilizing the wealth of comparative data on epigenomically determined enhancers and promoters, we investigated whether we could identify specific sequence fragments that constitute enhancers and promoters, and if so, whether such sequence fragments were evolutionarily conserved between species.We first performed an exhaustive search to identify sequence fragments that are statistically over-represented in experimentally identified enhancers and promoters of several mammals [6].A unique aspect of our study is that we focused on distinguishing regulatory regions from nearby regions.Genomic sequences of mammals such as humans are highly heterogeneous in many aspects such as GC contents, transposable element contents, genic contents, and other aspects [31,32].By comparing regulatory regions to their nearby non-regulatory regions, we identified sequence fragments that distinguished regulatory regions from its local genomic backgrounds.Our comprehensive exhaustive search revealed numerous sequence fragments that were significantly enriched in regulatory regions compared to nearby regions.Due to the nature of the exhaustive search, some of the identified sequence fragments may be inter-related.To overcome this limitation and identify a subset of sequence fragments that are statistically independent, and to construct prediction models to test evolutionary hypotheses, we employed a machine learning method.Specifically, we used the least absolute shrinkage and selection operator (LASSO) method [33], which can effectively select one variable among the set of highly correlated variables [34].The LASSO method is also excellent at prediction accuracy [11,35].
From these procedures, we discovered numerous sequence fragments that are statistically enriched in experimentally verified regulatory regions (referred to as 'sequence determinants' henceforth).Intriguingly, sequence determinants obtained from enhancers and promoters show remarkable differences with respect to their impact on functional regions.Moreover, even though sequence determinants themselves exhibit only moderate overlaps between species, prediction models constructed using sequence determinants from different species could be inter-changed to perform as well as prediction models from the focal species.We discuss potential implications of these findings.

Enhancer and promoter data
We used experimental annotations of liver enhancers and promoters from a previous study [6].Following the definition in this study [6], we considered enhancers to be regions marked only with the H3K27ac mark and promoters to be regions marked with H3K4me3 (with or without H3K27ac).We selected data from seven 'high-quality' mammalian genomes as indicated in [6], including Home sapiens (human), Macaca mulatta (macaque), Bos taurus (cow), Sus scrofa (pig), Canis familiaris (dog), Rattus norvegicus [32], and Mus musculus (mouse).Each enhancer or promoter was designated as foreground, and a segment of the same length 100,000 base-pairs (100kb) apart from the foreground was selected as the background.We used these 'regional' backgrounds to control for potential chromosome effect and/or regional effects.The distance of 100kb between the foreground and background was selected since several genomic features such as linkage disequilibrium blocks and GC contents show correlations that extend to ~100kb [32,36].We obtained the genome sequences using the R Bioconductor libraries "BSgenome" [37].Backgrounds that had greater than 50% of nucleotides missing (not sequenced) were discarded (Table 1), and put information on overlapped proportions between foreground and background in S1 Table.

Enhancer and promoters in orthologous locations
Those enhancers and promoters found in orthologous locations across species were identified as conserved (Table 1).Specifically, for each human enhancer or promoter we retrieved the 17 eutherian EPO multiple alignment using Ensembl REST API [38] and determined if the region was conserved or not based on whether all other 6 species also showed the same histone mark (s) in the orthologous region.For species with different genome assemblies in the alignment, we converted the coordinates using Ensembl assembly converter [39].

Exhaustive search for sequence determinants
We examined whether specific sequence fragments in the foreground were over-represented compared to the backgrounds by statistical testing.We used sliding windows with a specific length (from 6-mers to 15-mers), moving from the 5' end to the 3' end in each foreground or background (Fig 1).As the window moved by a base-pair (bp), a sequence fragment within that bin was captured and recorded.Following this sliding window analysis, counts of each sequence fragment in the foreground and background were obtained.For each sequence fragment, we constructed a 2×2 contingency table that contained counts of a sequence determinant in each of foreground and background region (Table 2), and we used the odds ratio (OR) as a measure of over-representation in foreground, compared to background.The magnitude of OR indicated how strongly over-enriched a specific element was in regulatory regions, which we also referred to as 'effect size' in this study.
We used the χ 2 test to test the following null and alternative hypotheses: If the expected count of a sequence fragment in any of the cell in the 2×2 contingency table was lower than 5, we used the Fisher's exact test instead.The resulting P-values were corrected for multiple testing using the false discovery rate (FDR) approach [40].Following these procedures, a 'sequence determinant' in the statistical sense was identified as a sequence fragment whose FDR Q-value was equal to or less than 0.05 and the OR was greater than 1.In the process, we tested only sequence determinants that appeared over 100 times to avoid selecting rare sequence determinants of negligible biological relevance.For example, for 15-mers in the human enhancer data set, most sequence fragments (63 million out of 70 million) occurred only once.We repeated this procedure for each of the seven species and identified 'species sequence determinants'.(B) Using the counts from the sliding window, a χ 2 test for each sequence fragment (of each species) is conducted to determine whether that sequence fragment is significantly overrepresented in the foreground compared to the background.Significantly over-represented sequence fragments are named 'species sequence determinants'.In addition, the CMH test [34] is used to detect sequence determinants that are present in all seven species ('common sequence determinants').(C) The least absolute shrinkage and selection operator (LASSO) prediction models are constructed for each species using a subset of sequence determinants from (B).These sequence determinants are randomly selected within the stratification of GC content and fragment length.After LASSO selection, which removes some of redundant or non-significant determinants, the resulting models were applied to its own species (same-species prediction), or the other six species (inter-species prediction) to evaluate prediction performances. https://doi.org/10.1371/journal.pcbi.1006451.g001

Count of a target sequence fragment 1
Count of the other sequence fragments 1  Total count 1 Numbers in the parenthesis are expected numbers to calculate the χ 2 test statistics.For example, for a sequence fragment (e.g."AACCGGTT"), N 11 is its observed count in the foreground regions, N 12 is the observed count of sequence fragments that are not "AACCGGTT", but has the same length in the foreground regions, and N 21 and N 22 are the counterpart of the N 11 and N 12 in the background regions, respectively.The sign "+" means row-wise sums (N 1+ , N 2+ ), column-wise sums (N +1 , N +2 ), or total sum (N ++ ).OR is estimated as N 11 ×N 22 /(N 12 ×N 21 ). https://doi.org/10.1371/journal.pcbi.1006451.t002

Common sequence determinants
We identified 'common sequence determinants' as sequence fragments that are enriched in foreground regions compared to the background regions across the seven mammalian species.
For the purpose, we used the Cochran-Mantel-Haenzel (CMH) test [41] to identify enrichment of sequence determinants from multiple data sets using a conditional variable, which is a nominal covariate such as the species index [41,42].The CMH test is also equivalent to the score type test of logistic regression, which has advantages in the handling of sparse count data sets [42].Consequently, we used the CMH to test the null hypothesis, where OR jspecies is the conditional OR in presence of the species index: ð3Þ Common sequence determinants were then defined as those whose OR |species >1 for all species and FDR Q-value from CMH � 0.05.

LASSO prediction models using sequence determinants from exhaustive search
We constructed prediction models that yield predictive scores for each region.We used the least absolute shrinkage and selection operator (LASSO) method [33], which excels at prediction accuracy as well as covariate selection [11,35].In the LASSO model, each foreground or background region was regarded as a binary observation (foreground = 1, background = 0).The relative frequency of each sequence determinant was regarded as an explanatory variable.Because the space of all significant sequence determinants was extremely large (S2 and S3 Tables), including all determinants in the LASSO model was not computationally feasible.Instead, we selected 10,000 sequence determinants, sampled according to their distribution of GC content and fragment length, to incorporate in the LASSO models using a stratified sampling approach [43].Specifically, we stratified the whole sequence determinants by the combination of GC content (ten uniform intervals: [0~0.1],. .., (0.9~1.0]) and length (ten lengths: 6,. ..,15bp).Then we selected samples from each of the stratified subsets so that its number out of the 10,000 was proportional to the number of determinants in the specific subset among the total determinants.To train LASSO models and estimate coefficient of each determinant, we used the R function "glmnet" from the package "glmnet" using R 3.4.0.
To construct prediction models, we used both the 10,000 species sequence determinants and the 10,000 common sequence determinants as input variables, so that we can compare the prediction performances of species determinants and common determinants.We performed two types of predictions.First, we performed same-species prediction, which evaluates prediction AUC through a 10-fold cross-validation process [11,35,44,45].During the 10-fold cross-validation process, an optimal penalty parameter that provides the smallest test AUC is chosen.We regarded the smallest test AUC as same-species prediction AUC.For inter-species prediction, we used the optimal parameter to construct a prediction model from whole data set of a species and applied the model to the other species to calculate inter-species prediction AUCs.Workflow from the exhaustive search to LASSO is depicted in Fig 1 .In most prediction results, we provided two types of AUC, the first one is receive operating characteristic AUC (ROC-AUC) for general performance of prediction and the second one is precision-recall AUC (PR-AUC) for robustness of performance regardless of the ratio between numbers of foreground and background [46].
Among several machine-learning methods, we selected LASSO because of its ability to reduce the number of input variables so that those are not redundant and are statistically meaningful.However, other machine learning methods might be useful as well.For example, when many of sequence determinants have strong relationship in terms of correlation, elastic net that can capture more input variables would be useful to improve prediction performances [47].

Transcription factor binding sites (TFBS) analysis
We examined the presence of transcription factor binding sites (TFBS) in the sequence determinants using TOMTOM [48].This tool assesses the similarity between individual sequence input and specific TFBS databases and provides P-values and Q-values adjusted by FDR.Known TFBS compiled in the JASPAR 2014 Core vertebrate database [49], the HOCOMO-COv10_HUMAN and the HOCOMOCOv10_MOUSE [50] were used.We summarized the proportion of significant (P <0.05) TFBS hits as 'TFBS frequency'.For example, each human sequence determinant was compared to the 641 known TFBS in the HOCOMOCOv10_HU-MAN database.The number of significant comparisons out of the total 641 comparisons was referred to as 'TFBS frequency'.Due to the probabilistic nature of TF binding and the fact that sequence determinants might encode partial or full TFBS, TFBS frequency indicates versatility of a sequence determinant that can be a motif for TFBS binding.For instance, the CAGCCC determinant from the human genome yielded 18 of 641 significant hits, thus TFBS frequency of the determinant was 2.8%.We also used-log 10 min(P) instead of TFBS frequency to evaluate the best match between a k-mer and the motifs in the database.

Analysis of the relationship between biological factors of sequence determinants
Sequence determinants from the exhaustive search as well as from the LASSO prediction models were further analyzed to explore relationships between their effect sizes and several biological factors such as GC content and TFBS binding properties.For this analysis, we used the following linear model; where i is the index of each sequence determinant and ε i ~N(0,σ 2 ).In this model, we log 2 transformed the OR values to improve normality.We applied the model to enhancer and promoter sequence determinants from common, human, and mouse sets.

Promoter sequence determinants are strongly over-represented relative to enhancer sequence determinants
To identify sequence fragments that are significantly enriched in enhancers or promoters compared to nearby background regions (sequence determinants), we first performed an exhaustive search.Briefly, we examined sequence fragments of lengths from 6 to 15 bp, using a sliding window approach (Fig 1).We tested statistical over-representation of the specific sequence fragment in the enhancers or promoters compared to their backgrounds using a contingency table test based on their ORs.The P-values were adjusted via the false discovery procedure [40] (Materials and Methods).Following these procedures, we identified numerous sequence determinants associated with enhancers and promoters of each species (referred to as 'species sequence determinants', Materials and Methods).Fig 2(A) and 2(B) show the numbers of significant sequence determinants from human enhancers and promoters based on their OR and length.The majority of sequence determinants in enhancers and promoters were found in 7-11 bps.Human enhancer determinants were slightly yet significantly longer than promoter determinants (mean lengths for human enhancers and promoters were 9.20 and 9.01, P <1×10 −5 by two sample t-test).However, there was no consistent pattern across the seven mammals when comparing the length of sequence determinants in enhancers and promoters.Sequence determinants were also generally GC-rich and TFBS-rich compared to non-significant sequence fragments (see below).Remarkably, with respect to OR, sequence determinants from enhancers and promoters were highly distinct.Strongly enriched sequence determinants, such as those with OR � 2.0, were 140-fold more abundant in promoters than in enhancers (Fig 2).Accordingly, the ORs of sequence determinants were significantly higher in promoter sequence determinants than in enhancer sequence determinants (P < 10 −15 by Wilcoxon's rank sum-test in all seven species, Fig 3).
We then examined sequence determinants that occurred more frequently than expected in all seven mammalian species, which we referred to as 'common sequence determinants' (Materials and Methods).Similar to the results from the above analysis, common sequence determinants had higher ORs in promoters than in enhancers (P < 10 −15 by Wilcoxon's rank sumtest, Fig 2(C) and 2(D), S4 Table ).When we compared the entries of common sequence determinants to those of species sequence determinants, we found that 39% and 57% of all human enhancer and promoter determinants overlapped with common enhancer and promoter sequence determinants, respectively (S5 Table).Therefore, regardless of their species-wise distribution, sequence determinants that mark promoters tended to have significantly greater OR thus presumably stronger effects on regulatory potential of target regions in terms of marginal effect size, compared to those found in enhancers.

LASSO approach supports different effect sizes of enhancer and promoter sequence determinants
The exhaustive search allowed us to identify all sequence determinants that were marginally enriched.However, some sequence determinants might be highly correlated with each other, because they were extracted from overlapping regions (Fig 1).The LASSO approach is capable of selecting one variable among the highly correlated variable sets, in addition to selecting variables of substantial effect [33].Therefore, we next used the LASSO approach to select essential variables among the many correlated variables, and to construct prediction models that discriminate enhancers and promoters from their corresponding background regions (Materials and Methods).The total numbers of sequence determinants from the human enhancers and promoters were 107,287 and 101,625, respectively (S2 and S3 Tables).AUCs increased as the number of input sequence determinants increased, to stabilize around 7,000 sequence determinants (S1 Fig) .We thus chose 10,000 sequence determinants for each set of sequence determinants using a stratified sampling approach [43], to select a subset that is representative of the original distribution with respect to GC contents and lengths (Materials and Methods).Following these steps, prediction models were constructed for both same-species prediction and inter-species prediction.
We investigated the distribution of ORs and the lengths of selected sequence determinants from the LASSO approach ('LASSO-selected sequence determinants'), and from same-species prediction.The same-species prediction model of human enhancers and promoters had a total of 4321 and 1343 LASSO selected sequence determinants, respectively (S6 and S7 Table ).Consistent with the results from the exhaustive search, marginal ORs from the enhancer models were significantly lower than those from the promoter models in all species (Wilcoxon test, P < 10 −15 , S2 Fig).We investigated the relative frequencies of individual LASSO-selected sequence determinants in foreground and background regions, shown as density plots in S3 Fig.In promoters, marginal density of the relative frequencies of LASSO-selected sequence determinants is highly distinct from that of the background, which is consistent with the high effect size of LASSOselected promoter sequence determinants.On the other hand, marginal densities of LASSOselected enhancer sequence determinants are similar to those in the background.This observation indicates that in addition to having weaker marginal effects than promoter sequence determinants, the frequency distribution of enhancer sequence determinants is similar between foreground and background.
Interestingly, LASSO selected sequence determinants were significantly longer for enhancers than for promoters (mean lengths of 9.22 in enhancers and 8.32 in promoters in human, P <1×10 −15 by two sample t-test, S6 and S7 Table ).This pattern was consistent in other species (P < 10 −15 by two-sample t-test in all cases).When we applied LASSO approach to 10,000 common sequence determinants, we observed similarly significant differences of effect size and length between enhancer and promoter sequence determinants (S2 and S4 Figs).

Distinctive effects of GC content and TFBS frequency on enhancer and promoter sequence determinants
We examined two aspects of sequence determinants to understand what features affect enhancer and promoter potentials of specific sequence fragments.Specifically, we used a linear model to analyze the effect of the frequency of G and C nucleotides (GC content) and the frequency of transcription factor binding sites (TFBS frequency).The effect sizes of sequence determinants were response variables, and GC content, TFBS frequency, and their interaction term were explanatory variables.When we analyzed the results of the LASSO-selected sequence determinants, several patterns became clear.First, this model explained a large amount of variation observed in promoter sequence determinants, but only a modest portion of those in enhancer sequence determinants (Table 3).Nevertheless, we found that main factors of GC content and TFBS frequency were positively correlated with the log 2 -transformed OR of sequence determinants both in enhancers and promoters (Table 3, S8 and S9 Tables).However, interaction terms between the two main factors were significantly negative only in promoters.Thus, while GC content and TFBS frequency worked additively to determine the strength of regulatory potential for enhancer sequence determinants, these two factors were antagonistic with each other in promoter sequence determinants (Fig 4(A) and 4(B)).This observation is consistent with previous studies that found a lack of transcription factor binding enrichment at GC-rich promoters compared to GC-poor promoters [51].We also evaluatedlog 10 min(P) instead of TFBS frequency to evaluate the best match between a k-mer and the motifs in the database, and obtained highly similar results for the same models (S10 Table ).In summary, TFBS frequency was positively correlated with effect size in both of enhancer and promoters when GC content was low.On the other hand, the estimated coefficients of GC content and TFBS frequency were higher in promoters than in enhancers, indicating that the effects of these factors were stronger in promoters compared to in enhancers.Accordingly, the R 2 of the linear models were substantially higher for promoters than for enhancers (Table 3, S8 and S9 Table ).Second, the relationships between GC contents and TFBS frequency were negative in both of enhancer and promoter analysis (Fig 4(C) and 4(D)).Accordingly, sequence determinants that were GC-rich tended to lack TFBS, and low GC sequence determinants tended to harbor more TFBS than high GC sequences [51].The whole set of sequence determinants obtained from exhaustive search yielded similar results (S11 Table ).

LASSO prediction models can be inter-changed between species
The prediction accuracy of the human promoter same-species prediction model was very high, with an AUC of 0.97 (Fig 5).Same-species prediction models from other six species exhibited similarly high AUCs (S12 and S13 Table ), indicating that promoters can be accurately predicted from sequence determinants.We also evaluated prediction AUCs using 10,000 non-sequence determinants, while matching the distributions of GC content and length as those of sequence determinants.We then constructed prediction models using LASSO for enhancers and promoters in human and mouse, respectively.We iterated the process five times to measure variability of the AUCs.Results are shown in S6 Fig.The AUCs of models using non-sequence determinants were lower than AUCs with sequence determinants.For example, human and mouse enhancer prediction AUCs with non-sequence determinants showed 0.507 and 0.002, and 0.500 and 0.007 for mean and standard deviation, respectively.These results indicate that non-sequence determinants had poor prediction performances.In case of promoters, the mean and standard deviation of AUCs were 0.636 and 0.006 for human, and 0.608 and 0.004 for mouse, respectively.These values were higher than those of enhancers, likely reflecting the effect of GC contents (e.g., [52]).Nevertheless, they were substantially lower than the AUCs with sequence determinants, indicating that sequence determinants have superior prediction performances than non-sequence determinants.
Next, we tested if prediction models constructed from one species could be used in different species, to investigate if different genomes use similar sequence determinants to encode promoters.Indeed, when we calculated AUCs of inter-species prediction between seven species of promoters, the AUCs were all above 0.9, indicating high accuracy (Fig 5).
On the other hand, the LASSO prediction models of enhancers had the following differences from those of promoters.First, the enhancer models using 10,000 species determinants had 2.5-to 4.2-fold greater numbers of explanatory variables than the promoter models (S12 Table ).However, their AUCs were generally lower than those of the promoter models (Fig 5).We found that same-species prediction AUCs for enhancer models were greater than 0.7, and the highest was when mouse model were used to predict mouse enhancers, 0.76 (S12 Table ).Nevertheless, inter-species prediction results using enhancer models showed similar AUCs to same-species enhancer predictions (Fig 5).
We tested if the high inter-species prediction accuracies were driven by the presence of highly conserved regulatory elements across different mammalian species.The proportions of conserved enhancer regions among the seven species were much smaller than those of promoter regions, as previously described [6] (Table 1).Interestingly, we observed similar AUCs before and after removing highly conserved regulatory regions at both enhancers and promoters (S14 Table ), suggesting that conserved regulatory regions were not responsible for the high predictabilities across species.We then extracted 10 subsets of 10,000 sequence determinants from human enhancer and promoter sequence determinants (all subsets were mutually exclusive with each other subset) and constructed LASSO models to apply to the samespecies (human) prediction and inter-species (mouse enhancer) prediction.We found that the AUCs of these 10 subsets were highly similar (S6 Fig) .Thus, even though the regulatory regions themselves were not conserved in terms of their precise location, mammalian enhancers and promoters have inter-changeability in terms of prediction between species.We also constructed LASSO models using 10,000 common sequence determinants from all seven species.AUC values for promoter prediction were highly similar to those obtained from models using species sequence determinants (Fig 5,S13 Table), indicating that sequence fragments that were commonly enriched in all 7 species harbor sufficient signals for promoter prediction.On the other hand, enhancer prediction results using 10,000 common sequence determinants showed slight decrease of AUC compared to same-species prediction (mean AUC of prediction with species determinants: 0.723, that with common determinants: 0.679).Interestingly, mean numbers of LASSO selected common sequence determinants were significantly lower than the species ones in enhancers (1613 and 3970 for common sequence determinants and species sequence determinants, respectively; P <1×10 −5 by paired t-test), while they were not significantly different in promoter models (1138 and 1342 for common sequence determinants and species sequence determinants, respectively; P = 0.2114 by paired t-test).This implies that each of the common enhancer sequence determinants may have higher predictive capabilities than species sequence determinants.

Impact of sequence composition and prediction performance
While background and foreground of enhancers exhibit similar GC distribution, foreground regions of promoters are substantially skewed towards GC-rich regions (the average difference was 10.0%, higher in promoters than in enhancers) (S7 Fig) .Therefore, we investigated how GC content difference between foreground and background might affect prediction analyses.First, to measure the impact of GC content alone in prediction performances, we calculated AUCs using only GC content as a predictor (Table 4).Second, we constructed LASSO models using sequence determinants of low-GC content (GC content �0.5) to measure prediction performances without effects of high GC content sequence determinants.For this analysis, we randomly selected 10,000 sequence determinants with stratification of GC content and sequence length.These results were then compared to those of the original AUCs.
We found the AUCs using only GC content reflected the amount of GC content differences between foreground and background (S7 Fig) .For example, average AUCs using only GC content were 0.589 and 0.837 in enhancers and promoters, respectively.However, both of those AUCs were considerably lower than the original AUCs (differences of 0.134 and 0.107 in enhancers and promoters, respectively), meaning that GC content could not explain all of the variation between foreground and background.This observation is consistent with a prior study utilizing a similar approach [52].Moreover, models with low-GC sequence determinants had higher AUCs than those using only GC contents.In other words, models without high GC content sequence determinants outperformed the AUCs with only GC contents.Interestingly, mean AUCs with low-GC sequence determinants in enhancers were even higher than those of the original AUCs, which may imply that low-GC enhancers sequence determinants had better prediction performances than high-GC sequence determinants when they were jointly used for prediction.In conclusion, prediction performances of the sequence determinants detected by LASSO cannot be attributed to their GC contents.

Discussion
Understanding specific histone modifications marking enhancers and promoters has opened the way to identify these regions using ChIP-seq, which complements and scales up traditional transcription factor binding assays [1,6,53].Even though our understanding of the exact molecular nature of regulatory regions continues to improve, technical advances in epigenomic assays have opened a new opportunity to study evolution of regulatory regions using unbiased genome-wide epigenomic profiling.We were motivated by two observations: that regulatory regions identified from epigenomic assays can be predicted with high accuracy in case of same-species prediction [7][8][9][10][11], yet that they are highly divergent between different species [6,18,20,21].The fact that regulatory regions can be predicted with high accuracy implies that specific sequence fragments can encode regulatory function.Indeed, previous studies often referred to such fragments as cis-regulatory motifs.Since they encode function, they are likely to be subject to natural selection (largely purifying selection) and thus evolutionarily conserved.However, genome-wide studies indicate that regulatory regions, especially enhancers, are highly divergent between species.To investigate this potentially paradoxical pattern of evolution of regulatory regions, we used a powerful approach to examine every possible sequence fragments for their statistical enrichment in experimentally verified enhancers and promoters of seven mammalian species.This approach, which we named exhaustive search, revealed that numerous sequence fragments were statistically overrepresented in enhancers and promoters (which we named as sequence determinants).Sequence determinants underlying enhancers and promoters exhibited intriguing differences with respect to their degree of enrichment (effect size), GC content, and the frequencies of known TFBS.Notably, the degree of statistical enrichment was significantly higher for promoter sequence determinants compared to enhancer sequence determinants.This observation suggests that sequence determinants may have greater impacts on the regulatory potential of promoters than of enhancers.This idea is also consistent with the fact that promoters are more evolutionarily conserved than enhancers [6].
We next applied a machine-learning method, LASSO, to reduce interdependence among sequence determinants and construct prediction models based on the non-redundant sequence determinant set.Same-species prediction models generated from these sequence determinants had high AUCs for enhancers and promoters (Fig 5 and S12 and S13 Tables), affirming the predictor power of sequence determinants [11,52].The AUCs from these models are on par with those from previous studies that utilized different approaches (e.g., [11]).We observed that enhancer models utilized greater numbers of predictors yet exhibited lower accuracy compared to promoter models, which can be explained by promoter sequence determinants associated with significantly higher effect sizes compared to enhancer sequence determinants (Figs 2 and 3, S2 and S4 Figs).Furthermore, we applied prediction models generated from one mammal to other mammals, to directly test whether sequence determinants from one species could be used to predict regulatory regions in other species.Remarkably, even though the sequence determinants themselves had only moderate overlaps between species (S5 Table ), models constructed from one species could predict promoters in other species with high accuracies (S12 and S13 Tables).As for enhancer models, AUCs from inter-species prediction models were also comparable to same-species predictions (Fig 5).In other words, the extent to which prediction models could be inter-changed between species was similar between enhancers and promoters (Fig 5).
We used a cutoff effect size for sequence determinants as 1, for the following reasons.First, many sequence determinants have extremely low p-values despite low effect sizes due to their abundance, especially those with shorter lengths.For example, 25% of human enhancer sequence determinants among those of top 10,000 lowest p-values have effect sizes smaller than 1.2.Second, when we constructed a human enhancer prediction model using randomly selected 10,000 sequence determinants with effect sizes smaller than 1.2, the resulting AUC was 0.715, which is equivalent to the original AUC.Moreover, when we applied this model to mouse, the inter-species AUC was 0.680, even higher than the original AUC (0.647).Therefore, setting an arbitrary cutoff value is likely to result in the loss of true sequence determinants that are important in terms of prediction performances.
Integrating the main findings that 1) there are a large number of sequence determinants that potentially contribute to the regulatory roles of enhancers and promoters; 2) the strength of statistical enrichment of sequence determinant is greater for promoters, which are more evolutionarily conserved than enhancers; 3) prediction accuracies of models generated using sequence determinants from different species are comparable to each other, we hypothesize the following.Even though the specific motifs that encode regulatory regions are different between species [6,18,20,21], the function of specific sequence determinants could be conserved between species.There may exist a large reservoir of potential sequence determinants that can contribute to regulatory regions of many species.

Fig 1 .
Fig 1.Overall workflow for the exhaustive search and construction of prediction models.(A) Each sequence fragment is counted and summarized in a k-mer sliding window for foreground (regulatory regions) and background (control).(B) Using the counts from the sliding window, a χ 2 test for each sequence fragment (of each species) is conducted to determine whether that sequence fragment is significantly overrepresented in the foreground compared to the background.Significantly over-represented sequence fragments are named 'species sequence determinants'.In addition, the CMH test[34] is used to detect sequence determinants that are present in all seven species ('common sequence determinants').(C) The least absolute shrinkage and selection operator (LASSO) prediction models are constructed for each species using a subset of sequence determinants from (B).These sequence determinants are randomly selected within the stratification of GC content and fragment length.After LASSO selection, which removes some of redundant or non-significant determinants, the resulting models were applied to its own species (same-species prediction), or the other six species (inter-species prediction) to evaluate prediction performances.

Fig 2 .
Fig 2. Heatmaps of the sequence determinant counts according to sequence length and OR from the exhaustive search.The X-axis corresponds to the length of sequence determinants and Y-axis to the OR of each sequence determinant.(A) and (B) are from human species sequence determinants while (C) and (D) are from common sequence determinants.The count information of species sequence determinants in the other six species are summarized in S2 and S3 Tables.OR in the heatmaps for the common sequence determinants represents the minimum OR value among all seven individual ORs.The total numbers of sequence determinants are 107,287 in (A), 101,625 in (B), 69,503 in (C), 59,783 in (D).https://doi.org/10.1371/journal.pcbi.1006451.g002

Fig 3 .
Fig 3. Contrasting odds ratio distributions of enhancer and promoter sequence determinants using boxplots.Sequence determinants from promoters have significantly higher Odds Ratio than those from enhancers in all seven species (P < 10 −15 by Wilcoxon's rank sum-test in all species).https://doi.org/10.1371/journal.pcbi.1006451.g003

Fig 4 .
Fig 4. Relationships between several biological variables using LASSO selected sequence determinants from human.First, we present the results from our linear model in Table 3 ((A): Enhancer model results, (B): Promoter model results).To demonstrate the relationship between log 2 OR and TFBS frequency, we regressed out GC content on log 2 OR and drew scatterplots between the resulting partial residual of log 2 OR and TFBS frequency.We further separated the points into two groups, above and below 0.5 GC content.As seen in Fig4(A) and 4(B), a clear interaction effect was detected only in the promoter model, and TFBS frequency for low GC content is positively correlated with log 2 OR in both models, although the positive correlation is clearer in the promoter model (R 2 : 0.012 and 0.021, P-value: 2.7×10 −5 and 0.026, for enhancer and promoter, respectively).In Figs4(C) and 4(D), the negative relationships between GC content and TFBS frequency in enhancers and promoters are depicted in comparison to the background.The green and blue points are results from LASSO selected sequence determinants, while the gray points are control data sets consisting of randomly selected sequence fragments that are not sequence determinants.https://doi.org/10.1371/journal.pcbi.1006451.g004

Fig 5 .
Fig 5. Prediction accuracies of LASSO models as measured by AUCs.(A) ROC curves of the human same-prediction result from ten-fold cross validation.Solid lines represent prediction with human species sequence determinants, while dashed lines represent those with common sequence determinants.(B) Cross-species prediction AUCs based on human prediction models.The Y-axis represents relative AUC value calculated as the ratio between cross-species prediction AUC and same-species prediction AUC based on the human models constructed using human species sequence determinants (circle or cross marks) and common sequence determinants (triangle or "x" marks), respectively.(C) PR curves of the human same-prediction result from ten-fold cross validation.(D) Cross-species prediction results PR-AUCs based on human prediction models.https://doi.org/10.1371/journal.pcbi.1006451.g005

Table 3 . Linear model results of log 2 OR ~GC content + TFBS frequency + GC content × TFBS frequency + ε.
We used LASSO-selected species sequence determinants for these analyses.NS indicates that the interaction terms were not statistically significant at P = 0.05.In such cases we conducted log 2 OR ~GC content + TFBS frequency + ε model instead of the original model.Numbers of sequence determinants, R 2 values of the models, and Type III partial sum of square in regression (SSR) for each variable are also provided.https://doi.org/10.1371/journal.pcbi.1006451.t003