Conserved and divergent signals in 5’ splice site sequences across fungi, metazoa and plants

In eukaryotic organisms the ensemble of 5’ splice site sequences reflects the balance between natural nucleotide variability and minimal molecular constraints necessary to ensure splicing fidelity. This compromise shapes the underlying statistical patterns in the composition of donor splice site sequences. The scope of this study was to mine conserved and divergent signals in the composition of 5’ splice site sequences. Because 5’ donor sequences are a major cue for proper recognition of splice sites, we reasoned that statistical regularities in their composition could reflect the biological functionality and evolutionary history associated with splicing mechanisms. Results: We considered a regularized maximum entropy modeling framework to mine for non-trivial two-site correlations in donor sequence datasets corresponding to 30 different eukaryotes. For each analyzed species, we identified minimal sets of two-site coupling patterns that were able to replicate, at a given regularization level, the observed one-site and two-site frequencies in donor sequences. By performing a systematic and comparative analysis of 5’splice sites we showed that lineage information could be traced from joint di-nucleotide probabilities. We were able to identify characteristic two-site coupling patterns for plants and animals, and propose that they may echo differences in splicing regulation previously reported between these groups.


A Maximum Entropy Models
Maximum entropy models are based on information theory principles [1], and are applied to the problem of inferring joint probability functions P (x) satisfying constraints f obs ν related to ν = 1, ..., M experimental observations.An estimator P for the sought probability distribution is obtained by maximizing the entropy functional S, which is defined in Equation 1: The first term expresses the Shannon entropy for the distribution, while the second term enforces a normalization constraint on P (x).The third term carries the influence of the deviations between observed frequencies f obs ν (be they singlesite, two-site or otherwise) compared to ⟨f ν ⟩ P , their modelled counterparts as estimated from the modelled probability function P (x).Lagrange multipliers λ ν weigh different contributions for the constraint corresponding to different observed frequencies (λ 0 merely weighs normalization).
The purview of this study was to estimate P ( ⃗ S) in regard to an ensemble of 5 ′ splicing sites sequences { ⃗ S}.This ensemble consists of separate L = 9base long sequences as explained in the main manuscript, referred to as ⃗ S = (s −3 , s −2 , s −1 , s 1 , s 2 , s 3 , s 4 , s 5 , s 6 ) where s i ∈ {A, C, G, T }.The observed nucleotidic apparition frequencies at each site f i (s i ), and the two-site correlations between bases f ij (s i , s j ) are used as model constraints to infer P ( ⃗ S).These observations represent 36 1-site and 576 two-site frequencies, for a total of M = 612 fitting parameters in the probabilistic model.The observed frequencies in 1 can thus be expressed as {f obs ν } ={f −3 (A), f −3 (C), ..., f i (s i ), ...f +6 (T ), f −3,−2 (A, A), ..., f ij (s i , s j ), ..., f +5,+6 (T, T )} (2) so that the entropy functional can be rewritten as shown in Equation 3, where h i (s i ) and J ij (s i , s j ) denote the one-site and two-site fitting parameters, respectively: It can be appreciated that the entropy functional in Equation 3 includes terms that penalize deviations between the observed frequencies and the probabilities obtained from the model P ( ⃗ S).The Lagrange multipliers, h i (s i ) and J ij (s i , s j ), weigh the contribution of these deviations in the overall entropy.The goal is to find the values of these fitting parameters that maximize the entropy functional, resulting in the estimated probability distribution P ( ⃗ S).Optimizing this functional leads to the following expression: After taking into account the normalization condition: with where Z = ⃗ S e −E d ( ⃗ S) is a normalization constant.After estimating the fitting parameters (see Section B of the supplementary material) the estimated P should fulfill:

B Fitting procedure
The technique of gradient descent was used to fit 612 the parameters which constitute our model.Each iteration generated an ensemble of 100000 sequences via a Metropolis-Hastings rejection sampling scheme compatible with the probability distribution function described by Eq. 5, using the current model parameters.The iteration then updated said parameters according to the following rules: If J t ij = 0: The regularization parameter γ embodies the role of a threshold preventing weaker influences from creating nonzero coupling parameters, and extinguishing coupling parameters when their associated two-site probabilites are weak enough.

C The statistical model
A family of models was fitted for each analyzed species.As the regularization parameter γ was decreased, models increased their complexity in a controlled fashion.This strategy thus yielded minimal sets of coupling constants J ij necessary to adjust the observed two-site frequencies f ij at a given accuracy level.
as a function of the number of non-zero coupling constants identified at different regularization levels (different point shape, γ ∈ [0.01, 0.05]) for human donor sequence models.Inset: coupling constants J ij (s i , s j ) as a function of observed two-site correlations σ nat i,j for the γ = 0.045 model.Right panel: r 2 correlation coefficient between modeled and observed two-site correlations as a function of the regularization parameter γ.Insets: relationship between modeled, σ model i,j and observed, σ nat i,j , two-site correlations at γ = 0.045 and γ = 0.02.

The left panel of Fig
an indicator of the deviation between two-site observed frequencies and their respective estimated probabilities, as a function of the number of non-zero coupling parameters obtained by the model at a given γ regularization level.We can observe that a weaker regularization results in a more complex model that better fits target observed frequencies.The inset shows coupling parameter values J ij as a function of the corresponding correlations observed in natural sequences for a specific regularization value (γ = 0.045).Only six non-trivial parameters are contained by this specific model, and they correspond to the largest two-site correlations present in the data.
At every regularization level the fitted models were able to generate an ensemble of sequences presenting one-site P i and two-site P ij which accurately reproduced observed one-site and two-site frequencies (R-squared > 0.99).In the right panel of Fig A we can visualize the coefficient of determination (R-squared) for the regression of model-estimated against observed correlation values, as a function of the regularization level employed.As the algorithm allows more couplings to emerge (weaker regularization), models more accurately reproduce two-site correlations present in natural sequences.A similar behavior surfaces for all analyzed genomes (see B). From said figure we may notice that models obtained at a regularization level of γ = 0.025 presented around ten coupling parameters different from zero, and achieved a noticeable reduction in ∆.On the other hand, a leveling-off in ∆ was consistently observed at γ = 0.015.This implies diminishing returns within this range.To better understand the roles and patterns of the fitting parameters {h i } and {J ij }, Fig C shows a graphical summary of the γ = 0.025 model for human donor sequences.The left panel shows a relationship between estimated and observed one-site and two-site frequencies.Correlation is easily observed between natural frequencies and those produced by the fitted model (R-squared > 0.99 for both one-site and two-site probabilities).An inset in this panel shows the relationship between modeled and observed correlations.At a given level of regularization, said correlations were somewhat harder to reproduce (R-squared∼ 0.84).The fitting procedure only considers up to level-two marginal probabilities (i.e.f i and f ij ), and yet we found that the inclusion of a finite set of pairwise couplings J ij allowed this model to reproduce higher-order statistics.
Three-site probabilities (f ijk , gray dots in FigC) were adequately captured in the simulated ensemble of sequences (R-squared = 0.95).Deviations from this curve are due to the indirect influence of two-site interactions, as the simple logarithmic relationship corresponds to a model with independent sites.As two-site coupling parameters emerge, singlesite parameters must adjust in order to preserve single-site statistics while also accomodating two-site observed frequencies.Positive deviations occur whenever the site is involved in negative interactions, and vice-versa.For instance, the single-site parameters h 5 (G) and h 6 (T ) present larger values than would be expected from a single-site model; this is caused by a positive interaction between them, stabilizing consensus G and T bases at the two last intron sites.
The inset in this panel shows a circos graphical representation of the model fitted in this example.The outer ring summarizes single-site statistics: each separate color represents a given site on sequences (cool colors represent six intronic sites, whereas warm colors correspond to the exon).Each site is divided into four boxes: their relative sizes represent the occurrence of {A, C, G, T } bases in single-site sequence statistics.Positive and negative couplings between different site-base occurrences are depicted with green and red lines respectively.

D Model validation
In order to validate our model, a good approach would be to use it to recover a-priori known statistical interactions between site-base occurrences.Unfortunately, no literature studies provide a reliable set of ground-truth interactions available for validation tasks.The following subsection will show the alternative: a simulation-based framework constructed in order to test the ability of our maximum-entropy models to uncover known significant statistical patterns embedded in sequence composition.Afterwards, the next subsection presents a cross-validation analysis as well as an assessment of robustness against noisy data.

D.1 Analysis of simulated ensemble of sequences
An important part of model validation was to corroborate the recovery of known parameters from data.In order to evaluate this, we generated an artificial ensemble of sequences using the parameters obtained by the γ = 0.025 model for metazoan organisms (middle column of Fig 4 in the main manuscript).Said ensemble was produced through a generative process based on Bayesian Networks (BN) in such a way as to maintain compatibility with one-site (36 parameters) and two-site (7 pairwise interactions) statistical patterns.BNs are graphical representations of joint probability distributions that use directed graphs to explicitly encode a set of statistical dependence assertions among random variables.This case concerned 9 random variables, each associated with a position of the donor sequence.
The first row of Fig D shows the undirected graph representing the statistical interactions to be embedded in simulated sequences, along with corresponding one-site marginal probabilities.We can notice a large connected component of seven nodes, and two more isolated nodes.
The BN shown in the middle row was used to generate nucleotidic sequences according to the following Monte Carlo process: First, a base for site −3 was chosen at random following the corresponding one-site probabilities, P (−3).Then, the base for site −2 was chosen according to a conditional one-site probability, P (−2| − 3), that depended on the realization of the random choice that took place at the precedent step for site −3.In this case, if a T nucleotide was obtained at position −3, the probability of generating an A at position −2 was penalized by a factor ϵ to account for the negative interaction between these two sites.This Monte Carlo procedure continued randomly choosing nucleotides for sites −1 and 5 (with probabilities conditioned on the outcome of site −2), sites 4 and 3 (conditioned by the random occurrence that took place at site 5) and then site 6 (which depended on the results obtained at sites −1 and 5).Finally, bases were assigned to the isolated (i.e.statistically independent) nodes at sites 1 and 2, according to their corresponding one-site probabilities.Sequences generated with this procedure were compatible with the joint probability distribution shown by the middle panel figure.With these tools at hand, we simulated an ensemble of 500000 sequences by repeatedly choosing one of these BNs at random and following the associated procedure to sample the corresponding nucleotides.A γ − 0.025 maximum entropy model identified 3 positive and 2 negative coupling parameters that matched the pairwise correlation structure embedded in the simulated sequences.Reducing the regularization level (i.e.considering a γ − 0.01 model) resulted in the identification of 4 positive and 4 negative coupling constants.
To better understand the performance of our model strategy, Fig E shows the ranking of the top 200 pairwise correlations present in the simulated dataset.Positive coupling constants identified by our models are shown in green, while negative ones are displayed in red.Black circles surround those identified by a γ = 0.025 model, while uncircled colored points indicate those identified by the γ = 0.01 model.Small, uncolored circles show correlations with no interaction parameter obtained by the model.The goal of our validation methodology was to recover the seven interactions labeled in black letters.The more strongly regularized model at γ = 0.025 recovered five parameters within this group, and only those parameters.The more weakly regularized model at γ = 0.01 recovered all seven target interactions, but also recovered a greyed-out interaction between the 3A and 5G occurrences.This shows that as regularization is relaxed, our models can accurately and progressively recover statistical information embedded in sequence ensembles. Comparing

D.2 Cross validation
Cross-validation was performed using the human donor sequence ensemble as a basis.Said sequences were placed into ten equally-sized bins according to their appearance frequency within the genome.We then randomly sampled 10% of sequences from each bin in order to construct a representative validation set.Using the remaining 90% of sequences as a training set, a model was fitted using a regularization level of γ = 0.025.Fig F displays the relationship between the frequency of validation sequences and their estimated energy values, E 90% , calculated using parameters fitted by the training dataset exclusively.An inset shows the relationship between E 90% and E 100% , the sequence energies when using parameters fitted to a complete dataset.We can observe that estimated energies, E 90% , for unseen validation sequences closely follows the decreasing relationship expected in regard to appearance frequency.As the inset shows, energy calculation is largely unaffected when the model is trained with incomplete data.

D.3 Robustness against noise
In order to generate a progressively noisy dataset, our basis was the γ = 0.025 generative model obtained for H. sapiens, containing 11 non-null two-site interaction parameters J ij (s i , s j ).This was used to generate an ensemble of 100000 9nt sequences.Five different additional ensembles were constructed by producing, on average, a random point mutation each 100, 20, 10, 5, and 2 sequences respectively.Each noisy ensemble was used to fit a separate model, and said models were assessed by comparing the newly obtained parameters with those originally found for H. sapiens.were incrementally degraded by increasing noise levels.It is evident that the pairwise correlations for -1G:4A and 4C:5G had the lowest values in the original dataset and were particularly sensitive to said noise.These signals severely weakened in the noisy sequence ensembles: this caused the corresponding parameters to not appear in their regularized fitted models.
In conclusion, based on our analyses of gold-standard retrieval, cross-validation, and robustness, we believe that our modeling approach effectively, accurately and robustly uncovers pairwise statistical couplings embedded in the training datasets.

E Analysis of 3'ss
Each annotated human donor splicing site has its counterpart: an acceptor site on the other side of its corresponding intron.In order to analyze possible statistical interactions between the donor and acceptor site, we concatenated each 9-nt donor sequence with a 13-nt long sequence from its corresponding acceptor site (12 intronic and 1 exonic sites).We thus constructed 22-nt long sequences to analyze through our maximum entropy model fitting strategy, training a γ = 0.025 model with this ensemble.Through its obtained parameters were able to characterize interactions within acceptor and donor sites, and a lack of interactions between them.
The circos picture included in Fig H summarizes the detected two-site coupling patterns for this extended dataset.Our analysis did not find any statistical evidence supporting a connection between 5'ss donor sites and 3'ss acceptor sites.The detected coupling pattern within donor sites was almost identical to those originally reported in the manuscript, showing further consistency in our approach.These findings validate the strategy used to uncover biologically relevant statistical patterns from 5'ss.Two-site coupling patterns detected for the ensemble of extended splicing sequences composed by 9 (3 exonic and 6 intronic) donor sites, followed by 13 (12 intronic and 1 exonic) acceptor sites.Each box represents the observed frequency of a given nucleotide at a given site.Exonic and intronic sites were colored using warm and cold color palettes respectively.Red and green curves represent negative and positive interactions between site-base combinations respectively.The gray line visually separates donor and acceptor sites, and shows that no interactions cross between the 5'ss and 3'ss regions.

F Energy landscape
To further characterize the energy estimated for donor sequences, we analyzed the γ = 0.025 model for human donor sequences through a network graph approach.Representing individual sequences with nodes, two were connected if and only if their sequences only differed by one mutation (i.e. a Hamming distance of 1).These edges were directed in a single way, going from the higherenergy node to its lower-energy counterpart.This procedure constructed an energy landscape in the shape of a network, where one sequence could lead towards another, but only by existing at a distance of one mutation and by undergoing an energetically convenient transition.
In this directed network, we found 243 sequences corresponding to local energy minima; this is to say, these nodes displayed no outwards edges (i.e.null out-degree), and thus were terminal states for all transitions.For each one of these attractor-nodes we estimated the inward connected component, i.e. the set of nodes that could reach this given local minimum-energy node in a finite number of steps.These sets of nodes describe the extent of the basin of attraction of a given attractor sequence.
The first violin in Fig I corresponds to a huge main basis of attraction, which contained 9593 sequences.This number represented 80% of the complete set of 5' exon-intron boundaries of the H. sapiens genome, and 96% of the entire set of GT 5'ss.The attractor's minimum-energy sequence ⃗ S * = {C, A, G, G, T, A, A, G, T } is the perfect matching sequence of U1 smRNA and presents an energy value E d ( ⃗ S * ) = 3.5.We can see from the figure that the basins of attraction of the rest of the local minima presented much higher energy values in far shallower energy wells.Among these secondary local minima, there were only 14 attractors presenting GT as first intron nucleotides.
These results suggest that, in terms of the modelled data-driven energy, our approach presents a well-defined global minimum state laying inside a wide energy well.

G Fowlkes-Mallows analysis
Given two different partitions of n objects, and being m i,j the number of common elements between the i-th and j-th clusters of the first and second partitions respectively, the Fowlkes-Mallows index is defined as: and 0 ≤ B k ≤ 1.A higher value for the Fowlkes-Mallows index indicates a greater similarity between the clusters of the two compared partitions.
K-cut partitions were produced for phylogenetics and P ij -induced dendrograms.For each k value the Fowlkes-Mallows clustering similarity index, B k , was estimated along with 10000 null-model values obtained by label shuffling.
As can be seen from the figure, Fowlkes-Mallows indices were found to be statistically significant (p v < 10 −4 ) for almost the entire range of k-cluster groups (2 < k < 29).

FigA
Fig A Role of regularization.Left panel: maximum absolute deviation between observed and estimated two site probabilities, ∆ = max[abs(f ij −P ij )],as a function of the number of non-zero coupling constants identified at different regularization levels (different point shape, γ ∈ [0.01, 0.05]) for human donor sequence models.Inset: coupling constants J ij (s i , s j ) as a function of observed two-site correlations σ nat i,j for the γ = 0.045 model.Right panel: r 2 correlation coefficient between modeled and observed two-site correlations as a function of the regularization parameter γ.Insets: relationship between modeled, σ model

FigB
Fig B Model convergence.Maximum absolute deviation between observed and estimated two site probabilities, ∆ = max[abs(f ij − P ij )], as a function of the number of the non-zero coupling parameters identified at different regularization levels (different colors, γ ∈ [0.01, 0.05]).Results for different analyzed organisms are depicted with different point shapes (see legend).Shaded symbols highlight γ = 0.025 (blue) and γ = 0.015 (red) models respectively.

FigC
Fig C Model frequencies and fitted parameters at γ = 0.025.Left panel: one-site, two-site and three-site model estimated frequencies (p i , p ij , p ijk ) as a function of their corresponding observed frequencies (f i , f ij , f ijk ), depicted as red, black and gray dots respectively.Inset: relationship between modelestimated, σ model ij , and observed, σ nat ij two-site correlations.Right panel: h i (s i ) fitting values as a function of their observed one-site frequencies, f i (s i ).The continuous gray line depicts the logarithmic relationship expected by the independent site model, h indep i(s i ) ∼ log(f i (s i )).Inset: circos representation of fitted model.A ring of 36 boxes shows nine sites in different colors, and four boxes within each to represent the four possible nucleotide occurrences A, C, G, T .Warm colors were used for the three exonic sites whereas cold colors for intronic ones.The area of each box is proportional to the observed nucleotide-site probability f i (s i ).Positive and negative couplings between sitebase occurrences are depicted with green and red lines respectively.

FigD
Fig D Bayesian network-based generative procedure.The embedded interaction pattern used to validate our model is shown in the first row.Green and red arcs correspond to positive and negative interactions respectively.In the middle and bottom rows we show Bayesian networks used to generate sequences starting from sites −3 and −1 respectively.
The bottom panel of Fig Ddisplaysanother BN that could have also been considered to generate random sequences.In this case the Monte Carlo procedure begins at site -1 and then sequentially follows the directed acyclic graph shown in the figure.Overall, there existed seven different BNs that could be used to generate random sequences, all of them compatible with the interaction pattern represented in the first row of the figure.Each one of them starts from one of the seven different nodes belonging to the largest component of the interaction graph.
Fig E with the top panel of Fig D we may verify that our simulation strategy successfully introduced the desired correlation patterns in the sequence ensemble, while the model validation strategy successfully recovered the pairwise correlation structure embedded in the data.

FigE
Fig E Gold-standard interaction recovery.Top 200 ranked pairwise correlation values for simulated sequences, ranked in order of absolute magnitude.Black text labels are used to point out interactions explicitly embedded within our simulated ensemble, while a greyed out interaction between 3A and 5G occurrences was not explicitly placed within the data, but nonetheless obtained by a weakly regularized model.Five correct parameters recovered by a γ = 0.025 model are shown in circles, while those recovered by a γ = 0.010 are shown uncircled.Green circles show recovered interaction parameters of a positive nature, while red shows those which are negative.Uncolored circles denote correlations whose interaction parameters were proposed by neither model.

FigF
Fig F Sequence energy consistency despite incomplete training.Energy (E 90% ) violin plots for validation sequences sorted into equally populated frequency bins.The inset displays the relation between E 100% and E 90% energy values estimated for validation sequences when using parameters from models trained on full data and on the 90% training dataset respectively.

FigG
Fig G Parameter fitting through noisy data.Coupling parameters (left panel) and correlation values (right panel) estimated for different noisy ensembles as a function of the corresponding values obtained using the original H. sapiens dataset. Correlations

FigH
Fig H Joint statistical pattern including donor and acceptor sites.Two-site coupling patterns detected for the ensemble of extended splicing sequences composed by 9 (3 exonic and 6 intronic) donor sites, followed by 13 (12 intronic and 1 exonic) acceptor sites.Each box represents the observed frequency of a given nucleotide at a given site.Exonic and intronic sites were colored using warm and cold color palettes respectively.Red and green curves represent negative and positive interactions between site-base combinations respectively.The gray line visually separates donor and acceptor sites, and shows that no interactions cross between the 5'ss and 3'ss regions.

FigI
Fig I Energy landscape characterized through basins.Violin plots for the energy distribution of sequences belonging to different basins of attraction.Each violin plot shows the lowest and highest energies within the corresponding basin and summarizes the overall energy distribution.The first 15 violins correspond to basins whose minimum-energy sequence presents a GT-dinucleotide where common, at the start of the intron.The two last violins correspond to a cumulative set of non-GT attractor sequence basins, and another set of sequences that did not belong to the giant component of the graph at all.Red dots highlight the attractor (i.e.minimal) energy of each basin.The corresponding sequence is included as an x-axis label.The top label shows the number of sequences contained in each basin.

FigJ
Fig J Fowlkes-Mallows analysis.The dotted black line corresponds to observed B k values as a function of the number of clusters k.The expected values under the null hypothesis of no relation between clusterings are shown as a dashed black line.The full red line depicts the 95% one-side rejection region based on the asymptotic distribution of B k values.All calculations were performed using the Bk plot function implemented within the R package dendextend [2].
This behavior is demonstrated in Fig A (specifically for the case of Homo sapiens.