A maximum-entropy model for predicting chromatin contacts

The packaging of DNA inside a nucleus shows complex structure stabilized by a host of DNA-bound factors. Both the distribution of these factors and the contacts between different genomic locations of the DNA can now be measured on a genome-wide scale. This has advanced the development of models aimed at predicting the conformation of DNA given only the locations of bound factors—the chromatin folding problem. Here we present a maximum-entropy model that is able to predict a contact map representation of structure given a sequence of bound factors. Non-local effects due to the sequence neighborhood around contacting sites are found to be important for making accurate predictions. Lastly, we show that the model can be used to infer a sequence of bound factors given only a measurement of structure. This opens up the possibility for efficiently predicting sequence regions that may play a role in generating cell-type specific structural differences.


First-order maximum-entropy model
Here we present a first-order maximum-entropy model constrained only to reproduce the one-spin statistics $\langle \sigma_k \rangle$ of the experimental distributions of neighborhoods $\vec{\sigma}$. As with the second-order model presented in the main text, the first-order maximum-entropy distribution can be derived using the method of Lagrange multipliers [1][2][3] and has the form

$$P(\vec{\sigma}\,|\,\cdot) = \frac{1}{Z(\cdot)} \exp\left(\sum_k h_k \sigma_k\right), \qquad (1)$$

where the $h_k$ are Lagrange multipliers that constitute the fitting parameters of the model. The partition function $Z(\cdot)$ is obtained by summing the numerator over all possible neighborhoods $\vec{\sigma}$. (In the above, we use "$|\cdot$" to summarize that we fit two different conditions, namely $P(\vec{\sigma}|c,d)$ and $P(\vec{\sigma}|d)$.) Noting that $Z(\cdot) = \prod_k \left(e^{h_k} + e^{-h_k}\right)$, we can rewrite Eq. 1 as a product of terms that each depend on a single $k$,

$$P(\vec{\sigma}\,|\,\cdot) = \prod_k \frac{e^{h_k \sigma_k}}{e^{h_k} + e^{-h_k}}. \qquad (2)$$

The terms in the product are individually normalized to unity, and thus $P(\vec{\sigma}|\cdot)$ simply becomes the product of the independent probabilities for each $\sigma_k$ in $\vec{\sigma}$,

$$P(\vec{\sigma}\,|\,\cdot) = \prod_k P(\sigma_k|\cdot), \qquad P(\sigma_k|\cdot) = \frac{e^{h_k \sigma_k}}{2\cosh h_k}. \qquad (3)$$

The value of $h_k$ that matches the statistic $\langle \sigma_k \rangle$ can be found by solving

$$\langle \sigma_k \rangle = \tanh h_k, \qquad (4)$$

which gives $h_k = \operatorname{arctanh}\langle \sigma_k \rangle = \frac{1}{2}\ln\frac{1+\langle \sigma_k \rangle}{1-\langle \sigma_k \rangle}$. S4 Fig.(A,B) shows that the distributions calculated from Eq. 3 successfully predicted the experimental statistics $\langle \sigma_k \rangle$ on the test data. However, this first-order maximum-entropy distribution failed to capture both the second-order (S4 Fig.(C,D)) and third-order (S4 Fig.(E,F)) statistics, which were not incorporated into the fit. In S4 Fig.(G and H) we also see that the predicted sequence neighborhood probabilities $P(\vec{\sigma}|\cdot)$ from the model did not agree with their frequencies in the data.
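The closed-form fit of Eq. 4 can be checked numerically. The sketch below, assuming numpy and purely synthetic $\pm 1$ neighborhoods, fits $h_k = \operatorname{arctanh}\langle \sigma_k \rangle$ and verifies that the resulting independent-spin model reproduces the one-spin statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary neighborhoods (values in {-1, +1}), one row per sample.
# In the paper each row would be a chromatin-factor neighborhood sigma.
samples = rng.choice([-1, 1], size=(5000, 8), p=[0.69, 0.31])

# First-order fit (Eq. 4): h_k = arctanh(<sigma_k>).
mean_sigma = samples.mean(axis=0)
h = np.arctanh(mean_sigma)

# Under the fitted model (Eq. 3), P(sigma_k = 1) = e^{h_k} / (2 cosh h_k),
# so the predicted magnetization 2*P(sigma_k = 1) - 1 equals tanh(h_k).
p_up = np.exp(h) / (2 * np.cosh(h))
assert np.allclose(np.tanh(h), mean_sigma)
assert np.allclose(2 * p_up - 1, mean_sigma)
```

By construction the fit matches the one-spin statistics exactly; it is the higher-order statistics (S4 Fig.(C-F)) that such a model cannot reproduce.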
We thus conclude that the first-order maximum-entropy distribution in Eq. 3 is not an adequate description of the experimental distributions of $\vec{\sigma}$, and that higher-order moments need to be included as constraints.

Inspection of model parameters
For each distance of contact $d = |j - i|$ we obtained two sets of parameters for $P(\vec{\sigma}|\cdot)$: one describing the probability of a given neighborhood $\vec{\sigma}$ given contact between $i$ and $j$, $P(\vec{\sigma}|c,d)$, and another describing the probability of a given neighborhood regardless of contact (which we term here the background), $P(\vec{\sigma}|d)$. S5 Fig.A shows that the entropy $S$ of the distribution of sequences regardless of contact was greater than the entropy of the distribution of sequences given contact at short distances ($d < 330$ Kbp), while the opposite held at longer distances. In particular, the plot displays the average number of neighborhood configurations encoded in the probability distributions, $2^S$. S5 Fig.B displays the Kullback-Leibler (K-L) divergence between neighborhoods given contact and background neighborhoods at each distance $d$. This divergence can be interpreted as a distance between the two neighborhood probability distributions, or as the information gained when including information about contacts in the background distribution of neighborhoods, i.e., when going from $P(\vec{\sigma}|d)$ to $P(\vec{\sigma}|c,d)$. This quantity diminished with distance and saturated at its lowest value at a distance of $\sim 300$ Kbp. We then explored the similarity between the distributions of contacting neighborhoods at different distances. Specifically, we subtracted the background parameters from those of the contacting distributions (thereby correcting for background sequence effects),

$$\Delta h_k^d = h_k^{c,d} - h_k^d, \qquad \Delta J_{kl}^d = J_{kl}^{c,d} - J_{kl}^d,$$

since these are the parameters involved in the distance-normalized contacts, as can be derived from Eqs. 1 and 2 in the main text:

$$\frac{P(c|\vec{\sigma},d)}{P(c|d)} = \frac{P(\vec{\sigma}|c,d)}{P(\vec{\sigma}|d)} = \frac{Z_d}{Z_{c,d}} \exp\left(\sum_k \Delta h_k^d \sigma_k + \sum_{k<l} \Delta J_{kl}^d \sigma_k \sigma_l\right).$$

For every distance of contact, the parameters $\Delta h_k^d$ and $\Delta J_{kl}^d$ were concatenated into a vector, and the set of vectors corresponding to all distances was then clustered into two groups with K-means [5]. The coefficient vectors naturally separated at a contact distance of 390 Kbp (S5 Fig.D). In S5 Fig.E and S5 Fig.F we show the average energetic coefficients of enrichment of the two K-means clusters, which differ in both $\Delta h_k$ and $\Delta J_{kl}$.
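The entropy and K-L divergence quantities behind S5 Fig.(A,B) are straightforward to compute from any fitted neighborhood distribution. A minimal sketch, assuming numpy and using toy distributions over all $2^3$ neighborhoods of length 3 (the numbers are illustrative, not from the paper):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; 2**entropy_bits(p) is the effective
    number of neighborhood configurations encoded by p (S5 Fig.A)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def kl_bits(p, q):
    """K-L divergence D(p || q) in bits: the information gained when
    moving from the background q to the contact distribution p (S5 Fig.B)."""
    mask = p > 0
    return (p[mask] * np.log2(p[mask] / q[mask])).sum()

# Toy stand-ins for P(sigma|c,d) and the background P(sigma|d).
p_contact = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])
p_background = np.full(8, 1 / 8)

print(2 ** entropy_bits(p_background))   # uniform background: 8 configurations
print(kl_bits(p_contact, p_background))  # positive: contact is informative
```

For the uniform background the effective number of configurations is exactly $2^3 = 8$, and the divergence shrinks toward zero as the contact distribution approaches the background, matching the saturation seen at large $d$.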
In addition, we applied Principal Component Analysis (PCA) [6,7] to the same set of coefficient vectors, and the first principal component (PC1) cleanly separated the same clusters previously found by K-means, with the boundary at a contact distance of 390 Kbp (S5 Fig.G). PC1 has positive scores for the shorter contact distances ($d < 390$ Kbp) and negative scores for the longer ones ($d \geq 390$ Kbp). By projecting the coefficient vectors onto PC1, we therefore see how the short-distance cluster differs from the long-distance cluster (S5 Fig.H): the difference corresponds to an increase of ferromagnetic interactions between the sites situated inside the loop.
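The PCA step can be sketched with plain numpy via an SVD of the centered coefficient matrix. The data below are synthetic stand-ins for the concatenated $(\Delta h, \Delta J)$ vectors, built with a deliberate mean shift between a "short-distance" and a "long-distance" group so that PC1 separates them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy coefficient vectors: one row per contact distance, 6 coefficients each,
# with opposite mean shifts for the two (hypothetical) distance regimes.
short = rng.normal(loc=+1.0, scale=0.3, size=(10, 6))
long_ = rng.normal(loc=-1.0, scale=0.3, size=(10, 6))
X = np.vstack([short, long_])

# PCA via SVD of the centered data; PC1 is the top right-singular vector.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]

# The sign of the PC1 score separates the two groups (overall sign of a
# principal component is arbitrary, so only the split is meaningful).
print(np.sign(scores[:10]), np.sign(scores[10:]))
```

With a separation this strong, all rows of one group project to one sign of PC1 and all rows of the other group to the opposite sign, mirroring the positive/negative score split at 390 Kbp described above.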

Predicting sequence given structure
Given fitted maximum entropy distributions, P ( σ|c, d) and P ( σ|d), over a range of genomic distances d, we now work out how to solve the inverse problem, namely finding the probability of a genomic site k being in a particular binary state σ k , given only structural data from a set of Hi-C counts {n ij }. We denote this probability by P (σ k |{n ij }), where {n ij } is the set of counts between all (i, j) pairs of sites considered to be neighbors of k in our model.
As in the main text, Bayes' theorem gives

$$P(\sigma_k|\{n_{ij}\}) = \frac{P(\{n_{ij}\}|\sigma_k)\, P(\sigma_k)}{P(\{n_{ij}\})}, \qquad (10)$$

where $P(\{n_{ij}\}|\sigma_k) = \prod_{ij} P(n_{ij}|\sigma_k, d)$. Here $P(n_{ij}|\sigma_k, d)$ is the probability of observing exactly $n_{ij}$ contact counts between a pair of sites a distance $d = |j - i|$ apart, given that the genomic site $k$ in their sequence neighborhood is in a particular state $\sigma_k$. (Note that $d$ is a redundant variable whenever $i$ and $j$ are specified, i.e. $P(n_{ij}|\sigma_k) = P(n_{ij}|\sigma_k, d)$; we nevertheless introduce it here for consistency with the rest of our notation.) $P(\sigma_k)$ is the prior on $\sigma_k$, the probability for site $k$ to be in one of the two states (here we take it to be constant over the genome, with the same value as measured in the training set, $P(\sigma_k = 1) = 0.31$). $P(\{n_{ij}\})$ is simply a normalization constant and is found by summing the numerator over $\sigma_k$. Rewriting Eq. (10), we have

$$P(\sigma_k|\{n_{ij}\}) = \frac{P(\sigma_k)}{P(\{n_{ij}\})} \prod_{ij} P(n_{ij}|\sigma_k, d). \qquad (11)$$

Using $k'$ to label the position that the genomic site $k$ takes in the particular neighborhood of $(i,j)$, $\vec{\sigma} = \{\sigma_1, \cdots, \sigma_{k'}, \cdots, \sigma_N\}$, and considering $P(\sigma_{k'}|d) = P(\sigma_k)$, we then rewrite $P(n_{ij}|\sigma_k, d)$ as

$$P(n_{ij}|\sigma_k, d) = \frac{1}{P(\sigma_k)} \sum_{\vec{\sigma}} \delta_{\sigma_{k'}, \sigma_k}\, P(n_{ij}|\vec{\sigma}, d)\, P(\vec{\sigma}|d), \qquad (12)$$

where the Kronecker delta $\delta_{\sigma_{k'}, \sigma_k}$ ensures that we sum over all possible sequences of the neighborhood $\vec{\sigma}$ that have genomic site $k$ held fixed in the particular state $\sigma_k$. Next, by combining Eqs. 11 and 12, we obtain

$$P(\sigma_k|\{n_{ij}\}) = \frac{P(\sigma_k)^{1-M}}{P(\{n_{ij}\})} \prod_{ij} \sum_{\vec{\sigma}} \delta_{\sigma_{k'}, \sigma_k}\, P(n_{ij}|\vec{\sigma}, d)\, P(\vec{\sigma}|d), \qquad (13)$$

where $M$ is the number of $(i,j)$ pairs whose sequence neighborhoods contain genomic site $k$. The distribution $P(\vec{\sigma}|d)$ is taken to be the fitted maximum-entropy distribution at distance $d$. We assume that the probability of observing $n_{ij}$ Hi-C counts given a sequence state $\vec{\sigma}$, $P(n_{ij}|\vec{\sigma}, d)$, is a Gaussian distribution $\mathcal{N}(\lambda_{\vec{\sigma},d}, \zeta^2_{\vec{\sigma},d})$ with a mean number of counts $\lambda_{\vec{\sigma},d}$ proportional to the fitted probability of contact for the given sequence neighborhood $\vec{\sigma}$,

$$\lambda_{\vec{\sigma},d} = K\, P(c|\vec{\sigma}, d), \qquad (14)$$

where $K$ is a constant that depends on experimental details such as the number of cells used and the efficiency of contact detection, and $\lambda(c|d) = K P(c|d) = \bar{n}(d)$ is the experimental average of Hi-C counts at a distance $d$.
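Eq. 13 reduces to a small enumeration over neighborhood states once the fitted distributions are in hand. The sketch below, assuming numpy and using randomly generated stand-ins for $P(\vec{\sigma}|d)$, $\lambda_{\vec{\sigma},d}$, and toy Hi-C counts, implements the posterior for a single site $k$:

```python
import numpy as np
from itertools import product

# Toy setup: neighborhoods of N = 3 binary sites; site k sits at
# position k_pos within each (i, j) pair's neighborhood.
N = 3
sigmas = np.array(list(product([-1, 1], repeat=N)))  # all 2**N states

rng = np.random.default_rng(2)
p_sigma = rng.dirichlet(np.ones(2 ** N))   # stand-in for P(sigma|d)
lam = 5 + 10 * rng.random(2 ** N)          # stand-in for K * P(c|sigma,d)
zeta2 = lam ** 2                           # variance model zeta^2 ~ lambda^2

def gauss(n, mu, var):
    return np.exp(-(n - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_sigma_k(counts, k_pos, prior_up=0.31):
    """Eq. 13: returns P(sigma_k = +1 | {n_ij}) for M = len(counts)
    pairs whose neighborhoods all contain site k at position k_pos."""
    prior = {1: prior_up, -1: 1 - prior_up}
    M = len(counts)
    post = {}
    for s in (1, -1):
        mask = sigmas[:, k_pos] == s          # Kronecker delta term
        like = 1.0
        for n in counts:                      # product over (i, j) pairs
            like *= (gauss(n, lam[mask], zeta2[mask]) * p_sigma[mask]).sum()
        post[s] = prior[s] ** (1 - M) * like  # P(sigma_k)^(1-M) factor
    return post[1] / (post[1] + post[-1])     # normalize over sigma_k

p_up = posterior_sigma_k(counts=[6.0, 8.0, 12.0], k_pos=1)
```

A useful sanity check is that with no observed pairs ($M = 0$) the posterior reduces exactly to the prior, $P(\sigma_k = 1) = 0.31$. In a genome-scale implementation the same products would be accumulated in log space to avoid underflow.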
With this and the two fitted maximum-entropy distributions, we can calculate the mean number of counts for a given sequence neighborhood from Eq. 14. The variance $\zeta^2_{\vec{\sigma},d}$ is estimated from the training set as a function of $\lambda$. Specifically, we calculated the rates $\lambda_{ij}$ associated with all Hi-C counts $n_{ij}$ in the training set. Then, for various values of $\lambda$, we collected the Hi-C counts $n_{ij}$ to which our model had assigned a rate $\lambda_{ij}$ between $0.9\lambda$ and $1.1\lambda$. Lastly, we calculated the variance of these rate-associated counts, $\zeta^2(\lambda)$, and fitted a polynomial curve to it (we found $\zeta^2 \approx \lambda^2$).
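The variance calibration just described can be sketched as follows, assuming numpy; the training counts here are synthetic draws constructed with a true $\zeta = \lambda$, so the fitted log-log slope should come out near 2:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy training data: rates lambda_ij and counts whose spread grows with
# the rate, mimicking the overdispersion seen in Hi-C counts.
lam_train = rng.uniform(1, 50, size=20000)
n_train = rng.normal(loc=lam_train, scale=lam_train)  # true zeta^2 = lambda^2

# For each probe rate, collect counts whose assigned rate lies within
# +-10% of it and measure their variance, as described above.
probe = np.array([5.0, 10.0, 20.0, 40.0])
var_at = []
for lam in probe:
    sel = n_train[(lam_train > 0.9 * lam) & (lam_train < 1.1 * lam)]
    var_at.append(sel.var())

# Fit log-variance against log-rate; a slope near 2 recovers zeta^2 ~ lambda^2.
slope, intercept = np.polyfit(np.log(probe), np.log(var_at), 1)
print(slope)
```

Fitting in log-log space turns the power law into a straight line, so the exponent of the variance-rate relation is read off directly as the slope.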
Everything in Eq. 13 is now determined, and so the probability of either sequence state ($\sigma_k = 1$ or $\sigma_k = -1$) at every site $k$ can be calculated given a Hi-C contact map $\{n_{ij}\}$.