Information Thermodynamics of Cytosine DNA Methylation

Cytosine DNA methylation (CDM) is a stable epigenetic modification to the genome and a widespread regulatory process in living organisms that involves multicomponent molecular machines. Genome-wide cytosine methylation patterning participates in the epigenetic reprogramming of a cell, suggesting that the biological information contained within methylation positions may be amenable to decoding. Adaptation to a new cellular or organismal environment also implies the potential for genome-wide redistribution of CDM changes that will ensure the stability of DNA molecules. This raises the question of whether or not we would be able to sort out the regulatory methylation signals from the CDM background (“noise”) induced by thermal fluctuations. Here, we propose a novel statistical and information thermodynamic description of the CDM changes to address the last question. The physical basis of our statistical mechanical model was evaluated in two respects: 1) the adherence to Landauer’s principle, according to which molecular machines must dissipate a minimum energy ε = kBT ln2 at each logic operation, where kB is the Boltzmann constant, and T is the absolute temperature and 2) whether or not the binary stretch of methylation marks on the DNA molecule comprise a language of sorts, properly constrained by thermodynamic principles. The study was performed for genome-wide methylation data from 152 ecotypes and 40 trans-generational variations of Arabidopsis thaliana and 93 human tissues. The DNA persistence length, a basic mechanical property altered by CDM, was estimated with values from 39 to 66.9 nm. Classical methylome analysis can be retrieved by applying information thermodynamic modelling, which is able to discriminate signal from noise. Our finding suggests that the CDM signal comprises a language scheme properly constrained by molecular thermodynamic principles, which is part of an epigenomic communication system that obeys the same thermodynamic rules as do current human communication systems.

I processed by the activity of methyltransferases and demethylases. To estimate the amount of information associated with methylation changes, a methylome is split to N genomic regions of length l , and information R I is computed according to Eq. 3 in each region R .
The general assumptions for the model are: 2) Methyltransferase/demethylase activities at different genomic regions are independent of one another. In addition, kinetic parameters and mechanism of enzymatic reaction catalyzed by methyltransferases are assumed to be consistent across different genomic regions.
3) Cytosine DNA methylation (CDM) changes induced by thermal fluctuations in non-overlapping genomic regions are independent for all the genomic regions. 4) There is a large, but finite, range of possible values of energy dissipation and any amount of energy

Derivation assuming a Binomial process
We assume that the dissipation of each particular value of energy R E follows a binomial process. In consequence, if the energy R E associated to the CDM changes induced by thermal fluctuations is consistent with a binomial process (assumption 3 and 4), then we can distinguish these CDM changes from those originating by the regulatory methylation machinery (the biological signal), because it is well known that the latter are not independent for all genomic regions. Under this assumption, the probability that a particular value of Next, a natural statistical mechanical assumption considers the frequencies   , where  is a scaling parameter. The Boltzmann factor, reveals the relative probability of a particular arrangement (with a given energy). The experimental data confirm the last assumption (see below Fig.1 and 2).
(1), where Nq is the expected number of times that an amount of energy where   l c is a constant parameter that depends on the genomic region size l . Equation 2 . Thus, it can be assumed that

Derivation assuming a Poisson process
We assume the CDM changes induced by thermal fluctuation follow a Poisson process. Since Poisson is a limiting case of binomial process, the former inherits properties of independence from the underlying binomial process. That is, given a Poisson process, the probability that the value of energy R E can be dissipated exactly n times in N genomic regions is given by the binomial distribution: . The binomial distribution can be analyzed as a function of the expected number of times that . Then, as the number of genomic regions becomes and the rest of the reasoning to derive Eq. 5 follows as presented in the previous section.

Testing distribution of data
To expedite testing of the distribution of the R I data, we have provided a homemade R script for a function called "fitCDF". This function requires previous installation of the R packages "minpack.lm" and "numDeriv". We provide two files of data to illustrate our analyses: 1) "Four_ecotypes_CGs_IG_2000bp.RData" and 2) "Four_ecotypes_CGs_IG_5000bp.RData", which contain GRanges objects (created with R package "GenomicRanges") with the partition of four Arabidopsis methylomes (four ecotypes) into non-overlapping windows of 2000 and 5000 bp, respectively. A small R script to visualize these data is given below.
The variable carried by the GRanges object is called "IG". Herein, we'll write an example with the Arabidopsis ecotype "Seattle_0".
Depending on machine computational power, the fit of Generalized Gamma (GG) distribution will vary in duration. Following return of plots, a list object with the following values is provided:  aic: Akaike information criterion  fit: list of results of fitted distribution, with parameter values  bestfit: the best fit distribution according to AIC  fitted: fitted values from the best fit  rstudent: studentized residuals  residuals: residuals The first plot corresponds to the best model according to Akaike information criterion (AIC), which in the current case is GG distribution (Fig. 5). The main problem with the fitting is that the scale parameter does not make physical sense, since it approaches zero.
From the experimental data shown in Figs 1 to 2, we know that the scale parameter differs from zero and increases with size of the genomic region. Thus, although the AIC indicates that this is the best model, it is discarded from a physical standpoint. This outcome is an artifact of the numerical fitting algorithm. The second best model is the 3-parameters Weibull distribution (Fig. 6).  Figure 5. The best fit model according to AIC for R I data from "Seattle_0". This model is discarded since the scale parameter is very close to zero.
To get the list of all fitted results, we can type: fit$fit. For the fitting result of the 3-parameter Weibull distribution we can type: About Q-Q norm and Kolmogorov-Smirnov goodness of fit problems for large data sets In Figs. 5 to 6, it appears that the expected normal distribution for the Studentized residuals derived from the non-linear fit of R I has some problems. However, the problem is neither in the experimental data nor in the non-linear fit, but in the Q-Q norm plot when the size of the dataset is large enough. An analogous issue is found for Kolmogorov-Smirnov goodness of fit.
We supplied an R function "qq.weibull" to illustrate the issue by simulation. This function generates random numbers according to a specified Weibull distribution and then a small white-noise is added by using the R-base function "jitter" (see the details of this function in R by typing ?jitter).

 Example 1:
This piece of script will yield Fig. 7  The QQ and PP plots, as well as the linear regression analysis "theoretical probabilities" versus "empirical probabilities" and Kolmogorov-Smirnov test, tell us that this fit is okay.
However, we can see that the QQ-norm plot is reflecting a small issue (Fig. 7).

 Example 2:
The increment of the data size, retaining the same parameter setting, eliminates the problem: This sample size is consistent with the partition of Arabidopsis methylome into nonoverlapping windows of 5000 bp (see the GRanges object above). Results are presented in