A Note on Exact Differences between Beta Distributions in Genomic (Methylation) Studies

We apply a known algorithm for computing exactly inequalities between Beta distributions to assess whether a given position in a genome is differentially methylated across samples. We discuss the advantages brought by the adoption of this solution with respect to two approximations (Fisher's test and Z score). The same formalism presented here can be applied in a similar way to variant calling.


Introduction
Average DNA methylation at a locus can be measured by Whole Genome Bisulfite Sequencing (WGBS), which determines the fraction of DNA strands methylated at any given genomic position in a population of cells (this definition is likely to sound obvious to those who already know about WGBS and too terse to those who don't: a good introduction to this kind of measurements is contained in chapter 11 of [1]). In what follows we will call this fraction h; when we distinguish between different samples we will write h 1 and h 2 . WGBS experiments estimate this numbers by measuring the methylation state of a random ( i.e. selected in some unpredictable way) set of reads sequenced from the sample. Since one can only analyze a finite number of reads per sample the value of h will be known only up to some variability.
In this paper we propose an answer to the basic question : how does one assess whether two cell populations have different methylation levels at a genomic position? Researchers in the field have already dealt with this issue in a variety of ways: for example [2] uses a Fisher's test. In [3] Sun et al. compute a confidence interval for h 1 {h 2 starting from some reasonable choice of a probabilistic model. Bsmooth [4] (which tackles the slightly different problem of defining differently methylated regions as opposed to positions) ultimately relies on a t-test. The authors of [5] use a hierarchical model to estimate the parameters needed for a Gaussian hypothesis test. Here we would like to bring attention to another possible approach, based on properties of the Beta distributions which are explained in [6], [7]. Similarly to e.g. [3] we do not test an hypothesis and output a p-value; rather we compute the probability distribution of the parameter of a Bayesian model.

Beta Distribution to Model Methylation Probabilities
The Beta probability distribution (over h) with parameters a,b is defined by where B is the Beta function Beta h appears very naturally in many studies of genomic data: typically such analyses also entail the comparison between different samples, which in turn means that different Betas have to be combined. Here for concreteness we are describing the case of measuring DNA methylation differences across samples via whole genome bisulfite sequencing but the same concepts apply with almost no change to genotyping. To appreciate how this variability can be quantified, consider a set of reads out of a WGBS experiment covering a certain genomic coordinate x with read depth d. Since not all the strands in the sample being sequenced will, in general, have the same bases methylated at the same time, this will be a collection of heterogeneous reads : some will indicate methylation at position x (these are the so called non converted reads), others (the converted reads) will correspond to molecules that are not methylated. Now, if h were known a priori, the probability of obtaining n non converted reads would be given by a binomial distribution (which is closely related to Beta h ): If one assumes a uniform prior on h, (P(h)~1, Vh[½0,1) the expression for P(hDn) is very similar (The factor 1 dz1 cancels out when applying Bayes' theorem) Therefore, to assess whether a position is differentially methylated across two samples with non converted reads respectively n 1 ,n 2 and read depths d 1 ,d 2 one has to compute The purpose of the software we will discuss in this note is to estimate P(h 1 wh 2 ) given the result of a WGBS experiment.

Exact Computation of Beta Differences
A method for computing which turns out to be efficient enough for our purposes is presented in full detail in [6], [7]. We will summarize its derivation here for the sake of completeness, and advise interested readers to study those papers for a more detailed discussion. We start with some preliminary definitions: let g(a,b,c,d):P(h 1 wh 2 ) where h 1 and h 2 are distributed respectively as Beta h (a,b) and Beta h (c,d).
Besides, we will use the notation I h (a,b) for the cumulative distribution function of the Beta distribution (also known as the incomplete Beta distribution). Now, by definition one has But then, using the identity ( [8]) where h~B (azc,bzd) Furthermore, one can prove that g(a,b,c,d) possesses a number of symmetries. An obvious one is g(a,b,c,d)~1{g(c,d,a,b). Also true are Using (2) and (3) one can design a nice recursive scheme where the base case is provided by g(a,b,a,b)~1 2 (this because if h 1 and h 2 have exactly the same distribution, P(h 1 wh 2 )~1 2 ).

Approximate Computation
Even if methylation data are well modelled by a Beta h , the comparison presented above is never (to our knowledge) used in the literature. As (hopefully fair) representatives of the methods which we have found are used instead, we will analyze the performances of the Fisher's test and that of a test based on a Gaussian approximation.
To do a Fisher's test, one builds a contingency table with the number of non converted and converted reads in the two samples (note that this kind of test breaks down when one of the rows (or columns) of the contingency table is zero). In the Gaussian approximation, one models P(h) for each sample with a Gaussian with the same mean and variance of Beta h ; and then uses the two Gaussians to test for differences between h 1 and h 2 . In both cases we will consider one tailed tests.

Comparison with Approximate Results
We organized the comparison between the exact and approximate solution in two steps. First, we looked at the behaviour of the two tests on a pair of real samples (see below for instructions on how to access the data we used).
The results are shown in figure 1. On the x axis we plotted P(h 1 wh 2 ), on the y axis we plotted the corresponding p-value obtained by approximating the Beta respectively with a Fisher's test (on the left) and with a Gaussian (on the right). We did the comparison over 100000 positions : the plot is in fact a two dimensional histogram, in which different shades of blue indicate how many times the two values fall into a certain region of the plane. There is not much to comment there except to note that, as expected, there is a broad correspondence between the different methods. Also, at such a scale the Beta probabilities seem more similar to the Z score test than to the Fisher's p-values (the right hand side plot looks more like a diagonal).
Next, we simulated a pair of samples whose counts are generated by the same underlying binomial process (i.e. h 1~h2~0 :5) at different coverages. These constitute a negative control, in the sense that none of the methods should report a significant difference between the samples. Furthermore, we generated a pair of samples such that their underlying binomial probabilities are markedly different h 1~0 :9,h 2~0 :5; those are the true positives, i.e. cases for which the tests should detect that h 1 wh 2 . We then compare the receiver operating characteristic (ROC) curves of the three methods for different values of the samples' coverages, d 1 ,d 2 . The results are depicted in figure 2. That plot justifies the usage of the Beta h distribution: the number of false negatives accumulated by the other two methods considered stops them from reaching an high enough true positive rate (even when the threshold for computing it is very permissive). Note, for example, that the blue line is not even visible in the leftmost panels of figure 2. This effect is also shown in figure 3 where we depict the distribution of the outputs for the three methods at read depth d 1~1 0,d 2~1 0.

Differentially Methylated Regions and Effects of Coverage
Using the above concepts, we can compute differentially methylated regions (DMR) along the genome : these are uninterrupted blocks of nucleotides where the two samples have different methylation. One possible technique to find such blocks is to conjoin a number of adjacent nucleotides in a DMR, disregarding their exact methylation probabilities, and to assign hard boundaries. This usually implies that a number of ad hoc rules must be established to control the minimum distance between 2 neighbouring DMRs, the minimum length of a DMR, how to exactly count the intersection of DMRs with annotated regions, and so on and so forth. Using our method, though, one can simply assign to each nucleotide the probability computed by the algorithm presented here; any further analysis can be conducted without imposing arbitrary threshold or boundaries. For instance one can ask what is the average value of this probability over some specific regions (introns, enhancers) with respect to randomly chosen regions of the genome. Often it is not clear a priori what is the correct scale to use when looking at methylation : if this is the case, one can smooth the probability per nucleotide by computing a kernel density estimation at various bandwidths, or simply clump together the values of a number of nearby bases in a single (average) value. Note that smoothing is justified by the fact that methylation levels are correlated in space (the strength and persistence of the correlation is different from sample to sample, reflecting technical and biological variability); in fact as hinted at in [4], analyzing together nearby positions could provide a way of correcting measurement errors.
We would also like to comment on the fact that the different coverage of the samples does have an effect on the estimation of differential methylation. The main idea to understand here is that low coverage means uncertainty: and uncertainty can give rise to results which, while correct, are slightly counterintuitive. For example in figure 4 we show that a sample with low methylation and low coverage can be (maybe, one cannot say for sure) more methylated than a sample with high, certain methylation. The right panel of the same figure suggests that a good way of filtering for certainty is to select positions with low estimated variability (rather than to select based on read counts): this is because the same read depth can correspond to different variances depending on how many reads are non converted or converted.
Finally, once one has the estimates for h 1 and h 2 (as obtained via the ratio of unconverted reads over the coverage) and P(h 1 {h 2 w0) ( i.e. the output of the algorithm expalined in this paper) one can take an informed decision on a locus, keeping into account both the size of the difference in methylation and its variability.

Implementation and Data Availability
The algorithm described above is implemented in a C program, called methyl_diff, available from the Github page of one of the authors : http://emanueleraineri.github.io/. The program takes as input (from stdin) four integers, i.e. the number of non converted and converted reads for the first and the second sample respectively, and prints P(h 1 wh 2 ) on the stdout. It takes 3:3s to process 10 5 lines on off-the-shelf hardware (MacBookPro with Intel i7@2.66 GHz). Note that the data used to produce figure 1 are publicly available (they were generated for BLUEPRINT, a consortium, studying epigenetic marks in immune system cells.) in at least two ways (also corresponding to two different formats): 1. First of all, they can be downloaded from the same web page where the source code of our implementation is stored. The file G199.G202 contains the methylation levels of 100000 random positions from the chromosome 1 of samples G199 and G202 Exact Differences between Betas in Methylation PLOS ONE | www.plosone.org (first we determined which positions had been sequenced in both samples; then we extracted a random subset of those). One can feed columns 6,7,10,11 directly to the methyl_diff executable (those columns are the unconverted, converted reads from the two samples). 2. Secondly, they can be downloaded from the BLUEPRINT project ftp site ftp.ebi.ac.uk/pub/databases/blueprint/data/ homo_sapiens/.