Evaluating the Stability and Flexibility of DNA Methylation Patterns from Stem to Differentiated Cells

DNA methylation has been studied extensively in many developmental systems. Little attention, however, has been given to identifying methylation features that distinguish loci whose patterns are in flux in a given cell lineage from those whose patterns are stable. Here, we develop a new metric, the Ratio of Concordance Preference (RCP), to quantify and compare epigenetic flexibility and stability across loci, cell types, and developmental stages, without assuming any specific biochemical mechanisms. We apply RCP to double-stranded DNA methylation data from human and murine cells and conclude that: (i) preference for concordant DNA methylation is reduced but not eliminated in stem relative to differentiated cells; (ii) cellular differentiation is characterized by increasing preference for concordant methylation states; and (iii) while concordance preference remains substantial through embryonic totipotency and early stages of pluripotency, primordial germ cells initially have nearly no preference for concordance, perhaps reflecting the high level of epigenetic flexibility en route to production of gametes. The mechanism-free nature of RCP will enable 1/33 peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/072488 doi: bioRxiv preprint first posted online Aug. 31, 2016;

Organismal development is characterized by a shift from the phenotypic flexibility of 2 embryonic cells to the canalized identities of the lineages that lead to differentiated cells. 3 To achieve stable gene-regulatory states in terminally differentiated cells, organisms 4 ranging from archaea to humans use a variety of epigenetic mechanisms, including DNA 5 methylation. Perturbation of the state of DNA methylation at various loci in 6 differentiated cells is associated with several human cancers [1][2][3]; in turn, restoring 7 epigenetic flexibility of some loci has proven challenging in efforts to produce induced 8 pluripotent stem (iPS) cells [4]. Together, these findings highlight the importance of 9 shifts between epigenetic flexibility and stability in establishing cellular identity. 10 There exists an extensive literature documenting changes in single-locus and 11 genome-wide methylation frequencies at various stages of development [5,6]. Most methylation states on parent and daughter strands, which are separated by exactly one 48 round of DNA replication. RCP requires no assumptions about the enzymatic 49 mechanisms of methylation and demethylation, and so enables comparison across 50 diverse species and developmental stages. In a recent review, Jeltsch and Jurkowska [20] 51 emphasized the balance of methylating and demethylating processes -rather than the 52 propagation of specific individual methylation patterns -as the primary determinant 53 of the overall set of methylation patterns present in a given cellular population at a 54 given time. In this framework, RCP can be thought of as a metric for quantifying the 55 extent to which the patterns produced by a given system of methylating and 56 demethylating processes deviate from the sets of patterns expected if methyl groups are 57 placed entirely at random. 58 In parameterizing RCP, we use the term "conservative", in lieu of "static" as used 59 previously [16], to describe processes that preferentially establish concordant as opposed 60 to discordant methylation states. We consider non-conservative processes, described 61 previously as "dynamic" [16], as having one of two forms: "random" processes that add 62 or remove methyl groups with equal preference for concordance and for discordance, and 63 "dispersive" processes that preferentially establish discordant methylation states.
RCP analysis, we also examine patterns from three recent publications that utilized 70 hairpin-bisulfite PCR of murine DNA [14,15,17]. To improve our understanding of the 71 processes by which stem cells are reprogrammed to give rise to differentiated cells, we 72 ask: (i ) how strong the preferences are for concordant DNA methylation states in 73 cultured and embryonic stem cells; (ii ) whether this preference changes as stem cells 74 shift between undifferentiated and differentiated states; and (iii ) how preference 75 changes in cells during embryonic development. 76

77
Defining the Ratio of Concordance Preference for All Possible 78 Methylation Configurations 79 We have developed the Ratio of Concordance Preference (RCP) to assess the 80 strategy of binary information transfer, with focus on the degree to which exact 81 information is conserved. We apply our RCP framework to DNA methylation in 82 mammalian cells. Our goal is to infer whether and how much the system of processes 83 that established a given set of methylation patterns prefers concordant to discordant 84 methylation states. This general formulation is free of assumptions about the molecular 85 mechanisms whereby methylation is added to and removed from DNA. 86 In our data from double-stranded DNA molecules from human and mouse, 87 methylation occurs principally at the CpG motif. This symmetric motif may be written 88 as CpG/CpG, here termed "CpG dyad". CpG dyads have opportunities for methylation 89 on both strands. The methylation state of a dyad is thus of one of three forms: fully RCP framework can also be extended to non-CpG methylation at symmetric nucleotide 94 motifs. 95 To infer concordance preference for sets of double-stranded methylation patterns, we use the overall frequency of methylation, m, and the frequency of unmethylated dyads, U , of each data set. Because m is derived from the three dyad frequencies, the pair (m, U ) encompasses the full information available from the implicit dyad frequencies, M and H. We evaluate the extent of deviation from expectations under a random model in which the system has no preference for either concordant or discordant placement of methyl groups, using RCP, defined as: We note that the expectations of a random model, for which RCP = 1, are met both 96 by truly random placement of groups, and by equal contributions from processes 97 operating with strong preference for concordance and processes operating with strong 98 preference for discordance. Under this random model, the frequency of unmethylated 99 dyads is given by U = (1 − m) 2 , leading to dyad frequencies as expected under the 100 binomial distribution (Fig 1a-b dashed curve; Fig 1c). A system for which methylation 101 processes are dominated by the activity of the mammalian Dnmt3s, which add methyl 102 4/33 Individual patterns, with methylated and unmethylated cytosines indicated in red and blue, respectively, can reflect (c) a random methylation system; (d) a fully conservative system, with complete preference for generating concordant dyads; (e) a fully dispersive system, with complete preference for generating discordant dyads; or (f) a partially conservative system, with more concordant dyads than expected under random processes, but fewer than expected under fully conservative processes. Systems can also exhibit partial preference for dispersive placement of methyl groups. (g) A representative partial double-stranded DNA methylation pattern collected using hairpin-bisulfite PCR. The experiment-specific batchstamp is shown in green, and can be used to monitor for PCR contamination; the molecule-specific barcode shown in gray, generalized as "DDDDDDD", can be used to identify redundant sequences. The batchstamp and barcode are encoded on the hairpin oligonucleotide used to join the top and bottom strands. Primer-binding sites are underlined at the left end of the molecule. groups principally without regard to the methylation state of the other strand [21], is 103 expected to behave largely in accordance with the random expectation.

104
One set of deviations from the random expectation is characterized by preference for 105 concordant placement of methyl groups, such that the two classes of concordant dyads -106 fully methylated and fully unmethylated -are more frequent than expected under the 107 random model. This situation occurs under conservative systems of methylation where 108 strong contributions from maintenance-like processes, such as the activity of Dnmt1 in 109 mammals [11,13,22], lead to high frequencies of concordant dyads. In the extreme form 110 of this deviation from random, methyl groups are observed only in fully methylated daughter-strand methylation, and perhaps in genomic regions undergoing demethylation 118 during periods of epigenetic transition. When methylation is maximally dispersive, and 119 methylation frequency m is less than 0.5, all dyads with methylation will be 120 hemimethylated (Fig 1e), such that U = 1 − 2m (lower diagonal line in Fig 1a-b); when 121 m is greater than 0.5, not all methyl groups can be accommodated in hemimethylated 122 dyads, and so a combination of hemimethylated and fully methylated dyads -but no 123 unmethylated dyads -is expected (lower horizontal line in Fig 1a).

124
The two extreme deviations from random form the boundaries of the comprehensive 125 space of possible configurations of methylation at symmetric motifs (Fig 1a). Sets of 126 double-stranded methylation patterns are expected to fall on the continuum between 127 the extreme expectations (Fig 1b), and can be located within this space to characterize 128 the processes that gave rise to a given data set, ranging from conservative to dispersive. 129 As noted above, the "Ratio of Concordance Preference" (RCP), for which the 130 mathematical derivation is given in Supporting Information (S1 Text), measures 131 whether and how severely the dyad frequencies of a given set of double-stranded DNA 132 methylation patterns deviates from the assumption that methyl groups are arranged at 133 random on DNA, with fully methylated, hemimethylated, and unmethylated dyads 134 present at the frequencies expected under the binomial distribution. This expectation is 135 analogous to the distribution of genotype frequencies at a two-allele locus in a 136 population that is at Hardy-Weinberg equilibrium [23,24]: RCP 2 is equal to 4M U/H 2 , 137 which is expected to equal 1 at Hardy-Weinberg equilibrium.

138
A system with an RCP value of 1 has no preference for either concordance or 139 discordance of methylation, and follows the random model (Fig 1c). An RCP value of 2 140 indicates two-fold preference for concordance, while an RCP of 1 2 indicates two-fold 141 preference for discordance. RCP approaches infinity for systems that have complete 142 preference for concordant dyads. At the other extreme, RCP approaches 0 (i.e., 1 ∞ ) for 143 systems that have complete preference for discordant dyads. For the examples analyzed 144 here, data for different loci and cells range from complete concordance to near-random 145 positions on the RCP spectrum. Complete discordance is found as a transient condition 146 of adenine methylation at the ori locus in Escherichia coli, and serves to regulate the 147 timing of reinitiation of DNA synthesis [25,26]. Dispersive methylation is also expected 148 to exist ephemerally in the instant between DNA replication and methylation. During 149 this instant, all methyl groups present are expected to be on the parent strand, as 150 groups have yet to be added to the daughter. Thus, a broad spectrum of concordance 151 preference can exist in organisms, and can be quantified and evaluated by RCP.

152
For large and intermediate-size data sets, the resolution of RCP is high across the 153 range of possible methylation frequencies, although the resolution declines as m closely 154 approaches 0 or 1, such that RCP cannot be inferred for completely methylated or 155 unmethylated genomic regions. Nonetheless, RCP can usually be inferred with high 156 confidence using data from only a few hundred dyads. Our new approach therefore 157 requires far fewer sequences to estimate the important parameter of concordance 158 preference than do methods that focus on inferring rates for specific enzyme 159 activities [22,27]. 160 Here, we apply RCP to investigate further the conclusion of Shipony et al. [16] that 161 methylation in cultured stem cells is dominated by non-conservative processes, with 162 little or no preference for concordance. Using double-stranded methylation patterns 163 collected by our group, by Arand et al. [14,17], and by Zhao et al. [15], we assess 164 methylation concordance in cultured human and murine stem cells, as well as in murine 165 Figure 2. Inferring RCP for single-copy loci in differentiated human and murine cells. Methylation in human and murine differentiated cells was consistently inferred to have strong contributions from conservative processes for data sets that span a wide range of methylation frequencies. For each locus, we inferred the point estimate of the m, U pair and the two-dimensional 95% confidence region, determined by uncertainty in each variable. The intensity of coloration in a given part of a confidence region reflects the confidence level (α) at that point. The m, U point estimates for most data sets are indicated with white asterisks, and the corresponding RCP values, corrected for bias, are given in the associated table. A larger, colored asterisk is used when the confidence interval of a data set is too small to be readily visible. (a) Data from three human loci -G6PD, FMR1, and LEP -and one murine locus -Lep -collected using hairpin-bisulfite PCR and dideoxy sequencing, are taken from published [22,[27][28][29] and previously unpublished (Table S3) data. (b) Data from four single-copy loci from murine embryonic fibroblasts -Afp, Igf2, Snrpn, Tex13 -collected by Arand et al. [14] using hairpin-bisulfite PCR and pyrosequencing; relevant values are given in Table S4. Information on the regions of each locus for relevant methylation data are given in Tables S3 and S4. All data were corrected for inappropriate and failed conversion of methylcytosine using methods described in Supporting Information (S4 Text). Conversion-error rate estimates and 95% confidence intervals on RCP are given in Tables S3 and S4. cells undergoing early developmental transitions that give rise to totipotent embryonic 166 cells.

167
Inferring RCP for Single-Copy Loci in Differentiated Cells 168 We first examined methylation concordance at single-copy loci in differentiated cells, 169 including those from human adipose, leukocytes, and cultured fibroblasts, and from 170 murine somatic tissues. Our previous work with single-copy loci in differentiated cells 171 has revealed a substantial role for maintenance methylation, a conservative process, 172 with a comparatively minor role for non-conservative de novo processes [22,27]. We 173 therefore anticipated that RCP analysis would indicate substantial preference for  Table S1; 176 see Table S3 for 95% CI on these point estimates), reflecting, as anticipated, a 177 substantial role for conservative processes. It is noteworthy that the RCP values 178 inferred for these loci have close correspondence with Hemi-Preference Ratio, a metric 179 we have used previously (S2 Text).

180
A substantial role for conservative methylation processes in murine differentiated 181 cells was also revealed by RCP analysis. Data sets from Stöger [28] for adipose tissue, 182 and from Arand et al. [14] for embryonic fibroblasts (MEFs) each gave an RCP point 183 estimate greater than 18.2 (Fig 2a and b). These analyses of single-copy loci in human 184 and murine differentiated cells revealed, in addition, that RCP values indicative of a 185 strong role for conservative processes can be inferred for data sets with disparate 186 methylation frequencies, ranging from sparse to dense (Fig 2). 187 Figure 3. Inferring RCP in human and murine stem cells. Methylation patterns in undifferentiated, cultured human and murine stem cells were consistently inferred to have substantial contributions from conservative processes, with concordance much greater than the random expectation. (a) Data from human L1 and Lep, collected using hairpin-bisulfite PCR and dideoxy sequencing, are taken from published [28] and previously unpublished work (Table S3). (b) Data from three single-copy loci and four multi-copy loci from murine ES line J1 cells, Igf2, Snrpn, Tex13, B1, IAP, L1, and mSat, were collected by Arand et al. [14] using hairpin-bisulfite PCR and pyrosequencing (Data in Table S4). "All DNA" data were collected by Zhao et al. [15] using hairpin-bisulfite PCR and pyrosequencing, and assess methylation at approximately 17.3% of CpG/CpG dyads in the murine genome (Table S5). All data were corrected for inappropriate and failed conversion of methylcytosine using methods described in Supporting Information (S4 Text). Conversion-error rate estimates and 95% confidence intervals on RCP are given in Tables S3, S4, and S5.
Inferring RCP for Loci and Genome-Wide CpGs in Cultured 188 Stem Cells 189 We next analyzed concordance preference in undifferentiated, cultured stem cells to 190 assess with RCP the conclusion of Shipony et al. [16] that methylation in such cells 191 occurs primarily through non-conservative, random methylation. All six of the human 192 stem and iPS cell lines that we examined, when grown under non-differentiating 193 conditions, gave RCP point estimates within the narrow range of 3.41 to 5.20 (CI given 194 in Table S3) for the 5'-UTR of the multi-copy locus L1, showing little evidence of 195 heterogeneity in RCP (p > 0.29, pairwise two-tailed permutation tests (PTs) ; Fig 3a; 196 Table S3; further discussion in S5 Text). The single-copy Lep locus in a cultured murine 197 ES line [30] yielded an RCP within this range (Fig 3a), and not significantly different  To extend our analysis of methylation in stem cells, we evaluated published data 203 from additional loci [14], CpGs across the genome, and other classes of genomic 204 elements [15] from cultured murine ES cells. Three of the multi-copy loci -B1, IAP, 205 and L1 -had RCP values within the narrow range described above (Fig 3b), with no 206 evidence of heterogeneity among the three loci (p > 0.55, pairwise two-tailed PTs; 207 inferred from data of Arand et al. [14]). One multi-copy locus -mSat -yielded a 208 significantly higher RCP value (p < 0.017, pairwise two-tailed PTs). Single-copy loci -209 Igf2, Snrpn, and Tex13 -also had significantly higher RCP values (p < 0.019, pairwise 210 two-tailed PTs).

211
Near-genome-wide double-stranded methylation data from Zhao et al. [15], gave an 212 inferred RCP of 5.22 for "all CpGs" in DNA from undifferentiated, cultured murine ES 213 cells. For other classes of genomic elements in this genome-scale study [15], we inferred 214 RCP values of 4.31 or greater (Table S5). In all cases, RCP values were significantly 215 greater than the value of 1 expected under random methylation processes (RCP ≥ 3.69; 216 p < 10 −6 , maximum-likelihood comparison test (MLCT); Fig 3b). Thus, RCP values 217 greater than 1 are a consistent feature of cultured embryonic stem cells, and hold across 218 a broad set of genomic locations and element categories. 219 We considered the possibility that spontaneous differentiation had produced 220 subpopulations of cultured stem cells that might account for the inference of RCP 221 values substantially greater than 1 at all loci and genomic elements examined. Our 222 calculations revealed that a possible subpopulation of differentiated cells operating at 223 much higher RCP than that of undifferentiated cells would need to comprise more than 224 50% of the population to account for our finding (S9 Text). Morphological inspection of 225 the cultured human stem cells under non-differentiating conditions did not suggest the 226 presence of a substantial subpopulation of differentiated cells in any of these lines. We 227 conclude that overall methylation in cultured mouse and human ES and iPS cells is   Our finding of substantial preference for methylation concordance in these data from 231 cultured, undifferentiated stem cells contrasts with the inference of Shipony et al. [16] 232 9/33 that DNA methylation in such cells is dominated by non-conservative, random 233 processes. This disparity led us to ask whether our approach here for data acquisition 234 and analysis is indeed capable of identifying data sets established under exclusively 235 random processes, which are expected to yield RCP values of 1 (see "Defining the Ratio 236 ...", above). To assess this capacity, we considered data collected from the murine Dnmt 237 and Np95 knockout (KO) lines, and compared RCP values at various loci to those for 238 normal murine ES and differentiated cells [14,28,30].

239
The Dnmt1KO cell line is of particular interest in this regard because Dnmt1, known 240 as the maintenance methyltransferase, has the strongest known preference for 241 establishing concordant methylation [11,22,31]. For murine ES cells homozygous for a 242 knockout at Dnmt1, or for a knockout mutation at Np95, which encodes a protein 243 critical for Dnmt1 maintenance activity [32], we inferred a range of RCP point estimates 244 of 1.02 to 2.35 for repeated elements and single-copy loci from Arand et al.  Table S3). 248 DnmtKO and Np95KO cell lines had RCP values less than those inferred for wildtype 249 murine stem cells at all loci examined (p < 0.049, two-tailed PTs; Table S4). While for 250 ten of thirteen data sets examined, RCP values were inferred to be greater than 1, in  In the absence of Dnmt1, RCPs inferred for cultured murine ES cells were close to 1, the random expectation, at some loci. Data from Lep, collected using hairpin-bisulfite PCR and dideoxy sequencing, are from Al-Alzzawi et al. [30]. Data from seven additional loci are from Arand et al. [14], who used hairpin-bisulfite PCR and pyrosequencing. Our analysis of these published data revealed two of the eight loci analyzed -Afp and B1 -to have very low RCP values whose 95% confidence intervals include the possibility that methylation was established entirely by random processes. Data were corrected prior to RCP analysis for inappropriate and failed conversion of methylcytosine using methods described in Supporting Information (S4 Text). Conversion-error rate estimates and 95% confidence intervals on RCP are given in Tables S3 and S4. 10/33 respectively, and p = 0.38 for B1 in Np95KO cells, one-tailed bootstrap tests; Table S4). 255 These findings confirm the ability of the RCP framework to detect methylation 256 established with little or no preference for concordance or discordance, and strengthen 257 our conclusion that cultured stem cells do have substantial contributions from 258 conservative DNA methylation processes (Fig 3). 259 Our inference of some RCP values slightly but significantly greater than 1 in cells 260 deprived of Dnmt1 or its activity hints at some residual maintenance activity 261 contributed by the Dnmt3 methyltransferases. We inferred a complex role for Dnmt3s 262 in our analysis of data from Dnmt3KO lines (S10 Text), cconsistent with earlier 263 suggestions [19,22,33] that the contributions of the Dnmt3s can include low levels of 264 maintenance activity.

11/33
Shifts in RCP Upon Differentiation and Dedifferentiation of Cul-266 tured Cells 267 Our initial examination of RCP values in differentiated cells as compared to cultured 268 stem cells suggested that RCP is altered through the differentiation process (Fig 2 and  two-tailed PTs), and approached the lower boundary of the confidence region inferred 277 for single-copy loci in differentiated somatic cells (Fig 2 and Fig 5a). Using 278 near-genome-wide data for cultured murine cells [15], we inferred significant RCP 279 increases upon cell differentiation for most genomic elements (p < 10 −20 , MLCTs), with 280 the exception of low-complexity and satellite DNAs (Table S5). These RCP increases 281 were greatest at promoters, CG islands, and CG island shores, and are more muted at 282 other regions. We conclude that the onset of differentiation in cultured human and 283 murine cells is usually associated with a shift towards a greater role for conservative 284 processes. 285 We then asked if the dedifferentiation that occurs upon production of an iPS line 286 from a differentiated cell would have opposite effects on concordance preference. We Our inference of ∼4-fold-or-greater preference for concordance in many different 298 cultured ES and iPS cell lines (Fig 3a; Table S3) indicates that it is possible for cultured 299 stem cells to have appreciable levels of conservative methylation. We next asked whether 300 uncultured stem cells taken directly from an embryo also have strong contributions from 301 conservative methylation processes. In so doing, we can also ask whether totipotent 302 stem cells are similar to pluripotent stem cells in their levels of conservative methylation. 303 To investigate these questions, we applied RCP to double-stranded methylation 304 patterns collected by Arand et al. [17] for three multi-copy loci in mouse embryos, L1, 305 mSat, and IAP. Our analyses above indicate that, in general, multi-copy loci have RCP 306 values in cultured stem cells that are lower than those for single-copy loci, and are 307 similar to or lower than many other genomic elements (Figs 2, 3, 4, and Tables S3, S4, 308 S5). Analysis of the three multi-copy loci is therefore expected to provide a conservative 309 estimate of the concordance preference of broader genomic regions.  [17]. For both early embryonic stages and primordial germ cell development, progressions over time in m, U , and subsequently RCP values, are indicated with arrows. In cases where data were available for multiple replicates, the dyad frequencies were pooled. Corresponding estimates of RCP, corrected for bias, are given in the associated tables and in (c). (a) Transitions from early pronuclear stages (1 and 3) to late stages (4-5) were accompanied by a sharp decrease in RCP, to a level similar to that observed in cultured stem cells (Fig 3). Further embryonic development was accompanied by minor increases and subsequent decreases in methylation and RCP values. (b) Primordial germ cells (PGCs) at the earliest stage for which data are available, 9.5 days post conception (dpc), had RCP values at L1 elements even lower than those inferred for Dnmt1-KO cells. RCP values increased during PGC maturation to stage 13.5 dpc even as methylation frequencies decreased by more than 50%. For differentiated gametes, RCP values were similar to those found in other differentiated cells (see Fig 2), while methylation frequencies differed markedly between eggs and sperm. (c) Tracking RCP at multi-copy loci through early embryonic development. We inferred RCP for the various embryonic stages for which methylation data were reported by Arand et al. [17]. Our figure parallels their Fig 6b, which used the percentage of hemimethylated CpG dyads relative to all methylated CpG dyads, rather than the RCPs inferred here. Point estimates and confidence intervals (95%) for all but one data set were estimated by bootstrapping and applying the BCa correction (S7 Text). Double-stranded sequences were unavailable for 3 dpc at mSat, and thus a likelihood-based method was used for the estimation of its confidence intervals (S8 Text). Conversion-error rate estimates and 95% confidence intervals on RCP are given in Tables S4.

13/33
Our analyses for these three multi-copy loci revealed that totipotent embryonic cells, 311 taken from the two-cell to the morula stages (through 3 days post conception (dpc)), 312 also exhibit substantial preference for concordance, and hence, conservative methylation. 313 RCP values ranged from 3.71 to 9.83 over this interval (Fig 6a; Table S4), with the 314 exception of a transiently lower RCP value of 2.03 for the IAP locus in the 315 prereplicative two-cell zygote (Table S4). Data for uncultured pluripotent stem cells 316 from mouse embryos (gastrula, 3.5 dpc), although considerably more limited than for 317 totipotent stem cells, indicate similarly strong preferences for concordant methylation at 318 two of the three multi-copy loci examined, with IAP again being the exception. Thus, 319 relative to random expectations, uncultured pluripotent and totipotent cells from mouse 320 embryos -with only transient exceptions at one locus -do have substantial 321 preferences for concordance. This echoes our findings for cultured pluripotent murine 322 and human ES and iPS cells. We conclude that the level of preference for concordant 323 methylation exists in the embryo and arose neither in the transition from totipotency to 324 pluripotency nor in the establishment and culturing of stem cells (Fig 3).

325
RCP values for specific loci and for most genomic regions are consistently lower for 326 stem cells than for differentiated cells (Fig 2 and Table S4). This high preference for 332 concordance in murine sperm and oocytes implies that the lower RCP values 333 characteristic of zygotes must arise during post-fertilization events.

334
To identify the timing of the transition to the lower RCP values observed in stem 335 cells, we next assessed data from post-fertilization nuclei and cells [17] . Pronuclear 336 stages 1 and 3 revealed high RCP values, similar to those observed in gametes. In 337 pronuclear stages 4-5, however, there was an abrupt transition to lower RCP values in 338 the range observed for totipotent stem cells (Fig 6).

339
Is this transition dependent on a DNA replication event that occurs between 340 pronuclear stages 3 and 4-5? To address this question, we assessed data from Arand et 341 al. [17], in which aphidicolin was used to block DNA replication in the fertilized egg. Our 342 RCP analysis reveals that methylation patterns at L1 and mSat in these treated nuclei, 343 while having somewhat lower RCP values relative to pronuclei at earlier stages, had not 344 undergone the major reduction in RCP that we inferred for unmanipulated pronuclei at 345 stages 4-5 (p < 10 −5 , two-tailed PTs; Table S4). Thus, the shift to lower RCP in stem 346 cells following fertilization appears to require either DNA replication or some later event 347 that is itself replication-dependent. Our conclusion that pronuclear replication is pivotal 348 to marked changes in RCP is consistent with the inference of Arand et al., using a 349 different metric [17], that this replication is critical to changes in methylation dynamics. 350 The change from high RCP values in gametes and early pronuclei to lower values in 351 stem cells was markedly abrupt. Are other major transitions in RCP similarly abrupt, 352 or do some occur gradually, perhaps over many cell divisions? Murine primordial germ 353 cells (PGCs), in their progression to mature gametes, offer an opportunity to approach 354 this question [34,35]. Using data from Arand et al. [17], we inferred that RCP values for 355 PGCs increased by factors ranging from 2.5 to 5 during their progression from 9.5 dpc 356 to 13.5 dpc. This increase was not sudden but gradual over the four-day period, 357 spanning an interval of substantial cell proliferation [34] (Fig 6b). The murine embryo 358 data thus provide examples of both abrupt and gradual transitions in RCP through 359 14/33 development (Fig 6c).

360
The RCP values for PGCs at 9.5 dpc, the earliest stage for which data were reported 361 by Arand et al. [17], were strikingly low. For all three loci studied, RCPs -1.40 for L1, 362 1.57 for mSat, and 2.38 for IAP -were either not significantly different from or lower 363 than those inferred for the same loci in the Dnmt1-KO line (L1 : p < 10 −5 ; mSat: 364 p = 0.78; IAP : p = 0.93; two-tailed PTs; Fig 4). These were among the lowest RCP 365 values we inferred for the developing mouse embryos, and perhaps reflect the epigenetic 366 flexibility required for the production of gametes.

367
Over the first few rounds of PGC division, developmental reprogramming and 368 commitment are occurring [35], and during this time lineage-specific gene expression 369 patterns are arising. The gradual increase in RCP observed across division of these 370 PGCs may mirror the increases in RCP values that occur when cultured ES cells are 371 subjected to differentiating conditions (Fig 5).

372
There is, however, a critical difference between the trajectories for proliferating 373 PGCs and differentiating ES and iPS cells in culture. When cultured cells were 374 differentiated, their methylation frequencies increased. By contrast, methylation 375 frequencies for PGCs declined across rounds of proliferation [36]. Seisenberger et al. [7], 376 Arand et al. [17], and von Meyenn et al. [18] concluded that this reduction in 377 methylation frequency is driven by partial impairment of maintenance methylation.

378
Our RCP framework permits a closer look at the likely extent of this proposed 379 maintenance impairment. If maintenance methylation were completely impaired and no 380 other methylation processes were active, the resulting passive demethylation would be 381 fully dispersive: methyl groups would remain only in hemimethylated dyads. A fully 382 dispersive process such as this should halve the methylation frequency at each DNA 383 replication event and yield an RCP value of 0. Such extremely low values of RCP were 384 inferred only for cells treated with S-Adenosylmethionine-ase (Table S4). As this 385 treatment either impairs or eliminates a cell's ability to methylate DNA, it reveals RCP 386 trajectories that would be observed with complete or nearly complete suspension of all 387 methylation processes. Our finding of RCP values greater than 0, as well as 1, suggests 388 not only that conservative processes remain in maturing PGCs, but also that their 389 contributions are sufficient to outweigh any dispersive effects of passive demethylation. 390

391
The Ratio of Concordance Preference, RCP, quantifies the degree to which exact 392 information is conserved when binary information is transferred in a given system. With 393 an explicit null model for the transfer of information, RCP is a scale for assessing 394 observed preference for concordance as compared to a random expectation, such as the 395 random placement of methyl groups. It thus permits quantitative comparison of 396 information-transfer systems that have disparate binary-state frequencies. Here, we 397 have applied RCP to double-stranded CpG DNA methylation data from cultured and 398 embryonic cells to track and to understand the shifts in their strategies for information 399 transfer, with particular focus on changes across differentiation states and 400 developmental time. The mechanism-free nature of RCP allows the comparison of this 401 DNA methylation property not only across different cell types and developmental stages, 402 but also across organisms whose methylation machineries may differ significantly.

403
Application of RCP to double-stranded DNA methylation patterns enabled us to 404 15/33 answer three questions raised above: (i) in cultured pluripotent and in embryonic 405 totipotent stem cells, preference for concordance is reduced but not eliminated 406 compared with differentiated cells; (ii) cellular differentiation for some classes of 407 genomic elements such as promoters is characterized by increasing preference for 408 concordance, whereas dedifferentiation reflects a decline in preference for concordance; 409 and (iii) concordance preference decreases markedly following fertilization and the first 410 round of DNA replication, but remains substantial through totipotency and early stages 411 of pluripotency; at three multi-copy loci, primordial germ cells initially have nearly no 412 preference for concordance, perhaps reflecting the high level of epigenetic flexibility en 413 route to production of functional gametes.

414
RCP analysis of cultured pluripotent and embryonic totipotent cells provided an 415 opportunity to reassess the balance between conservative and non-conservative processes 416 that contributed to the "noisy" methylation patterns described by Shipony et al. [16]. 417 We found evidence of substantial contributions from conservative processes in stem cells 418 -both cultured pluripotent and embryonic totipotent -using a variety of single-and 419 multi-copy loci, and near-genome-wide data for all-CpG methylation. Our results are 420 thus contrary to the conclusion that "dynamic" processes dominate DNA methylation 421 in stem cells [16].

422
The disparity between the proposal of dynamic methylation by Shipony et al. [16] 423 and our inference of appreciable conservative methylation in cultured stem cells may 424 have arisen from methodological differences in data collection. Inappropriate conversion 425 of methylcytosines can occur during prolonged bisulfite treatment [37], producing 426 artifactual variation among observed patterns [38] (S3 Text). Under methods that 427 quantify diversity among single-stranded patterns but do not account for inappropriate 428 conversion, such errors will lead to overestimation of the role of random as compared to 429 conservative processes.  [18]. 446 We have observed that loci in a given cell type can have disparate RCP values, with 447 the trajectories of RCP changing through time. Significant differences in RCP were 448 observed among loci in individual lines of cultured ES and iPS, and embryonic cells, 449 with particularly pronounced differences between some single-and multi-copy loci 450 (Fig 3b). Furthermore, while the three multi-copy loci we analyzed in data from 451 developing embryos yielded relatively consistent inferences of RCP for each stage, one 452 16/33 locus was observed to show larger RCP shifts during and near totipotency (Fig 6c). 453 Locus-specific features such as imprinting and chromosome location likely contribute to 454 some of this variation. Differences in developmental trajectories of flexibility and 455 stability of DNA methylation may correspond to variation among loci in the timing of 456 their epigenetic sensitivity to environmental conditions. By contrast, the grouping of 457 genome-wide data [15] into functional classes (Table S3) may tend to mute the observed 458 RCP changes during development, as members of each class are likely to differ in their 459 involvement in the processes of differentiation. The six human ES and iPS cell lines for which we collected methylation patterns 477 were derived from either embryos or fibroblasts described as normal [42] or from 478 fibroblasts of individuals with disorders not known to affect the basic biochemistry of 479 DNA methylation.

480
Many of the sets of human DNA methylation patterns analyzed here were presented 481 in previous publications, which include information on University of Washington Human 482 Subjects approval for collection and use. These data include G6PD and FMR1 from 483 leukocytes of normal individuals [22,27]; FMR1 from males with fragile X syndrome [29]; 484 LEP from male leukocytes and from female lymphocytes and adipose tissue [22,28]. Methylation patterns from murine ES cells, and the origin and culturing of these 508 cells, have previously been described [14,15,17,30].

509
Collection of Double-Stranded DNA methylation patterns using 510 hairpin-bisulfite PCR 511 The DNA methylation patterns collected in our lab and analyzed here, both those 512 published previously and those presented here for the first time, were collected using the 513 hairpin-bisulfite PCR approach [13], with barcodes and batchstamps to authenticate 514 each sequence [45]. Details for collection of each data set are given in Table S2. 515 The data presented by Arand et al.] [14,17], and Zhao et al., [15] and analyzed here, 516 were collected in the absence of molecular batch-stamps and barcodes, raising the 517 possibilty that the reliably of those data sets is undermined by PCR clonality. However, 518 both groups used alternate strategies that revealed that PCR clonality was not rampant. 519 Zhao et al. found that essential features of data sets did not differ appreciably between 520 the "real" data set collected conventionally, using PCR, and a test, PCR-free data set 521 that excluded opportunities for clonality by including only one read from each locus,    . RCP values and associated confidence intervals inferred for the 24 data sets from our labs. We collected double-stranded DNA methylation patterns from two species -mouse and human -and several loci using bisulfite conversion under either low-molarity/temperature ("LowMT") or high-molarity/temperature ("HighMT") conditions [38]. For each data set, we counted methylated (M ) , hemimethylated (H), and unmethylated (U ) dyads, and used these values to infer methylation frequency, m, unmethylated dyad frequency, U , and the ratio of concordance preference, RCP. Inferences for (m, U ) and for RCP incorporated correction for conversion error, as described in S3 Text. We estimated 95% confidence intervals by bootstrapping and applied the BCa correction to obtain bias-corrected confidence intervals and point estimates, as described in S7 Text. Both uncorrected and BCa-corrected intervals and point estimates are listed here.   [15]. Dr. Hehuang Xie kindly shared raw data on M , H, and U values for samples described in Figure S6 of their paper, and also provided information on error rates from bisulfite conversion: for the "Day 0" sample (cultured stem cells grown under non-differentiating conditions), the failed conversion rate was 0.011, and the inappropriate conversion rate was 0.0109; for the "Day 6" sample (cultured cells grown for 6 days under differentiating conditions), the corresponding error rates were 0.012 and 0.0099. Dr. Xie also commented, "'All' refers to the total CG dyads and 'DNA' refers to the CG dyads within 'DNA repeat elements (DNA)' annotated in the UCSC genome database." For these data sets, whose individual sequence reads only contain 2 to 3 dyads on average, we assumed independent sampling of dyads to obtain the confidence intervals (S8 Text).  [17]. A color gradient encodes approximate levels of significance, with dark green indicating non-significant differences, and red indicating highly significant differences.
23/33 S1 Text: Deriving the Ratio of Concordance Preference, RCP The goal of RCP is to quantify the extent to which the DNA methylation machineries that gave rise to each data set deviate from random expectations under the binomial distribution, as indicated by an over-or under-abundance of concordant dyads. Here we approach this problem by modeling the formation of concordant and discordant dyads as transitions between unmethylated and hemimethylated dyads and between hemimethylated and fully methylated dyads, without regard to the molecular processes that facilitate these transitions. We then take a mathematical approach to derive an expression for RCP.
We seek the equilibrium frequencies of fully methylated (M ), hemimethylated (H), and unmethylated (U ), dyads. Consider a continuous-time Markov chain operating on the probability distribution of the dyads M, H, U : where η's and γ's represent the rates of methylation addition and removal, respectively, as shown in Fig S1. In the meantime, we define RCP as the ratio between β c , the rate of dyad conversions yielding concordant dyads, and β d , the rate of dyad conversions yielding discordant dyads. We thus define β c = √ η 1 γ 0 , the geometric mean of the methyl-addition and methyl-removal rates yielding concordant dyads. Likewise, we define β d = √ η 0 γ 1 .
We can then solve for the steady state distribution for the Markov chain in Equation (2) to arrive at the ratio. We can also express in in terms of m and U , as shown below.
This formulation of RCP does not require the assumption that methylation frequency is constant over time. Here, "steady state" refers to the dyad frequencies expected under a given system of methylation processes, regardless of whether the methylation frequency is constant over rounds of cell division.
It is notable that RCP 2 is 4M U/H 2 , which is expected to equal 1 under the Hardy-Weinberg equilibrium [23], if dyad frequencies are considered as genotype frequencies of a gene with two alleles. Following this, RCP can be considered as a measure of deviation from the null equilibrium.
RCP is therefore a metric for the degree to which the system of methylation processes prefers concordance (RCP > 1), discordance (RCP < 1), or, possibly, exhibits no preference in either direction (RCP = 1). If we set RCP = 1 and solve this expression for U , we find that U = (1 − m) 2 . This is consistent with the expectation under the binomial distribution that RCP will be 1 when there is no preference for either concordance or discordance. As RCP approaches ∞, U approaches 1 − m. Setting RCP = 0 results in two solutions: U = 1 − 2m and U = 0. These solutions are congruent with the boundaries that define the space of (m, U ) as given in "Defining the Ratio of Concordance Preference for All Possible Methylation Configurations" of the main text, and in Fig 1.

S2 Text: Comparing RCP Values and Hemi-Preference Ratios from HMM
"Hemi-preference ratio", a parameter inferred under our earlier analysis with a hidden-Markov model (HMM) [22], evaluates the preference of a given DNA methyltransferase for acting at hemimethylated as compared to unmethylated dyads [11], and thereby measures its preference for creating concordant dyads. Because RCP measures the concordance preference of the entire ensemble of enzymes that give rise to methylation patterns, the hemi-preference ratio of a given enzyme and the RCP inferred from the same data set are expected to have good agreement if that enzyme is the primary actor. The congruence between these two metrics is expected to decline with increasing contributions from other enzymes.
Three of the four data sets that had been analyzed under HMM showed very good agreement between the RCP values we infer here and the hemi-preference ratios previously inferred for the maintenance methyltransferase DNMT1: 58.0 vs. 58 for FMR1, 13.3 vs. 15 for G6PD, and 89.1 vs. 94 for LEP [22] (Table S1). The close correspondence between these values indicates that for these loci in leukocytes, methylation dynamics are driven primarily by conservative, maintenance-type processes such as accomplished by DNMT1, and that neither active demethylation nor de novo processes have a substantial role.
The fourth data set that had been analyzed under HMM, LEP in human adipose tissue, had an inferred RCP value of 34, well within the range of RCPs inferred for other data sets from single-copy loci including LEP in leukocytes (Fig 1a; Table S1). This RCP value was, however, more than eighteen-fold lower than the DNMT1 hemi-preference ratio estimate of 628 that we obtained under the earlier HMM approach (Table S1). What might account for this discrepancy? A hemi-preference ratio of 628 is unrealistically high, even for a maintenance enzyme, compared to the hemi-preference ratios inferred in other data sets, including those from methylation patterns established by DNMT1 in vitro [11]. This could reflect the inability of the HMM to yield a reasonable estimate for a data set impacted by demethylation, a process that was not considered in the HMM design. This lack of correspondence between HMM and RCP estimates is consistent with the possibility that active removal of methylation has a heightened role at loci with temporally variable transcription levels, and may reflect the role of the LEP locus as a sentinel of adipose but not blood [28].

S3 Text: Assessing the Potential Impact of Bisulfite-Conversion Errors
Two classes of error that occur during bisulfite-conversion can lead to misinterpretation of cytosine methylation states. Failure to convert unmethylated cytosines to uracil occurs at rate b, and can result in overestimation of methylation frequency. Inappropriate conversion of methylated cytosines to thymine, first noted by Shiraishi and Hayatsu,[46] occurs at rate c, and can result in underestimation of 25/33 methylation frequency. We have reported previously that the rates of failed and inappropriate conversion depend strongly on the chemical and thermal conditions of bisulfite conversion. [38] In particular, we found that bisulfite treatment prolonged beyond that required to attain complete or nearly complete conversion of unmethylated cytosines can yield high rates of inappropriate conversion. Historically, conversion protocols have been designed to minimize the failed-conversion rate, with little or no attention to the rate of inappropriate conversion events. Thus, while both classes of error can alter parameters used to infer RCP -the methylation frequency, m, and the unmethylated dyad frequency, U -errors arising through inappropriate conversion are of particular concern.
How severely can conversion error impact RCP? And to what extent does its potential impact depend on the true methylation frequency of the target sequences? Densely methylated sequences contain a large number of fully methylated dyads, such that the most likely conversion error is inappropriate conversion yielding apparent hemimethylated dyads. Such events elevate the apparent level of discordance in a given data set and artifactually depress the inferred RCP values. To quantify this potential impact, we calculated how RCP would be altered by a single artifactual hemimethylated dyad introduced by inappropriate conversion, assuming that in its true state the relevant data set had one unmethylated dyad, one hemimethylated dyad, and 98 methylated dyads. The inappropriate conversion of one cytosine among 98 dyads whose true state is fully methylated is close to the number expected for an inappropriate conversion rate of about 0.5%. For this hypothetical data set, the introduction of a lone artefactual hemimethylated dyad by an inappropriate conversion event reduces RCP from about 20 to about 10, i.e., by a factor of 2. Thus, in the absence of mathematical corrections of the sort implemented here, even very low levels of inappropriate conversion can severely impact inference of RCP.
Sparsely methylated sequences contain a large number of unmethylated cytosines that are potential targets for failed conversion. We calculated that for such sequences the impact of conversion error depends on whether most unmethylated cytosines are in hemimethylated or in unmethylated dyads. For example, when most unmethylated cytosines in a sparsely methylated sequence occur in hemimethylated dyads, the level of discordance is already high, such that the addition of an artifactual hemimethylated dyad by failed conversion only slightly decreases RCP. By contrast, when most unmethylated cytotosines are in unmethylated dyads, production of an artifactual hemimethyated dyad by failed conversion reduces RCP by a factor of two. Mathematical correction for conversion error is therefore critical, not only because error has potentially large impacts on absolute RCP values, but also because the magnitude of these impacts differs among data sets, with the potential either to magnify or to dampen variation among them.

S4 Text: Mathematical Correction for Bisulfite-Conversion Error
To account for conversion errors occurring at known rates, we express the observed dyad frequencies, M f obs , H f obs , and U f obs , as functions of the true dyad frequencies M t , H t , and U t , that would have been observed had conversion error not occurred: In cases where the mathematical correction yielded negative counts for one or two dyads, the negative counts were redistributed to the remaining dyads in proportion to their original counts, such that no dyad counts were negative after correction.
For each of the data sets reported here, DNAs were converted under one of two conditions: low molarity-temperature (LowMT), with failed-conversion rate 0.0030 and inappropriate conversion rate 0.031, and high molarity-temperature (HighMT), with failed-conversion rate 0.0086 and inappropriate-conversion rate 0.017 [38]. Error rates used in analysis of published data from other groups are given in Table S4 (Arand et al. [14,17]) and Table S5 (Zhao et al. [15]). Using these rates and observed dyad frequencies, true dyad frequencies can be inferred by maximum likelihood.

S5 Text: Assessment of Heterogeneity Among Replicates and Quasi-Replicates
To ask about possible heterogeneity among RCP values inferred for replicates -for example, for individual developmental stages as investigated by Arand et al. [17] -and for quasi-replicates -for example, for multiple cell lines at similar cell-culture and differentiation conditions -we utilized pairwise, two-tailed permutation tests.
These calculations revealed an intriguing pattern. There was no evidence of heterogeneity among samples of cultured human ES and iPS cells (p > 0.29, pairwise two-tailed PTs for the six cultured stem cell lines at the L1 locus; p-values summarized in Table S6; Fig 3). By contrast, several pairwise comparisons of mouse embryonic cells sampled independently for a given developmental stage yielded evidence of significant differences in RCP (p-values summarized in Table S6). Notably, however, most of the significant heterogeneity observed was for cells in pronuclear stages, at which data collection at identical stages is made difficult by rapid developmental transitions. On the other hand, there was evidence of only limited heterogeneity among totipotent and pluripotent cells between the 2-cell stage and the 3.5-dpc stage. Similarly, when experimental interventions were taken to prevent DNA replication or methylation (i.e. aphidicolin treatment at PN4/5 stages and SAMase treatment just after fertilization and before the first round of DNA replication), no evidence of heterogeneity was observed. Cell culture techniques are focused on minimizing opportunities for developmental differences to arise among cells, and could account for the observation of no heterogeneity among the cultured human ES and iPS cells.
Given the low levels of heterogeneity observed and the fact that some level of heterogeneity is inevitable in rapidly developing cells, we pooled several sets of replicates for analysis, as described in S6 Text (Fig 6).

S6 Text: Pooling Data Sets Across Replicates
In some instances, multiple independent replicates were collected and used to assess RCP for cells under a given biological condition -for example, multiple embryos at a 27/33 given pronuclear stage (Fig 6). In these cases, the sequence-level dyad counts of individual replicate data sets were corrected for conversion errors independently before the replicate data sets were pooled.

S7 Text: Inferring and Comparing RCP Without Assuming Independent Sampling of Dyads
CpG/CpG dyads typically are not sampled individually, but instead as members of sequence reads that can contain from a few to many neighboring dyads. Often, there is correlation among the methylation states of these neighboring dyads. The processivity of the DNA methyltransferases, especially DNMT1, is a substantial contributor to these correlations. As the mean number of dyads per read increases, so too does the potential for dyad-dyad correlation to undermine the accuracy of confidence intervals inferred under the assumption of independent sampling of dyads.
Of the three groups of data we analyzed here, one -Zhao et al. (2014) [15] consists of reads that contain, on average, only two to three dyads, such that correlation among the dyads ascertained on a given molecule is anticipated to have a fairly minor impact on sampling. Reads for our data have as many as 22 dyads; mean dyad counts for data from Arand et al. [14,17] are intermediate between our data and those of Zhao et al. (2014) [15].
For our own data and those of Arand et al. [14,17], we used an approach that, at slightly higher computational cost, models and seeks to account for the potential impacts of dyad-dyad correlations. Our data yielded moderately larger confidence intervals under the bootstrapping approach as compared to under the likelihood approach with the assumption of independent sampling of dyads. By contrast, the data from Arand et al. [14,17] yielded almost identical confidence intervals without and with the assumption of independence. In view of the even smaller mean number of dyads per read in the data of Zhao et al. (2014) [15], we chose to make the not unreasonable assumption that dyads were sampled independently in their data. Although it is possible that two or more reads originated from nearby regions of a single molecule and thus have dyad-dyad dependence, we assumed that the effect of such occurrences, if at all present, is very small, given the large amount of starting material.
The first of the two methods is described in this section, and the second in the next section, S8 Text.
Inferring RCP point estimates and confidence intervals. For our own data and those of Arand et al., we used a bootstrapping approach to model the uncertainty in RCP values introduced by possible within-sequence correlations in methylation states, and to make point estimate and confidence-interval inferences that account for this uncertainty. A more rigorous examination of correctness for our bootstrapping approach is in process.
For each data set of n sequences, we sampled n sequences with replacement, B = 2,000,000 times, and calculated the RCP for each of the B bootstrapped data sets. We inferred the true distribution of RCP for a given biological condition using the bootstrap distribution. It was clear from the resulting distributions that RCP is a biased estimator, as many of these distributions had longer right tails than left.
The simplest, and potentially misleading, approach for inference of point estimates 28/33 and construction of confidence intervals from bootstrap distributions is to assume normality, and then to exclude right and left tails at the intended level of confidence. Efron and DiCiccio (1996) [47] commented that, for biased estimators, this approach can lead to inference of inappropriately exclusive limits at the long-tailed end of the distribution, and inappropriately inclusive limits at the right-tailed side.
To address this problem, we applied Efron and Diccicio's "bias-corrected and accelerated" (BCa) method. Under the BCa method, the cumulative-density function observed for the distribution of bootstrap replicates is compared to that expected under normality. Bias-corrected point estimates are then inferred as the 50th-percentile values in the BCa corrected distributions. Similarly, critical points for intervals of a given confidence level are inferred from the values at the relevant percentiles of the distribution of bootstrapped values.
Assessing whether a data set has RCP value greater than 1. If methyl groups are placed completely at random -that is, with preference for neither concordance or discordance -RCP is expected to be 1. In a previous report, Shipony et al. [16] interpreted their data to indicate that methyl groups are, indeed, placed essentially at random in undifferentiated cells. As there is very little evidence in any of our data sets to indicate possible preference for discordance, we opted to perform a one-tailed test, asking for the probability that our data sets do not have RCP values greater than 1.
To do so, we first calculated RCP point estimates for 200,000 bootstrap replicates, using the method described above, and then calculated the p-value as the fraction of those point estimates that were less than or equal to RCP of 1. For example, the finding that only 20 of the 200,000 bootstrap replicates yielded an RCP point estimate less than 1, we would conclude that, for this data set, RCP is significantly greater than 1 (p = 0.0001).
Assessing whether RCP values differ significantly between data sets. A key goal of our study is to assess possible evidence for RCP differences between data sets. For example, we ask whether RCP for a given cell type differs between samples collected under differentiating as compared to non-differentiating conditions.
To compute the significance of observed differences between RCP values for Data Sample A and Sample B, we used permutations to compute a null distribution of ordered differences (for example, RCP(Sample A) − RCP(Sample B)) expected under the null assumption that the sequences in the two sets were drawn from a single population. To do this, we pooled sequences from Sample A and Sample B and then repeatedly drew from that pool, without replacement, to generate, at random, versions of Sample A and Sample B with sequence counts equal to those of the observed data. For each pair of randomly generated sets, we calculated the difference between their RCP point estimates (θ * ) , and obtained the distribution of RCP-difference values under the null hypothesis. For one-tailed tests, for example, to ask whether one data set has RCP value greater than the RCP value for another, we took the proportion of the distribution greater than the difference of the point estimates (θ) as the p value. For two-tailed/equal-tailed tests, for example, to ask whether RCP values for two data sets differ significantly, we first calculated the proportion of permuted differences smaller than (θ). We then calculated the proportion of permuted differences greater than (θ). We took the twice the lesser of these two values as the p-value for observing a difference this great in the event that the sequences in the two data sets were, in reality, drawn from the same distribution.
Defining the 95% confidence region for a data set in the (m, U ) space, without assuming independent sampling of dyads. Using the methods described above, we generated 2,000,000 bootstrap samples of the data set. Instead of estimating the RCP value for each of the bootstrap samples, we calculated m and U . We constructed the two-dimensional confidence region for the two parameters for plotting using ci2d function in the gplots R package.
S8 Text: Inferring and Comparing RCP With Assuming Independent Sampling of Dyads Calculating the likelihood of proposed true dyad frequencies, M t , H t , and U t , given the observed dyad counts, M c obs , H c obs , and U c obs . Failed-and inappropriate-conversion events create observed dyad frequencies that differ from true dyad frequencies. If the rates of these two types of error are known, the likelihood of a set of proposed true frequencies -M t , H t , and U t -can be calculated as follows, given the observed dyad counts -M c obs , H c obs , and U c obs -and Equation (4). M f obs , H f obs , and U f obs , which indicate the observed frequencies of the dyads, can be easily calculated from the dyad counts.
Inferring RCP point estimates and confidence intervals. The RCP point estimate of a data set is calculated directly from the conversion-error-corrected observed dyad frequencies. We determine the 95% confidence interval for RCP as the interval that includes all values of RCP for which the natural log likelihood lies within Although bias is just as much a concern with the assumption of independent sampling of dyads, we did not perform a bias correction for the data from Zhao et al. [15], because without bootstrapping, we lacked a simple method for bias estimation. Nonetheless, our observations from our analyses of other data sets suggest that most of these samples are likely not severely affected by bias. From bootstrapping of smaller data sets collected by our lab and by Arand et al. [14,17], we have observed that the asymmetry in the distribution is small in the lower ranges of RCP (1∼10), but greater as RCP increases. Therefore, bias is likely to be small for most samples presented by Zhao et al., for which RCP point estimates rarely exceed 10. For data sets presented by Zhao et al. that yielded RCP estimates greater than 10, bias are unlikely to affect our analyses and the general conclusion that their methylation behavior is characterized by strong preference for concordance.
Assessing whether RCP values differ significantly between data sets. To compare the RCP values between two data sets while assuming independence among all dyads, we can use a likelihood approach, comparing a model in which two true RCP values are required to described the two data sets to an alternate model in which both data sets can be explained by a single RCP value. We implemented that test as follows: Under the assumption of large sample of dyads, D is approximately χ 2 distributed with 1 degree of freedom.
Assessing whether a data set has RCP value greater than 1. We again take a likelihood-based approach as we did in the section above. Here, the null model states that the system operates under the specified RCP value. The alternate model states that the system operates under another RCP value.
Under the assumption of a large sample of dyads, D is approximately χ 2 distributed with 1 degree of freedom.
Defining the 95% confidence region for a data set in the (m, U ) space. We determine the 95% confidence region in the space of two parameters -here m and U -as the region that includes all proposed pairs of parameter values for which the natural log likelihood lies within χ 2 0.95; df =2 = 3.00 units of the maximum natural log likelihood point estimate [48].

S9 Text: Could Spontaneous Differentiation of a Subset of ES
and iPS Cells Substantially Influence the Inference of RCP?
One possible explanation for the inference of conservative processes operating in cultured, undifferentiated cells is that these cells may in reality be a mixture of differentiated and undifferentiated cells. We calculated how large the differentiated subpopulation would need to be under this scenario, given that the subpopulation operated with the RCP inferred for the corresponding differentiated cells. The remainder of the population was assumed to operate at RCP of 1, per the alternate hypothesis. We allowed specification of m for each of the two populations.
Let p 1 and p 2 represent the proportions of the two putative subpopulations that we wish to estimate, such that p 1 + p 2 = 1. We start with RCP and m for each of the two subpopulations, which we denote by RCP 1 , RCP 2 , m 1 , and m 2 . We can find U 1 and U 2 using the RCP and m values. We have the following set of equations: Using these equations and the observed overall RCP value, and the expression for RCP given m and U (Equation (3)), we can find p 1 and p 2 .
Assuming that m of the undifferentiated subpopulation is at least 0.2, we found that over half of the cells would need to be differentiated if methylation in the remaining 32/33 undifferentiated cells operates at RCP of 1. Morphological inspection of the cultured human stem cells did not suggest the presence of a substantial subpopulation of differentiated cells. Additionally, this calculation is not critical for the work presented here, because RCP values for totipotent and pluripotent stem cells from murine embryos are very similar to those inferred for all cultured ES and iPS cell lines that we examined. Such calculations may, however, be useful for future evaluation of cultured cells that are early in the process of differentiation. S10 Text: Comparing RCPs of Dnmt3 -Knockout Lines with Those of Wildtype Lines The relative contributions of conservative processes are expected to be more substantial when the fraction of methylation events achieved through maintenance-type activity is elevated through loss of one or both of the de novo enzymes. Indeed, for the Lep locus, we inferred RCP values of 8.11 (Dmnt3a KO) and 6.97 (Dnmt3b KO), both higher than the RCP value for the wildtype murine ES cell line (p = 0.008, Dnmt3a KO; p = 0.04, Dnmt3b KO) (Fig S2a; Table S3).
Indeed, at the Lep locus, we inferred RCP values of 8.11 for Dmnt3a KO cells, a value significantly higher than that for wildtype cells (p = 0.046, two-tailed PT; Fig S2a; Table S3). Although the point estimate was higher for Dmnt3b KO cells as well, the difference was not significant (p = 0.11, two-tailed PT).
Examination of data from Arand et al. [14] for a broader set of loci in cell lines with knockouts at one or both of the Dnmt3s, however, revealed a more complex role for these enzymes in shaping methylation concordance. Overall methylation levels were diminished in the absence of the de novo methyltransferases 3a and 3b, as expected (Fig S2b-i, Table S3); RCP values increased as expected for several, though not all loci. Results for cells that lacked only one of the two DNA methyltransferases were even more variable across loci (Fig S2, Table S3). In some cases, loss of a single Dnmt3 enzyme had the predicted impact of increasing RCP (L1 in 3a KO, p < 10 −6 , Fig S2h; L1 in 3b KO, p = 0.002, Fig S2h; IAP in 3a KO, p = 0.002, Fig S2g; IAP in 3b KO, p = 0.029, Fig S2g; two-tailed PTs), as was observed for Lep in our data. For one locus, B1, knockout of both of the Dnmt3s resulted in significantly increased RCP (p < 10 −6 , two-tailed PT), though neither of the single knockouts did (Fig S2f). Most surprisingly, for two loci, loss of either of the Dnmt3s decreased RCP (Igf2, p < 0.016, Fig S2c; Igf2, p < 10 −6 , Fig S2c; two-tailed PTs). These counterintuitive results likely reflect variation across loci in the roles of the individual DNA methlytransferases -and possibly the demethylation machinery -in shaping overall methylation levels across loci and categories of genic elements [49].