Human Genome Replication Proceeds through Four Chromatin States

Advances in genomic studies have led to significant progress in understanding the epigenetically controlled interplay between chromatin structure and nuclear functions. Epigenetic modifications were shown to play a key role in transcription regulation and genome activity during development and differentiation or in response to the environment. Paradoxically, the molecular mechanisms that regulate the initiation and the maintenance of the spatio-temporal replication program in higher eukaryotes, and in particular their links to epigenetic modifications, still remain elusive. By integrative analysis of the genome-wide distributions of thirteen epigenetic marks in the human cell line K562, at the 100 kb resolution of corresponding mean replication timing (MRT) data, we identify four major groups of chromatin marks with shared features. These states have different MRT, namely from early to late replicating, replication proceeds though a transcriptionally active euchromatin state (C1), a repressive type of chromatin (C2) associated with polycomb complexes, a silent state (C3) not enriched in any available marks, and a gene poor HP1-associated heterochromatin state (C4). When mapping these chromatin states inside the megabase-sized U-domains (U-shaped MRT profile) covering about 50% of the human genome, we reveal that the associated replication fork polarity gradient corresponds to a directional path across the four chromatin states, from C1 at U-domains borders followed by C2, C3 and C4 at centers. Analysis of the other genome half is consistent with early and late replication loci occurring in separate compartments, the former correspond to gene-rich, high-GC domains of intermingled chromatin states C1 and C2, whereas the latter correspond to gene-poor, low-GC domains of alternating chromatin states C3 and C4 or long C4 domains. This new segmentation sheds a new light on the epigenetic regulation of the spatio-temporal replication program in human and provides a framework for further studies in different cell types, in both health and disease.


Introduction
Understanding the role of chromatin structure and dynamics in the regulation of the nuclear functions including transcription and replication, is a major challenge of current research in genomics and epigenomics [1][2][3][4][5][6][7]. Since the initial sequencing of complete genomes and more than a decade ago of the human genome [8], the development of new techniques, in particular chromatin immunoprecipitation (ChIP) followed by massive parallel sequencing (ChIP-seq) [9], has enabled genome-wide analysis of many epigenetic modifications such as histone modifications, histone variant incorporation as well as of various DNA-binding proteins [6]. These techniques have been extensively applied to various eukaryotic genomes, from budding yeast [10], to plants [11,12], worm [13], fly [14,15], mouse [6,16,17] and human [6,[16][17][18], and have led to significant progress in our understanding of the chromatin landscape and its impact on gene regulation, replication origin specification and cell differentiation. Statistical analyses of these multivariate data sets have shown that this huge combinatorial complexity can be reduced to a surprisingly small number of predominant chromatin states with shared features namely four in Arabidopsis thaliana [19], five in Caenorhabditis elegans [20] and four [21] or five [22] in Drosophila. To our knowledge, no such a drastic dimensional reduction has been reported in mammalian organisms so far. The application of a multivariate Hidden Markov Model (HMM) [23] as well as the implementation of adapted pattern-finding algorithm [24], have confirmed that distinct epigenetic modifications often exist in well-defined combinations corresponding to different genomic elements like promoters, enhancers, exons, repeated sequences and/or to distinct modes of regulation of gene expression such as actualy transcribed, silenced and poised [23][24][25][26]. Some recent study [27] of chromatin mark maps across nine different human cell types has ultimately identified fifteen main chromatin types which is a relatively limited number of epigenetic states but probably not the optimal complexity reduction one may achieve in human and more generally in mammalian genomes. The analysis of a wide set of chromatin regulators that add, remove or bind histone modifications reported in Ref. [28], is a very encouraging step in this direction since six major groups or modules of chromatin regulators were shown to encompass the combinatorial complexity and to be associated with distinct genomic features and chromatin environments.
How epigenetic mechanisms and gene expression coordinate with DNA replication has been a long-standing question [1][2][3][4][5][6]. On the contrary to bacteria, yeast and viruses, the genomes of multi-cellular eukaryotes have no clear consensus DNA sequence element associated with replication initiation [29,30]. Metazoan genomes duplicate through the coordinated activation of hundreds to thousands of replication origins that can be extremely sitespecific or poorly defined with a broad site specification [31]. Indeed more origins are prepared in G1-phase than actively needed in S-phase [32]. Epigenetic mechanisms very likely take part in the spatial and temporal control of origin usage and efficiency in relation with gene expression [32][33][34][35][36][37]. For many years, elucidating the determinants that specify replication origins has been hampered by the very limited number of well established origins in human and more generally in mammals (a few tens versus a few ten thousands expected) [4,32,36,38]. Only very recently, nascent DNA strands synthetized at origins were purified by various methods to map replication origins genome-wide in different eukaryotic organisms including Arabidopsis thaliana [39], Drosophila [40], mouse [40,41] and human [18,[42][43][44][45][46][47]. Despite some inconsistency or poor concordance between certain of these studies [4,48], some general trends have emerged confirming the correlation of origin specification with transcriptional organization [3,4,32]. The set of replication origins identified so far are strongly associated with annotated promoters and seem to be enriched in transcription factor binding sites [43,44,49] and in CpG islands [40,41,43]. However a significant proportion of origins do not seem to be controlled in the same way as gene transcription since they are in regions void of DNase-I-hypersensitive sites (DHSs) and of histone marks found at active promoters [3,43]. Interestingly, it has been recently reported that replication origins may contain specific nucleotide sequences. Actually G-rich consensus motifs were shown to be associated with Drosophila, mouse and human origins [40,47,50]. These analysis have opened new perspectives towards the identification of mechanisms governing origin selection in mammals.
The recent blooming of genome-wide mean-replication timing (MRT) data in yeast [51], plants [52], worm [13], fly [53,54], mouse [55][56][57] and human [58][59][60][61] has provided the opportunity to establish links between the spatio-temporal program of replication, transcription and chromatin structure [3][4][5][6]62]. It is now well established that in higher eukaryotes, there is a significant correlation between GC-rich and gene-rich regions replicating early in the S-phase and in between AT-rich and gene poor regions replicating late [55,58,62]. But recent studies in mammals [56,59] and Drosophila [63], have shown that during differentiation, some genes change expression without change in MRT and vice versa, thereby indicating that transcription is not the only controlling factor and that the epigenetically regulated chromatin structure is likely to be part of the game [3,4,6,62]. In good agreement with previous studies in Drosophila [22,63], genomewide MRT profiles along mouse and human chromosomes in different cell lines reveal a correlation with epigenetic modifications [64]. Early replicating regions tend to be enriched in open chromatin marks H3K4me1, H3K4me2, H3K4me3, H3K36me3, H4K20me1 and H3K9 and H3K27 acetylation, whereas late replicating zones are mostly associated with H3K9me2 and to a lesser extent with H3K9me3 [56,65]. Importantly, the dynamic changes in MRT observed during development come along with some subnuclear repositioning [56,57,[65][66][67][68][69], early replicating euchromatin domains being generally at the interior of the nucleus whereas late replicating heterochromatic domains are more peripheral or near nucleoli [69][70][71][72][73]. Recent experimental studies of long-range chromatin interactions using chromosome conformation capture techniques [65,[74][75][76] have confirmed that 3D chromatin tertiary structure plays an important role in regulating replication timing. In particular, replicon size, which is dictated by the spacing between active origins, correlates with the length of chromatin loops [37,77,78]. But as questioned in Refs [76,79,80], the dichotomic picture proposed in early studies [65,74,75], where early and late replicating loci occur in separated compartments of open and closed chromatin respectively, is somehow too simple as previously questioned in a detailed analysis of replication fork velocity [79]. Identifying the epigenetic chromatin regulators of the spatio-temporal program of DNA replication will be a formidable step towards understanding the so-called replicon and replication foci [71,[81][82][83][84] in relation with their transcription counterpart, the transcription factories [71,[84][85][86].
Here we perform principal component analysis (PCA) [87] and classical clustering [88] on thirteen epigenetic mark maps in the K562 immature myeloid human cell line (the results of a similar analysis for the lymphoblastoid GM12878 cell line are reported in the Supplementary Data) at the resolution 100 kb of corresponding available MRT data, with the perspective of identifying the major types of chromatin states in relation with replication timing during S-phase. For this comparative analysis, we use as a guide the so-called replication U-domains that were shown to cover about half of the human genome for 7 different human cell types including ES, somatic and HeLa cells [80,89]. In these megabasesized domains, the MRT has a characteristic U-shape with early initiation zones at the borders and late replication at centers. Remarkably a significant overlap is observed between these replication U-domains in different cell types and also with the so-called skew N-domains [90][91][92], where the compositional skew profile accumulated in the germline can be decomposed into a replication-associated linearly decreasing component that shaped as a N [92][93][94] and a step-like transcription associated component that increases in magnitude with transcription and changes sign with gene orientation [92,93,[95][96][97]. From the demonstration that the average replication fork polarity is directly reflected by both the compositional skew and the derivative of the MRT [80,98,99], it has been argued that the experimental observation of a MRT derivative that behaves as a N inside replication U-domains is the signature of a progressive inversion of replication fork polarity. These large-scale gradients of replication fork polarity in somatic

Author Summary
Previous studies revealed spatially coherent and biologicalmeaningful chromatin mark combinations in human cells. Here, we analyze thirteen epigenetic mark maps in the human cell line K562 at 100 kb resolution of MRT data. The complexity of epigenetic data is reduced to four chromatin states that display remarkable similarities with those reported in fly, worm and plants. These states have different MRT: (C1) is transcriptionally active, early replicating, enriched in CTCF; (C2) is Polycomb repressed, mid-S replicating; (C3) lacks of marks and replicates late and (C4) is a late-replicating gene-poor HP1 repressed heterochromatin state. When mapping these states inside the 876 replication U-domains of K562, the replication fork polarity gradient observed in these U-domains comes along with a remarkable epigenetic organization from C1 at U-domain borders to C2, C3 and ultimately C4 at centers. The remaining genome half displays early replicating, gene rich and high GC domains of intermingled C1 and C2 states segregating from late replicating, gene poor and low GC domains of concatenated C3 and/or C4 states. This constitutes the first evidence of epigenetic compartmentalization of the human genome into replication domains likely corresponding to autonomous units in the 3D chromatin architecture. and germline cells initiate from early initiation zones, also called ''master'' replication origins [100,101], at U/N-domain borders that were found to be hypersensitive to DNaseI cleavage, to be associated with transcriptional activity and to present a significant enrichment in the insulator-binding proteins CTCF, the hallmarks of localized (,200-300 kb) open chromatin structure [80,101]. The analysis of chromatin interaction HiC [80] and 4C [76] data have revealed that these replication U/N-domains indeed correspond to high-order self-interacting chromatin units. The additional observation of a remarkable gene organization inside these domains with a significant enrichment of expressed genes nearby the bordering ''master'' replication origins [92,102] sheds light on these U/N-domains as regions of highly coordinated regulation of transcription and replication by the chromatin structure. These structural and functional units are conserved in mouse [91,92] and are robust to chromosome rearrangements [103] which indicates that they are likely to be a major determinant of genome evolution [104].

Combinatorial analysis of chromatin marks
We investigated relationships between the genome-wide distributions of eight histone modifications, one histone variant and four DNA binding proteins in the immature myeloid human cell line K562 (Materials and Methods) at the 100 kb resolution of corresponding MRT data [61,80]. As a first step, we computed the Spearman correlation coefficient of each mark with each other. We next represented the resulting matrix as a heat map after having reorganized rows and columns with a hierarchical clustering based on the Spearman correlation distance (Equation 1, Fig. 1). This preliminary analysis was very promising as regards to the possibility of reducing combinatorial complexity. All the epigenetic marks that are known to be involved in transcription positive regulation, namely H4K20me1, H3K9me1, H3K4me3, H3K27ac, RNAPII, CBX3, H2AZ, H3K79me2, H3K36me3, together with the transcription factors CTCF and Sin3A, form a block in the correlation matrix, meaning that they are all correlated with each other. The maximum correlation is actually obtained between the two active promoter marks H3K4me3 and H3K27ac. As suggested in Refs [27,105], all these active marks are likely to occupy similar regions in the genome. In fact, two lines are clearly apart on the hierarchical clustering dendrogram (Fig. 1). They correspond to the repressive chromatin marks H3K27me3 and H3K9me3 that are respectively associated with the so-called facultative and constituve heterochromatins [105,106]. These two marks are recognized by the chromodomains of polycomb (Pc) proteins and heterochromatin protein 1 (HP1), respectively, components of distinct gene silencing mechanisms which likely explains that they are strongly anticorrelated with each other. While H3K9me3 behaves quite independently with respect to most of the active chromatin marks, H3K27me3 correlates to some of them and especially to H4K20me1, H3K9me1 and CTCF. When further investigating the correlations between the thirteen considered chromatin marks and the MRT (Fig. 1), we found, consistently with previous works [56,59,61,64,65], a strong correlation for the transcriptionally active marks with early replication. Some moderate correlation was obtained for the Pc associated repressive marks H3K27me3 which contrasts with the significant anticorrelation observed for the constitutive heterochromatin mark H3K9me3 with late replication.
In a second step, to objectively identify the prevalent combinatorial patterns of the thirteen chromatin marks, we performed a PCA [107] to reduce the dimensionality of the data (Materials and Methods). We then concentrated on the first three principal components, which together account for 76% of the total data set variance ( Supplementary Fig. S1). By projecting the 100 kb genomic loci on the (PC1, PC2) plane ( Fig. 2A) and the (PC3, PC2) plane (Fig. 2B), we noticed that four areas contain most of the population. On the (PC1, PC2) plane, a large area of medium density comes out from a plane of much higher density. As viewed on the (PC3, PC2) plane, in this very dense plane, loci mainly lie along two straight lines with a very high density of loci concentrated at the intersection of these lines. This led us to use the Clara clustering algorithm [88], which is very similar to kmeans, with the number of clusters fixed to four (Materials and Methods). When labeling each of the four main chromatin states with a color, we obtained four domains in the 3D scatter plot (Fig. 3A) that have common boundaries as evidenced on the three orthogonal projections on the planes (PC1, PC2) (Fig. 3B), (PC1, PC3) (Fig. 3C) and (PC3, PC2) (Fig. 3D). To improve the quality of our clustering procedure, we filtered out poorly clustered data points that are closer to another cluster than to the one they belong to (black dots in Fig. 3), where the distance between a data point and a cluster is defined as the mean of the distances of this point to all the points in the cluster. Removing those points is exactly equivalent as removing points with a negative silhouette [108] (Materials and Methods).
To determine the number of clusters, we used two statistical criteria (Materials and Methods). Four is the optimal choice according to the within-cluster sum of squares that clearly displays an elbow (abrupt slowing down of the decay) at the cluster number equal to four (Fig. 3B). The gap statistic [109] indicates that two or four clusters are good solutions (Fig. 3C). Our choice of four main chromatin states (Fig. 3A) can thus be seen as an attempt to test the limits of the classical dichotomic picture [65,74,75] of two chromatin states, one open (euchromatin) and another one closed (heterochromatin) ( Supplementary Fig. S2A).

Epigenetic content of the four prevalent chromatin states
The four prevalent chromatin states so identified and further labeled C1, C2, C3 and C4, were respectively found in 6572 (23.8%), 5312 (19.2%), 6603 (23.9%) and 6758 (24.4%) among the 27656 100 kb loci with a defined MRT (Materials and Methods). Indeed, we removed from the analysis the 2411 (8.7%) loci that were not properly classified in any chromatin state. More than 90% of the loci in C1 are associated (positive enrichment) with the histone modifications H3K36me3, H3K4me3, H3K27ac and H3K79me2, the hallmarks of transcriptionally active chromatin ( Fig. 4) [2,6,105], as well as of the loci associated with RNA Polymerase II (Fig. 5) and the RPD3-interacting protein SIN3A (Fig. 5) as previously found in active euchromatin in Drosophila [22]. The majority of C1 loci are marked by H3k9me1 loci consistently with the observation of higher H3K9me1 levels in active promoters [105], and also contains the histone variant H2AZ whose binding level was shown to correlate with gene activity in human [105] (Fig. 4). C2 is notably associated with the histone modification H3K27me3 (Fig. 4), hence corresponds to a Polycomb repressed facultative heterochromatin state [105,106]. Out of the four main chromatin states, C3 corresponds to 100 kb loci that are not enriched for any available marks. C3 can be compared to the ''null'' or ''black'' silent heterochromatin regions previously found in Drosophila [21,22] and Arabidopsis [19] as covering a significant portion of the genome. C4 corresponds to the classic HP1-associated heterochromatin state with all of the 6603 C4 100-kb-loci containing the H3K9me3 mark and almost only that repressive mark (Fig. 4) [105,106].
Methylation of H3K9 is well known to be implicated in heterochromatin formation and gene silencing [2]. The fact that H3K9me1 is found almost equally in C1 and C2 and not in C4 ( Fig. 4), confirms that this epigenetic modification may also be associated with transcriptional activation [105]. H3K9me3 is found in all C4 100-kb-loci as the probable signature of its ability to anchor the heterochromatin protein HP1 at the origin of the establishment of heterochromatin. But H3K9me3 is not exclusively found in C4 loci; indeed 75% of C1 loci and 50% of C2 loci contain some H3K9me3 marks (Fig. 4). In the transcriptionally active state C1, H3K9me3 is present in combination with all active marks which might conduct in the anchoring of the c isoform of the HP1 protein [110][111][112][113], also called CBX3 (Fig. 5), which was recently shown to help the splicing of multiexonic genes [114,115].
The insulator-binding protein CTCF is known to establish chromatin boundaries to prevent the spreading of heterochromatin into transcriptionally active regions [116,117]. Consistent with the idea that CTCF-bound insulators prevent heterochromatin to invade genic regions, we found in good agreement with previous observation in Drosophila [21,22] that CTCF is contained in C1 loci and to a slightly less extent in C2 loci (Fig. 5).
Despite the original association of H4K20 methylation with repressive chromatin [2], H4K20me1 was recently shown to strongly correlate with gene activation [105]. In particular when combined with H3K36me3 and H2BK5me1, this mark was found at highly expressed exons near human gene 59-ends [118]. The high level of H4K20me1 found in C1 ( Fig. 4) is quite consistent with these observations. However, we observed the same level of H4K20me1 in C2 which is silent. This suggests that this mark is not uniquely linked to transcription activation. Interestingly, recent works have confirmed that PR-Set7 involved in the deposition of H4K20me1 plays an important role in the control of replication origin firing in mammalian cells [119][120][121].
To assess the generality of the four prevalent chromatin states, we ran the same clustering procedure on the lymphoblastoid cell line GM12878 and on a third blood cell line (Monocyte CD14z, Monocd14ro1746). The same four main chromatin states emerged in the three cell lines (Supplementary Figs S7, S9, S10, S11). Hence the chromatin organization in four chromatin states is shared by at least several somatic human cell lines.

Chromatin states are replicated at different times during S phase
This classification into four main chromatin states of the human genome shows strong similarities with those recently reported in Arabidopsis [19] and Drosophila [21,22] suggesting the possible existence of some simple principles of epigenetic compartimentalization of eukaryotic genomes. However, what our study reveals with respect to previous works, is a strong correlation between these chromatin states and MRT (Fig. 6). C1, C2, C3 and C4 actually have significantly different MRT probability distribution functions (Fig. 6A) with a clear shift from early to late replicating as evidenced by the cumulative distribution functions (Fig. 6B). By applying a wilcoxon test to each pairs of chromatin states, we did verify that the p-value was infinitesimal. The transcriptionally active euchromatin state C1 replicates early in S phase consistent with previous analysis of open chromatin marks in human and mouse [56,59,61,62,64,65]. The Pc-repressed facultative hetero-chromatin state C2 is replicated slightly later in mid-S phase which corroborates the recent finding of an association of H3K27me3 with mid-replicating chromosomal domains in human fibroblast [106]. This rather clear observation contrasts with previous contradictory results concerning the existence of high correlation between late replication and this repressive chromatin mark [65,122]. The silent unmarked chromatin state C3 replicates later than C2 but before the HP1-associated heterochromatin state C4 that replicates very late almost at the end of S phase (Fig. 6). As previously reported in Drosophila [22,63], these results confirm the existence of a strong link between epigenetic chromatin states and MRT in human. They further suggest that the epigenetically controlled chromatin structure has some impact on the normal progression of S-phase.

Chromatin states are different functionally
To address the question of the gene content of these four prevalent chromatin states, we used a data set of 23818 genes that are spatially distinct (Materials and Methods). Some of these genes (3001) were not taken into account in our analysis because their promoter don't belong to any chromatin state. The mean density of the 20817 genes that belong to one of the four chromatin states is 8.24 promoters per Mb. The only chromatin state that is highly enriched in gene promoters is the early replicating euchromatin state C1 that harbours 62.0% of gene promoters even though it represents about 25% of the total genome coverage by the four chromatin states (Table 1 and 2). The mid S facultative heterochromatin state C2 also contains a non negligible percentage (19.6%) of gene promoters that indeed corresponds to a modest density 7.7 promoters/Mb as compared to 19.1 promoter/ Mb found in C1. The late replicating unmarked and constitutive heterochromatin states C3 and C4 are genuinely gene deserts with very low gene densities 4.1 promoters/Mb and 1.8 promoter/Mb respectively. The mean gene length increases gradually from C1 to C4 going from 42.5 kb to 133.1 kb (Table 1). This discrepancy in gene length explains why the gene coverage decreases less abruptly than the promoter density, with C1 mainly genic (62.9%), C2 modestly genic (49.8%) and C3 (39.5%) and C4 (29.3%) mostly intergenic.
To investigate gene expression in chromatin states, we used a data set of 17872 genes with a valid expression value in K562 (Materials and Methods). Of those genes, 15869 belong to one of the chromatin states. We found that a vast majority of expressed genes with a RPKMw1 (Equation 7) are in the early replicating euchromatin state C1 (Fig. 7B), which confirms the link between MRT and expressed gene density previously reported in mammals [55,58,59,61]. As expected, most of the genes in the facultative Pc repressed heterochromatin state C2 are non expressed. Interestingly, we found that the density of non expressed genes in C1 is equivalent to the one in C2, indicating that it is more the predominance of active genes that characterizes early replicating regions than the absence of repressed genes. This explains why the correlation between MRT and gene expression is stronger if one considers the expressed gene density (R~0:58, Pv2:10 {16 )) than the mean expresssion (R~0:24, Pv2:10 {16 ) as previously observed in Drosophila [54]. Indeed in C1 the mean gene expression level is lowered by the presence of a non negligible set of non-expressed genes. The few genes in the heterochromatin states C3 and C4 are silent except a minority of them.
We assessed gene function on the basis of gene ontology [123]. We analyzed the genes in each chromatin states according to their biological process ( Supplementary Fig. S3  enrichment p-value using the Hypergeometric distribution and used the odd ratio value to determine if the deviation from expected number of genes for the considered GO terms was an enrichment (odd ratiow1) or a depletion (odd ratiov1). As previously observed for gene expression, these GO terms provide some clear discrimination between genes in the early replicating transcriptionally active euchromatin C1 and genes in the repressed heterochromatin states C2, C3 and C4. Genes enriched in C1 are almost systematically depleted in C2, C3 and C4, whereas on the opposite, genes that are depleted in C1 are enriched in at least one if not all the heterochromatin states C2, C3 and C4. We found C1 to be enriched mainly in housekeeping genes. The highest enrichments were obtained for the following process categories: mRNA processing, translation, ribosome biogenesis, DNA metabolic process, chromosome organization and segregation, cell cycle and cell division and for the corresponding component categories: ribosome, chromosome, nucleolus, nucleoplasm, nuclear envelope, mitochondrion and microtubule organizing center. The highly depleted process categories in C1 correspond to tissue specific genes that are not expressed in the immature myeloid K562 cell line as for example neurological system process, extracellular matrix organization, cell adhesion and cell motility, or that are defficient in these cancer cells like circulating system process [124,125].

Compositional content of chromatin states
Along the line of the isochore model [126], GC-rich and GCpoor regions were shown to match the cytogenic R and G bands and to correlate well with early and late replicating domains in mammals [8,127,128]. GC-rich regions correspond to regions of very high density of genes including the housekeeping genes and associated CpG islands. This also correspond to regions enriched in short inter-dispersed repetitive DNA elements (SINEs, Alu) [8].
In contrast, GC-poor regions are definitely poor in genes, predominantly tissue-specific genes containing rather large introns, but are relatively rich in long inter-disperse repetitive DNA elements (LINES) [8] that are significantly more abundant in these regions. Consistently, we found that the early replicating euchromatin state C1 has a GC content distribution shifted to higher values as compared to the unmarked and constitutive heterochromatin states C3 and C4 respectively (Fig. 8A). C1 is definitely GC-rich with an mean value GC~44:0% that is significantly higher than the genome average (GC~41:0%). On the opposite C3 and C4 are GC-poor with GC~39:3% and 36.7%, respectively. Surprisingly, the Pc repressed facultative heterochromatin state C2 has a GC content distribution similar to the one obtained for C1 (Fig. 8) with GC~44:0%. This means that if a high density of early replicating and highly expressed genes implies a high GC content, the reciprocal is not true. For example, C2 loci corresponding to 18% of the genome are GCrich (Fig. 8A) but gene poor ( Table 1) and most of these C2 genes are silenced by Pc proteins.
Cytosine DNA methylation is a mediator of gene silencing in repressed heterochromatin regions, while in potentially active open chromatin regions DNA is essentially unmethylated [129,130]. Methyl-cytosines being hypermutable, prone to deamination to thymines, CpG o/e ratio (Materials and Methods) is commonly used as an estimator of DNA methylation, the higher this ratio, the lower the methylation [101,131]. When computing CpG o/e after removing the CpG islands (CGIs) that are short unmethylated regions rich in CpG, in the four chromatin states, we found a significant shift of the CpG o/e pdf to smaller values when going from C1 (CpG o=e~0:202) to C2 (CpG o=e~0:195), C3 (CpG o=e~0:164) and C4 (CpG o=e~0:156) (Fig. 8). Thus relative to the genome average value CpG o=e~0:177, the early replicating transcriptionally active euchromatin state C1 is clearly hypomethylated. The mid-S repressed facultative heterochromatin state C2 is also, but at a lesser extent, less methylated than the entire genome. As expected the late replicating unmarked and constitutive heterochromatin states C3 and C4 are definitely methylated, the later being significantly more methylated than the entire genome. Thus the differences in CpG o/e (Fig. 8B) and MRT (Fig. 6A) observed in the four chromatin states C1, C2, C3 and C4, explain the significant correlation observed genome wide between methylation and replication timing (R~0:402, Pv2:10 {16 )) [101].
Note that chromatin state compositional content in Mono-cd14ro1746 is quite the same as in K562 (Supplementary Fig.  S11). In constrast, C3 and C4 in GM12878 have exchanged their GC and CpGo/e distributions ( Supplementary Fig. S9). Interestingly, this phenomenon is paired with C3 becoming more late in GM12878 than C4 (Supplementary Fig. S9). This observation suggests that the genomic regions that replicate late in S phase are more likely specified by sequence features than by epigenetic features. However, the GC content cannot be the primary determinant of MRT for C1 and C2 states. Indeed the GC distributions in C1 and C2 are nearly the same (Fig. 8A, Supplementary Fig. S9A and S11A) whereas a great discrepancy is observed in the MRT distributions (Fig. 6, Supplementary Fig.  S8 and MRT data non available).

Repartition of chromatin states along human chromosomes
Once mapped on the genome (Fig. 9A,B), the four prevalent chromatin states differ not so much in the genome coverage but mainly in their number and length distribution of domains or blocks of adjacent 100-kb-loci in the same chromatin state (Table 2 and Fig. 9C). C1 and C2 chromatin blocks are more numerous but they are shorter with a mean length L L~275 kb and 228 kb respectively. Their length pdfs do not reveal many domains larger than 1 Mb. C3 chromatin blocks are slightly less numerous and also mostly short, the larger mean length L L~325 kb resulting  For each chromatin state, the following information is given:  from the existence of a few large C3 streches of several Mb length. The C4 block length pdf definitely differs from the previous ones by the presence of a fat tail. Not only the mean length L L~718 kb is about three times the ones of C1, C2 blocks, but most of the C4 domains exceed 1 Mb up to 5 Mb and more, hence they are less numerous (Fig. 9C). This observation is quite consistent with the HP1-associated classical heterochromatin spreading mechanism and its possible association with the nuclear envelope [3,6].
When looking at the distribution of chromatin states along human chromosomes (Fig. 9A,B), there is a clear evidence that C1, C2, C3 and C4 blocks are not distributed independently. In large regions with MRT = 0.4, short C1 and C2 blocks intersperse with each other, the C1s being the earliest ones (e.g from 158 to 161 Mb in Fig. 9A). In a few 100 kb wide regions of MRT^0.6, C3 blocks are observed with a repressive effect (e.g around 156 Mb in Fig. 9A where chromosome 1 contains a lot of olfactory receptor genes).     Fig. 9A). This MRT dependent spatial organization of chromatin states prompted us to investigate neighborhood dependency between 100 kb loci. The obtained transition matrix ( Table 3) confirms that C4 loci have by far the highest probability (0.85) to have a C4 neighbor consistent with C4 blocks being much longer than the other chromatin state blocks ( Table 2 and Fig. 9C). It also quantifies the fact that C1 loci (and in turn blocks) have a much higher probability to have a neighbor that is a C2 locus (block) than a C3 or C4 locus (block) and vice-versa. This is consistent with the fact that C1 and C2 are likely to be replicated one after each other in early and mid S phase whereas C3 and C4 are replicated much later (Fig. 6). Consistently C4 loci (blocks) have a highest probability to have a neighbor that is a C3 locus (block) whereas C3 loci (blocks) have apparently no special preference. The spatial organization of chromatin blocks suggests that we can associate C1+C2 on one side and C3+C4 on the other side ( Supplementary Fig. S2B) resulting in large-scale blocks of surprisingly very similar length distributions (Fig. 9D) with fat tails and respective means 779 kb and 808 kb. These mega-base long C1+C2 and C3+C4 chromatin blocks would on average be replicated rather early (Fig. 9E) and late (Fig. 9F), respectively. Importantly, fixing the number of chromatin states to two in our PCA and cluster analysis does not result in the same dichotomic picture (Supplementary Fig. S2A). Instead we discriminate the active chromatin state C1 from a composite silent state C2+C3+C4 (Supplementary Fig. S2B) Note that when using the so-computed transition matrix between chromatin states (Table 3) to generate randomly synthetic chromosomes, we obtained very good predictions for the four chromatin state block mean lengths (Table 2). However the corresponding sample standard deviations so predicted are significantly smaller than the ones computed for the genuine human chromosomes which is an indication that the succession of chromatin states along human chromosomes is probably governed by a more global and elaborated underlying segmentation process.

Distribution of chromatin states inside replication timing U-domains
When concentrating our study on the 876 replication timing Udomains previously identified in K562 cells [80], we revealed some remarkable organization of the four prevalent chromatin states (Fig. 10A). The highly expressed gene rich euchromatin state C1 is found to be confined in a closed (, , 150kb) neighborhood of the ''master'' replication origins that border each individual Udomains (Fig. 10A). As confirmed on the mean occupation profiles obtained for four U-domains size categories (Fig. 10 E, F, G, H), this confinement is independent of the U-domains size and consistent with the previous observation [80,101] that U/Ndomain borders are significantly enriched in DNase I hypersensitive sites and in insulator-binding proteins CTCF. C1 can thus be seen as specifying the early initiation zones that border U-domains and that were further shown [80] to delimit topological domains on genome-wide (Hi-C) chromatin state conformation data. The Pc repressed heterochromatin state C2 is mostly found at finite distance (,200-300 kb) from U-domain borders as clearly seen on the largest U-domains whose centers are drastically devoided of C2 loci (Fig. 10B,H). In small U-domains (v1:2Mb), C2 occupies in majority their centers (Fig. 10E,F) that are replicated in mid-S phase. U-domain borders are also significantly depleted in unmarked and constitutive heterochromatin states C3 (Fig. 10C) and C4 (Fig. 10D), respectively. C3 is already present in the center of small U-domains (Fig. 10E,F) and homogeneously occupies large U-domain centers (Fig. 10G,H). C4 is significantly found in the center of U-domains that are larger than 1 Mb; C4 spreads and becomes predominant when increasing the size of U-domains beyond 1.8 Mb (Fig. 10G,H). These results show that the replication ''wave'' starting from the early initiation zones at Udomain borders and propagating inside U-domains during Sphase with the progressive activation of secondary replication origins [79], actually corresponds to a directional path through the four prevalent chromatin states C1, C2, C3 and ultimately C4 in the largest U-domains. This gradient of chromatin structure, from active openess at U-domain borders to closeness at U-domain centers via intermediate Pc repressed and unmarked heterochromatins is likely to be a key ingredient in the long-range chromatin control of the spatio-temporal replication program that underlies the megabase-sized replication fork polarity gradients observed in about 50% of the human genome [79,80].

Conclusion/perspectives
In summary, this integrative analysis of epigenetic mark maps in the immature myeloid human cell line K562 has shown that the combinatorial complexity of these epigenetic data can be reduced to four prevalent chromatin states, one transcriptionally active open euchromatin state C1 and three distinct and silent heterochromatin states, namely a Pc repressed state C2, a unmarked silent state C3 and a HP1-associated constitutive state C4. By performing this statistical study at the (low) resolution  The first line is the probability of each chromatin state. The matrix below the first line is the Markov transition matrix between states (see Materials and Methods for its estimation). A value at the i th row and the j th column is the probability to find the chromatin state j in a 100 kb window next to a 100 kb window of chromatin state i. D corresponds to 100 kb windows that are not classified in any chromatin state. doi:10.1371/journal.pcbi.1003233.t003 100 kb of available genome-wide MRT data, we have found that these chromatin states actually replicate at distinct periods of the S-phase, C1 replicates early, C2 is a mid-S phase phase state whereas C3 replicates later than C2 but before C4 that replicates very late, almost at the end of S-phase. In the Supplementary Data are reported, for comparison, the results of a similar integrative analysis of epigenomic data in the lymphoblastoid cell line GM12878 (Supplementary Figs S6, S7, S8 and S9) and in the blood cell line Monocd14ro1746 (Supplementary Figs S10, S11), which confirm that the classification of the human epigenome in four main chromatin states likely summarizes the data in different cell types. Interestingly, these four main chromatin states display remarkable similarities with that found in different cell types in Drosophila [21] and Arabidopsis [19] at the resolution ,1 kb of gene expression data, suggesting the existence of simple principles of organization in metazoans as well as in plants [19][20][21][22]. When mapping these four chromatin states along the human chromosomes, our study reveals that the human genome can be segmented into megabase-sized domains of three different types with distinct spatio-temporal replication programs. In 50% of the human genome that are covered by the replication U-domains [80], the U-shape of the replication timing profile indicates that the effective replication velocity (which equals the inverse of the replication timing derivative [80,98]) increases from U-domain borders to centers [79] as the signature of an increasing origin firing frequency during S-phase [132]. Our results (Fig. 10) show that this acceleration of the replication wave is actually observed along a directional path through the four main chromatin states, the open euchromatin state C1 at U-domain borders successively followed by the heterochromatin states C2, C3 and C4 at the Udomain centers. To which extent this chromatin gradient influences fork progression from the ''master'' early initiation zones at U-domain borders and secondary origins activation inside U-domains is a key issue of current modeling [79,[133][134][135] of the spatio-temporal replication program in human and more generally in mammals. The complete analysis of the other half of the human genome that is complementary to U-domains is more in agreement with the traditional dichotomic picture proposed in early studies of the mouse [55][56][57] and human [59,65,75] genomes, where early and late replicating regions occur in separated compartments of open and close chromatin, respectively. About 25% of the human genome are covered by megabase sized GC-rich (C1+C2) chromatin blocks that on average replicate early by multiple almost synchronous origins with equal proportion of forks coming from both directions (Table 4). This absence of well-positioned origins explains that the skew has not accumulated in these gene-rich regions that were shown to be devoided of skew N-domains [90][91][92][93]. The last 25% of the human genome corresponds to megabase sized GC-poor domains of interspersed (C3+C4) heterochromatin states or of long C4 domains that on average replicate late by again multiple almost coordinated origins (Table 4). These gene-poor regions are also devoided of skew N-domains and can be seen as the late replicating counter-part of the gene-rich (C1+C2) regions. Extending this study to different cell types including ES, somatic and cancer cells looks very promising. By performing our integrative analysis at low (100 kb) and high (1 kb) resolutions in parallel, we should be in position to investigate the global reorganization of replication domains during differentiation (or disease) in relation to coordinated changes in chromatin state and gene expression. For example, this multivariate approach should shed a new light on the so-called replication domain ''consolidation'' phenomenon [56] that corresponds to the disappearance (EtoL transition) or appearance (LtoE transition) of a U-domain border during differentiation [80].The probable coordinated change in chromatin state at 100 kb resolution and the possible change at 1 kb resolution are likely to explain the possible change in gene expression. This opens new perspectives in the study of chromatin-mediated epigenetic regulation of transcription and replication in mammalian genomes in both health and disease.

Mean replication timing data and replication U-domain coordinates
Timing profiles for the immature myeloid cell line K562 and the lymphoblastoid cell line GM06990 were obtained from the authors [80]. The mean replication timing (MRT) is given for 27656 100 kb non-overlapping windows in hg18 coordinates. We also retrieved the coordinates of the 876 U-domains in K562 and 882 U-domains in GM06990 from the authors [80].

Epigenetic profile computation at 100 kb resolution
For each ChIP-seq data, we computed a profile at the 100 kb resolution for the 27656 non-overlapping windows for which MRT is defined. The read density for one antibody in a window is the number of reads in this window that fall in significantly enriched intervals normalized by the window length.

Rank transformation and Spearman correlation matrix
All statistical computations were performed using the R software (http://www.r-project.org/).
In order to compute the Spearman correlation matrix, the epigenetic profiles at 100 kb resolution were transformed with the R function rank with option ties.method = max. Then we computed the Pearson correlation matrix on the transformed dataset. To reorder the matrix in Fig. 1, we computed the Spearman correlation distance dSCor as: where SCor is the spearman correlation. Then, a dendrogram was computed using the R function hclust with option method = average and with dSCor as dissimilarity.

Principal component analysis
Principal component analysis was performed on the rank transformed dataset using the function dudi.pca from the R package ade4 (see http://pbil.univ-lyon1.fr/ADE-4 and Ref. [107]) with the option scale = TRUE (i.e. each variable is centered and normalized before the PCA computation). The first three components were retained which accounts for 76% of the dataset variance (see Supplementary  Fig. S1), and clustering was performed in this 3D space.

Clustering strategy
We used Clara algorithm [88] which is an optimization of kmeans for large data set. We used the clara function implemented in the R package cluster. The options were set to: stand = FALSE, sampsize = 500, samples = 20, metric = euclidean.
To assess the number of clusters, we used the pooled withincluster sum of squares around the cluster mean. Suppose that the data set of size n is divided in k clusters C 1 ,C 2 , . . . ,C k . Let d(x,y) be the euclidean distance between the points x and y. Let x x i be the mean of the i th cluster, then the within-cluster sum of squares for this cluster is: The pooled within-sum of squares for the k clusters is: The pooled within-cluster sum of squares necessarily decreases with the number of clusters. A good choice for the number of clusters is the critical point where some clear crossover is observed from a fast decrease of W k at small k values to a weak decrease of W k at large k values. This means that, after this critical point, no much information is gained by adding a new cluster. In our analysis this crossover occurs for k = 4 clusters (see Fig. 3B).
We also used the Gap statistic [109] which is defined by : E n (ln(W k )) is the expected value of ln(W k ) for a sample of size n drawn from a proper reference distribution. We choose, as a reference, a uniform distribution over the range of the observed data. A good choice for the number of clusters is a value of k so that W k is much smaller than the expected W k from a random distribution (i.e. a high value of Gap n (k)). Four clusters is also a reasonable choice according to the gap statistic index computed with R package clusterSim (see Fig. 3C).
Poorly clustered data points were removed from the set of chromatin states. The silhouette value [108] is a way to quantify how well a point is clustered.
Definition 1. Given a particular clustering, C 1 ,C 2 , . . . ,C k , of the data in k clusters, let i be a data point and d(i,C j ) the average distance of the data point i to the members of the cluster C j . Let i be a member of cluster C c and The silhouette value of the data point i is defined as: A silhouette value below 0 means that the data point is actually closer in average to the points from another cluster than to the one it has been assigned to. Points with a negative silhouette value are border line allocations. We decided to remove those points from the set of identified chromatin states. Hence chromatin states are groups (clusters) with homogeneous epigenetic features. 91% of all 100 kb non-overlapping windows of the human genome were assigned to one of the four chromatin states C1, C2, C3 or C4.

Markov transition matrix estimation
The number of transitions from i to j, n ij , is the number of 100 kb windows of state i contiguous to a window of state j (the sense or antisense orientation is not taken in account). Let n i be the number of windows in chromatin state i. The conditional probability of a transition from i to j given i is nij ni .

Annotation and expression data
As human gene coordinates, we used the UCSC Known Genes table. When several genes presenting the same orientation overlapped, they were merged into one gene whose coordinates corresponded to the union of all the overlapping gene coordinates, resulting in 23818 distinct genes.
Expression data were retrieved from the Genome Browser of the University of California Santa Cruz (UCSC). To construct our expression data set, we used RefSeq Genes track as human gene coordinates. Genes with alternative splicing were merged into one transcript by taking the union of exons. Hence the TSS was placed at the beginning of the first exon. We obtained a table of 23329 genes. We downloaded expression valuess from the release 2 of Caltech RNA-seq track (ENCODE project at UCSC: http:// hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEn-codeCaltechRnaSeq/). Expression for one transcript is given in reads per kilobase of exon model per million mapped reads (RPKM) [136]. RPKM is defined as: where C is the number of mappable reads that fall into gene exons (union of exons for genes with alternative splicing), N is the total number of mappable reads in the experiment, and L is the total length of the exons in base pairs. We associated 17872 genes with a valid RPKM value in K562.

CpG o/e computation and GC content
CpG observed/expected ratio (CpG o/e) was computed as nCpG L{l | L 2 nC nG , where n C , n G and n CpG are the numbers of C, G and dinucleotides CG, respectively, counted along the sequence, L is the number of nonmasked nucleotides and l is the number of masked nucleotide gaps plus one, i.e. L-l is the number of dinucleotide sites. The CpG o/e was computed over the sequence after masking annotated CGIs. The GC content was computed on the native sequence.

Chromatin state blocks
We detected contiguous windows of the same chromatin state (C1 to C4). We then kept the coordinates of the blocks of contiguous windows. To form chromatin state blocks of states (1+2), we merely detected contiguous windows of state 1 or 2. The same procedure was applied to define chromatin blocks of states (3+4). For chromatin blocks (1+2) and (3+4), we authorized the inclusion of isolated windows which don't belong to any chromatin state so to not disrupt very long blocks.

GO term enrichment
Each gene name of our annotation dataset was associated to several GO terms from GO SLIM (high level GO terms) using the online mapper: http://go.princeton.edu/cgi-bin/ GOTermMapper. Then for each chromatin state (C1 to C4), the number of occurrences of each GO term was determined by the number of promoters belonging to that state and associated to this GO term. The enrichment for each GO term in each cluster was tested using Fisher's exact test. We applied a procedure to control the false discovery rate (FDR) as described in [137]. The upper limit of the FDR was fixed to 20%. After detecting significant deviation from a random repartition of GO term occurrences, we used the odd ratio value to determine if the deviation was an enrichment (odd ratiow1) or a depletion (odd ratiov1).