The long-range interaction map of ribosomal DNA arrays

The repeated rDNA array gives rise to the nucleolus, an organelle that is central to cellular processes as varied as stress response, cell cycle regulation, RNA modification, cell metabolism, and genome stability. The rDNA array is also responsible for the production of more than 70% of all cellular RNAs (the ribosomal RNAs). The rRNAs are produced from two sets of loci: the 5S rDNA array resides exclusively on human chromosome 1 while the 45S rDNA arrays reside on the short arm of five human acrocentric chromosomes. These critical genome elements have remained unassembled and have been excluded from all Hi-C analyses to date. Here we built the first high resolution map of 5S and 45S rDNA array contacts with the rest of the genome combining over 15 billion Hi-C reads from several experiments. The data enabled sufficiently high coverage to map rDNA-genome interactions with 1MB resolution and identify rDNA-gene contacts. The map showed that the 5S and 45S arrays display preferential contact at common sites along the genome but are not themselves sufficiently close to yield 5S-45S Hi-C contacts. Ribosomal DNA contacts are enriched in segments of closed, repressed, and late replicating chromatin, as well as CTCF binding sites. Finally, we identified functional categories whose dispersed genes coalesced in proximity to the rDNA arrays or instead avoided proximity with the rDNA arrays. The observations further our understanding of the spatial localization of rDNA arrays and their contribution to the architecture of the cell nucleus.


Introduction
Ribosomal RNAs (rRNAs) are essential components of the cell, and are encoded in the 5S and 45S ribosomal DNA (rDNA) arrays of higher eukaryotes [1][2][3][4]. The 5S rDNA array resides on chromosome 1 and encodes the 5S rRNA, whereas the 45S rDNA array resides on five human acrocentric chromosomes and encodes the 18S, 5.8S, and 28S rRNA components of the ribosome [5][6][7]. The nucleolus, the first recognized nuclear organelle, is the site of 45S rRNA transcription [1,2,4,8]. The lack of homology between the 5S rDNA and the subunits of the 45S rDNAs arrays reflect deep evolutionary separation. For instance, RNA polymerase I is exclusively dedicated to the transcription of the 45S rRNA, while RNA polymerase III transcribes the 5S rRNAs and tRNAs. The distinct RNA polymerase machineries required for transcription of 5S and 45S subunits are a conserved feature of yeasts, plants, fruit flies, and humans. Furthermore, distance to the nucleolus is thought to be relevant for global gene expression. For instance, proximity to the nucleolus can in some cases promote inactivation of certain RNA polymerase II transcribed genes [9], although the observation has not been systematically tested across the genome. Finally, localization of the 5S array has been documented at the periphery of the nucleolus [9,10], but also away from the organelle [11], with a substantial fraction of cells showing 5S arrays that are localized elsewhere in the nucleus [10]. Uncovering physical contacts between the rDNA arrays and the rest of the genome can expand our understanding of nuclear architecture, nucleolar structure and function, and the mechanism of concerted copy number variation between 5S and 45S rDNA arrays. However, studies of nuclear architecture have largely excluded analyses of spatial interactions with the 5S and 45S rDNA arrays.
Ligation-capture Hi-C sequencing technology [12][13][14] enabled a revolution in our understanding of nuclear organization with the identification of hundreds of topologically associated domains (TADs). Human TADs span an average 900 KB each and display remarkable conservation with TADs identified in mice. TADs display, moreover, remarkable structural stability through development and when cells are perturbed in gene knockdown experiments [15,16]. On the other hand, deep sequencing of nucleoli led to the documentation of nucleoli associated DNA (naDNA) and the identification of nucleolus associated domains (NADs) [17][18][19]. While NADs display size variation spanning multiple orders of magnitude, they are generally large. NADs covering less than 0.1 MB are relatively rare with most NADs around 1 MB or larger. The domains encompass about 5% of the human genome, are represented in all chromosomes, and are now recognized to be stably associated with nucleoli. Analysis of rDNA interactions with Hi-C might provide a complementary approach to localize the rDNA in the nuclear space possibly informing nucleolar interactions with the genome at a different scale than those afforded by analysis of naDNA.
Here we addressed the landscape of long-range rDNA interactions with 16,482,743 reads identified from a total of >15 billion (15,165,355,427) Hi-C reads in five cell types and two cell lines. The data enabled a map of long-range rDNA interactions at 1MB resolution, and the identification of segments displaying statistically significant differential contact density between cells. The map yielded a number of observations and suggest that the 5S and 45S arrays are not as spatially close as typically expected, yet share significant overlap with common contacts elsewhere. Finally, the data uncovered functionally coherent categories whose dispersed genes either coalesce in proximity to the rDNA arrays or avoid proximity with the rDNA arrays.

Ribosomal DNA containing reads in Hi-C
We investigated human Hi-C data for two cell lines and five cell types; the two cell lines represent the most replicated human Hi-C datasets to date, yet yielded a relatively small number of rDNA informative reads. For instance, we mined 5,356,990,189 high quality Hi-C reads in LCL to identify 13,528,436 reads with at least one end mapped to the 45S rDNA and 105,147 reads with at least one end mapped to the 5S rDNA (S1 and S2 Tables). Similarly, for K562 cells, we mined 903,837,936 high quality Hi-C reads to identify 1,698,063 reads with at least one end mapped to the 45S rDNA and 47,691 reads with at least one end mapped to the 5S rDNA. This represents a 0.25% and 0.19% recovery rate of 45S rDNA reads in shotgun Hi-C in LCL and K562, respectively. These numbers were substantially larger than the meager 0.002% and 0.005% recovery rate for 5S rDNA reads in LCL and K562, respectively. Similar recovery rates were obtained with the other five cell types studied (Table 1). Overall, we uncovered 16,322,538 reads with at least one end mapped to the 45S rDNA and 160,205 reads with at least one end mapped to the 5S rDNA ( Table 1). The mining effort illustrates the challenge in recovering rDNA information in shotgun Hi-C experiments. Nevertheless, the data revealed that rDNA contacts are dispersed across the entire genome, with segments differing in the density of rDNA interaction. The maps also revealed that naDNA and rDNA-contacts are not overlapping domains and likely reflect different attributes of the nucleolus/rDNA (S1 Fig).

Ribosomal DNA contact maps at 1MB resolution
Here we partitioned human autosomes (Chr 1 to 22) into 2897 segments of 1MB, 2465 and 2658 of which had no evidence of containing a 5S or 45S pseudogene, respectively. Segments containing an rDNA pseudogene were disproportionately found adjacent to centromeric and telomeric regions, and were excluded from all further analyzes. Unsurprisingly, all 1MB segments across all chromosomes displayed evidence of rDNA contact (S1 and S2 Figs). Moreover, at the 1MB scale, we observed good reproducibility between replicates of a cell line using the same restriction enzyme as well as different restriction enzymes, with consistent results across biological replicates and across cell lines/cell types (S3, S4, S5, S6, S7, S8 and S9 Figs). Fig 1 illustrates the distribution of rDNA contact density for 1MB segments before normalization by sequencing effort. The data shows a 5-10-fold variation in the logarithm of the contact density across segments within a cell type. The mean difference in the average contact density among cells reflects variation in the amount of Hi-C data in each cell type. For 45S rDNA contacts all 1MB segments contained appreciable density of contacts in LCL and K562. However, the ESC and ESC-derived cell types (ESC set) displayed a truncated distribution with many segments that contained very few rDNA contacts ( Fig 1A). This was due to the lower number of Hi-C reads for those cells ( Table 1). The resolution was much worse for the 5S rDNA arrays ( Fig 1C). Therefore, the following analyses focused primarily in the data for LCL and K562 cell lines, with the ESC or ESC-derived cells mostly used for comparisons.

Differential ribosomal DNA contacts density at 1MB resolution
Here we first addressed variation in rDNA contact density across cell lines (LCL vs K562 data collected with the same enzyme and protocol). We found 808 segments of 1MB with  Table), whereas none is identified among biological replicates of LCLs processed with different enzymes (Fig 3). We observed that 350 DI segments displayed increased density in LCL and 458 segments displayed increased density in K562. Among those 808 DI segments, 302 of them displayed a greater than 2-fold difference in contact density between LCL vs K562. Similarly, nearly half of the 224 segments of 1MB in chromosome 1 showed evidence of DI density between LCL and K562 (  Table). Finally, we detected a meager 15 segments with evidence of differential DI between LCL and K562 for the 5S rDNA (FDR< 0.05); the small number of differential DI likely reflects the many fewer 5S rDNA reads and thus the much-lowered statistical power of this analysis. Similarly, there was not enough Hi-C data to enable statistical analysis of DI with the 5S rDNA among the five ESC related cell types. Landscape of rDNA contacts in Hi-C

Identification of rDNA-gene contacts
We identified 9,595 and 9,864 genes without evidence of a 5S or 45S rDNA pseudogene, respectively. The remaining genes were excluded from all further analyzes. The data showed a continuous distribution of rDNA-gene contact density for the 45S and 5S rDNA ( Fig 1B), with much better resolution for the 45S rDNA than for the 5S rDNA ( Table 2, Table 3). As expected, the rDNA-gene contact density was correlated with gene length. We have thus calculated the 45S contact density per gene per nucleotide ("Contacts per gene per nucleotide, CPGN"). This removed the correlations between gene length and 45S rDNA contacts and revealed that CPGN for the 45S rDNA arrays was strongly correlated between LCL and K562 (rho = 0.65; P < 0.001). This correlation was stronger than those between LCL and ESC (rho = 0.27; P < 0.001) or between K562 and ESC (rho = 0.34; P < 0.001). The lower correlations with ESC might partially reflect the lower resolution of the ESC contact map with a substantial fraction of genes showing less than 10 reads with rDNA contacts (Table 2). Indeed, although the overall amount of HI-C data was large, the resolution to ascertain 45S rDNAgene contacts was only sufficient for LCL and K562, the two biological sources with the largest number of Hi-C reads to date. The issue of low rDNA-gene resolution was particularly evident for the 5S rDNA. Out of 9595 genes analyzed for 45S rDNA arrays, there were 67 and 612 genes with zero 5S contacts in LCL and K562, respectively. For the ESC set, however, there were 1745 genes with zero contacts with the 45S rDNA arrays. Out of 9864 genes analyzed for 5S rDNA arrays, there were 5916 and 7494 genes with zero 5S contacts in LCL and K562,  Landscape of rDNA contacts in Hi-C respectively. For the ESC set, we observed that greater than 95% of the genes had zero 5S contacts. The density of 5S rDNA-gene contacts was most strongly correlated with gene length (rho > 0.3, P < 0.001), but calculating the 5S contact density per gene per nucleotide ("Contacts per gene, CPGN") removed the positive association. Among genes with at least one read showing 5S-rDNA contact in both LCL and K562 we found that CPGN is strongly correlated between LCL and K562 (rho = 0.64, P < 0.001). Evidence for a positive association between the density of 45S rDNA contacts and the density of 5S rDNA contacts is also observed in other partitions of the data, and across genes and 1MB segments in both LCL and K562 (Table 4).

Differential rDNA-gene contact density
Here we tested for variation in rDNA-gene contact density between LCL and K562. For the 45S array, we observed 731 genes with fold change in interaction density >2 for the LCL vs K562 comparison (experiments with the same enzyme and protocol); 97 genes (FDR < 0.05) displayed significantly different DI after multiple corrections (S10 Fig). For the analyses of 45S rDNA contacts variation among five ESC related cell types, we observed 435 genes with significantly differential density of rDNA contacts (FDR < 0.05). For the 5S array, we observed 954 genes with DI fold change >2 in the LCL vs. K562 comparison. However, none of these genes reached statistical significance, possibly due to the higher variance emerging from the low coverage and thus limited number of 5S contacts in each gene. There was not enough data for statistical analyses of variation in 5S rDNA contact among the five ESC related cell types.

Per nucleotide rDNA contact rates
Here we estimated contact densities per base pair in three ways. First, the average contact per base pair across the whole genome was calculated by dividing the total number of mapped rDNA-genome reads by the genome length (3 billion base pairs). The average contact rate is estimated as 4.8 x 10 −5 and 3.7 x 10 −3 contacts per base pair for the 5S and 45S rDNA, respectively (S5 Table). Hence, for the 45S rDNA each base pair in the genome is expected to have 0.37 mapped reads. Second, the average contact per base pair was estimated after filtering out Landscape of rDNA contacts in Hi-C bins with pseudogenes. Here we divided the total number of rDNA-genome reads within 1MB segments without a pseudogene by the total sequence length in those segments. This yielded an estimated average contact rate of 2.0 x 10 −5 and 1.7 x 10 −3 contacts per base pair for the 5S and 45S rDNA, respectively. These numbers are comparable with those estimates using all rDNA reads and the whole genome. Third, we estimated the average contact rate per base pair in protein-coding genes by dividing the total number of rDNA-gene reads by the total length of nucleotides within genes, after excluding genes with evidence of containing rDNA pseudogenes. This yielded an average contact rate for genic segments of 2.2 x 10 −4 and 0.016 contacts per base pair for the 5S and 45S rDNA, respectively (S5 Table). Collectively, these estimates of contact rate are useful in evaluating regions with putative enrichment or deficit in rDNA contacts.

rDNA contacts preferentially occur in close, repressed, late replicating domains
We examined the relationship between various genomic attributes and the density of rDNA contacts. First, the data showed a significant association between the number of 45S rDNA contacts and the A/B compartments. Specifically, the B compartment of closed chromatin displays an enrichment in rDNA contacts, whereas the A compartment of open chromatin displays a deficit of rDNA contacts (P < 0.01, Chi-square test; Fig 4). In addition, we examined 15 functional annotations; significant enrichments were observed in segments of repressive chromatin, as well as in segments annotated as repetitive or containing insulator regions (P < 0.01, Chi-square test; Fig 4; S11 Fig). Finally, we examined segments of CTCF binding; CTCF is a conserved 11-zinc finger DNA binding protein that regulates chromosome architecture [20]. Using the CTCF database we estimated that CTCF binding segments constitute <7.5% of the human genome. On the other hand, we observed that 37% and 29% of all 45S rDNA-genome reads overlapped a CTCF binding segment in LCL and K562, respectively. These figures are in good agreement with the 35% of all rDNA-genome reads that overlapped a CTCF binding segment in the ESC cell set. These represent a >4-fold enrichment that indicate a significantly higher percentage of 45S rDNA contacts with CTCF binding sites (P < 0.05, one proportion test).

Genes associated with rDNA CN variation are not enriched in rDNA contacts
We selected a small set of genes to be examined in greater detail. Specifically, we examined genes that are (i) known to regulate rDNA function or structure and/or (ii) whose expression are associated with rDNA CN variation [21][22][23]. For instance, the CTCF gene is located on Chr16 and displayed a meager 118 contacts with the 45S rDNA in LCLs, which is significantly lower (P-value < 0.001, one proportion test) than the expected 1198 contacts calculated based on the genome wide average contacts per base pair (1.56%) and the length of the CTCF gene. Thus, the CTCF gene appears to be in repulsion to the rDNA arrays. Similarly, CBX1(Hp1beta), Ubf1, and KDM4B had fewer hits than expected (P < 0.0001 for all of them, one proportion test). Thus, we examined the top 400 genes that are positively and negatively associated with rDNA CN variation in LCL [21]. Collectively, however, these genes were neither enriched nor depleted in rDNA contacts, with a distribution of contacts that is undistinguishable from all other genes in the genome (Fig 5). Nevertheless, nucleolar, mitochondrial, and ribosomal genes were also associated with variation in rDNA array CN [21], and could reveal a distinct pattern. Accordingly, genes that localize to the nucleolus as well as ribosomal genes showed a distribution of contacts that was significantly shifted towards a greater than average number of contacts with the rDNA array in both LCLs and K562 (Fig 5 and Fig 6).

Ribosomal and mitochondria-related genes are enriched in rDNA contacts
Next, we addressed if the higher density of rDNA contacts in nucleolar, ribosomal, and mitochondrial genes would emerge as significant gene ontology enrichments when genes with a high CPGN are selected. To address the issue, we examined the genes in the top 5% higher number of 5S and 45S contacts after correction for gene length (i.e., CPGN). For 5S rDNAgene contacts in LCL the cell component category of mitochondrion (GO:0005739) emerged on the top of the list, with 56 candidates (out of 494 genes) localized to the mitochondrion. The association is functionally intriguing and also emerged in the K562 dataset (S6 Table). The same class emerged among the top 5% in the 45S rDNA in LCL, with 63 candidates in the Landscape of rDNA contacts in Hi-C mitochondrion (GO:0005739; adjusted P < 0.05, after correction for multiple testing). The class includes interesting candidates such as seryl-tRNA synthetase 2 (mitochondrial SARS2; ENSG00000104835), tRNA 5-methylaminomethyl-2-thiouridylate methyltransferase (TRMU; ENSG00000100416) and tRNA methyltransferase 1 (TRMT1; ENSG00000104907), Era like 12S mitochondrial rRNA chaperone 1 (ERAL1; ENSG00000132591). In addition, 10 other functionally coherent cell components emerged for 45S rDNA-gene contacts in LCL (S7 Table; adjusted P < 0.05, for all classes in LCL; see S8 and S9 Tables for data on K562 and the Landscape of rDNA contacts in Hi-C ESC set). Four of those categories are highly significant GO terms containing the protein-components of the ribosome (GO:0005840~ribosome, GO:0022625~cytosolic large ribosomal subunit, GO:0015935~small ribosomal subunit, and GO:0022627~cytosolic small ribosomal subunit). Collectively, the data suggest that highly transcribed genes encoding protein constituents of the ribosome are co-localized in proximity to the rDNA arrays (Table 5). In addition, one GO term related to nucleolar function (GO:0005730~nucleolus) also emerged as significantly enriched with 39 genes in the top 5% of genes with higher numbers of 45S rDNA-gene contacts in LCL. Genes in this set include intriguing candidates such as NOP2 nucleolar

Developmental genes are depleted in rDNA contacts
Among genes in the bottom 5% CPGN in 45S, we observed seven HOX genes dispersed across several chromosomes (HOXA1, HOXA6, HOXA7, and HOXA11 on Chr 7, HOXB5 on Chr 17, HOXC11 on Chr 12, and HOXD13 on Chr 2), three of which showed zero 45S rDNA contacts [HOXA7(Chr 7), HOXC11(Chr 12), and HOXD13(Chr 2)] even in the dense LCL map. This suggests that developmentally regulated Hox genes are rarely localized in proximity to the rDNA arrays. Furthermore, we also found several other developmental genes in the set of 67 genes with zero contacts with 45S rDNA genes, further indicating that developmental genes show "repulsion" from the rDNA genes. Interesting candidates include NK2 homeobox 3 (NKX2-3) on Chr 10, BMP3 on Chr 4, BMP5 and BMP6 on Chr 6, as well as NOTCH1 on Chr 9. Interestingly, the histone cluster 1 H1 family member d (HIST1H1D) on Chr 6 also emerged without a single 45S rDNA contact in the dense 45S map of LCLs. Finally, we confirmed the lack of Hi-C contacts between the 5S and 45S arrays [24]. The segments proximal to the 5S array also displayed depletion in 45S rDNA contacts. The gene RHOU, for instance, is located adjacent to the 5S array and emerged in the bottom 3% of the distribution of 45S rDNA contact density.

Discussion
Multicopy ribosomal DNA arrays are essential components of the genome. Yet ribosomal DNA arrays are also among the most variable segments of the genome. The arrays have lagged behind with limited assemblies and little understanding of their nuclear localization. Here we report a detailed contact map of spatial interactions between the rDNA arrays and the rest of the genome. Although there are huge amounts of HI-C data, analyses of rDNA contact density for specific regions/genes remained a challenge because rDNA reads constitute a fraction of the Hi-C reads. Thus, we combined multiple Hi-C datasets to identify the subset of reads containing information on rDNA contacts. The effort was computational intensive because the fraction of rDNA reads in shotgun Hi-C is very small. This is particularly evident in the case of the 5S rDNA array: the contact data remained sparse even for LCL, the cell line that has by far the largest amounts of data collected from multiple Hi-C experiments. Nevertheless, we identified consistency of rDNA-gene contacts across different cells (LCL and K562; especially for 45S), which point to replicable spatial interactions. Heatmaps enabled visualization of rDNA contacts along the human genome with statistical analyses pinpointing significant differences in the density of contacts. While the approach can be applied to other multicopy genes as well as single copy genes or regions, we caution that the typical resolution of shotgun Hi-C is not sufficiently high. Indeed, limited resolution was apparent for both the 5S rDNA and 45S Landscape of rDNA contacts in Hi-C rDNA arrays, which required combining multiple datasets to ascertain contacts with genic and non-genic segments of the genome. In summary, the LCL map achieved good resolution for 5S and 45S contacts but the K562 set is quite sparse for 5S contacts, and both 5S and 45S maps are very sparse in the case of the ESC cell and ESC-derived cell types. Variation in rDNA contact density across genes reflects variation in proximity to the rDNA arrays. The data displayed over 100-fold variation in contact density across genes and revealed several intriguing patterns. First, the compilation enabled us to conduct statistical tests of the differential density of rDNA interactions between LCL and K562. These 45S maps are sufficiently dense, with differences in contact density likely reflecting differences in nuclear organization between these cells. These differences are not surprising since the LCLs are immortalized cells derived from lymphocytes whereas K562 is a myelogenous leukemia. K562 has, moreover, undergone genomic rearrangements [25]. While the data also suggested variation across ESC and ESC-derived cell types, greater coverage for these cells is necessary to draw sufficiently dense contact maps for a more fine-grained and meaningful biological contrast. An intriguing suggestion is that the rDNA/nucleolus represents a keystone in nuclear structure around which the rest of the genome is functionally organized [26] [24]. In this case, rDNA-contact differences between cells are bound to emerge and reflect functional variation.
Second, as a class, the rDNA proximity with genes previously identified as associated with rDNA CN variation across genotypes in human populations is undistinguishable from the background of genes. This indicates that genes impacted by rDNA CN are not spatially close to the rDNA arrays and are not enriched in direct rDNA contacts. This is not an unexpected observation, because the association of gene expression variation with rDNA CN includes hundreds of genes, with only a fraction of them likely to be directly regulated by the rDNA array (i.e. genes associated with rDNA CN are presumably modulated by both direct and indirect effects emerging from the rDNA). While we suggest that changes in nuclear architecture could be one way to for the rDNA to exert regulatory effects, the mechanisms through which rDNA CN directly modulates gene expression are likely varied and the ratio of direct to indirect effects is unknown. Third, we observed that genes encoding proteins that localize to the mitochondria display a disproportionally large number of contacts with the 45S rDNA. Concordantly, genes localized to the mitochondria also emerged as enriched in 5S rDNA contacts. The data suggests that genes localized to the mitochondria might be collectively regulated through aspects of nuclear architecture that are influenced by the rDNA. Noteworthy, connections between the rDNA array and mitochondrial gene expression and function have been uncovered before. In Drosophila, Paredes et al (2011) observed that engineered deletions in the rDNA array preferentially impacted the expression of genes whose protein products localized to the mitochondrion [27]. In humans, Gibbons et al (2014) observed that rDNA CN variation is associated with the expression of genes whose protein product localize to the mitochondrion as well as genes encoding protein components of the mitochondrial ribosome and mitochondrial DNA copy number [21]. Interestingly, in addition to its well-documented role as a structural component of the cytosolic ribosome, the 5S rRNA is also specifically imported into the mitochondria [28,29].
Fourth, we observed that genes localized to the nucleolus and encoding protein components of the ribosome were significantly enriched for 45S rDNA contacts. The finding points to the specificity of rDNA-genome interactions and suggests that ribosomal gene regulation might be directly influenced by the rDNA array. This pattern of rDNA-gene contacts might partially explain the observation that genes whose expression was correlated with rDNA CN included several candidates encoding the protein components of the ribosome. Indeed, sequence specific inter-chromosomal interactions between the yeast rDNA array and an intergenic segment adjacent to the largest RNA pol I subunit has recently been demonstrated [30].
All in all, our study identified functionally coherent genes and GO categories that are depleted and enriched in direct rDNA contacts. Ribosomal DNA contacted regions for all chromosomes along the human genome suggest a structural component underlying the global regulatory consequence of rDNA CN variation [21]. Finally, we note that as much as 29% of the 45S rDNA reads have both ends mapped in the 45S rDNA. These partially reflect linear proximity along the 45S rDNA unit but could also emerge from looping substructures with contacts between distant units; looping and contact among non-adjacent units has been suggested to facilitate ultra-structural organization of the array and coordinate transcription among rDNA repeat units [6,7,24,[31][32][33][34][35].
Concerted copy number variation (cCNV) refers to the correlation in copy number of 5S and 45S rDNA [36]. This co-variation in copy number across genotypes with variable rDNA array size is observed in human lymphoblastoid cells (LCLs) and occurs despite 5S and 45S rDNA residence on different chromosomes and lack of sequence homology between 5S and 45S rDNA subunits. Therefore, physical linkage between loci cannot explain the co-variation. On the other hand, spatial co-localization of the arrays as well as cellular processes of recombination such as those of micro-homology mediated end joining could conceivably contribute to the emergence of cCNV. Our results, however, confirmed a lack of direct 5S-45S contacts in Hi-C, an observation that is in agreement with a previous study [24]. This included a lack of 45S rDNA contacts with genes that are adjacent to the 5S rDNA array. The gene RHOU, for instance, is located next to the 5S and emerged in the bottom 3% of the distribution of 45S rDNA contact density. This indicates that the 5S and 45S rDNA are not in close enough proximity or that large protein complexes prevent the formation of 5S-45S Hi-C reads. The findings support the hypothesis that physical interactions occurring between 5S and 45S rDNA arrays are more restricted than previously anticipated. On the other hand, the denser maps presented here indicate that the 5S and 45S arrays share overlapping contact maps and many regions of the genome display a high density of contacts with both rDNA arrays. For instance, the density of 5S and 45S contacts is strongly correlated across genes and 1MB segments in both LCL and K562 cells. Whether or not this overlapping contact map is relevant for cCNV remains to be determined, but the evidence suggests that the two arrays are not completely independent. Coordination between them is likely to be relevant, with costs and benefits to 5S array proximity with the 45S arrays [24]. All in all, the association between contact density for the 5S and 45S arrays suggest that cCNV might be facilitated by structural proximity. Similarly, rDNA mediated structural changes in the nucleus might partially explain the regulatory consequences of naturally occurring variation in rDNA copy number [21].
From an evolutionary perspective, the co-existence of two clusters of rDNA loci (5S and 45S) might incur costs and benefits compared to rDNA residency on a single location. In some plants and yeasts, the 5S and 45S/35S rDNA subunits are spatially adjacent in the genome [7,[37][38][39][40][41], whereas in Drosophila and mammals, the 5S and 45S arrays reside on different chromosomes. However, the correlated contact maps for the 45S and 5S rDNA arrays suggest that they preferentially anchor at overlapping domains. This might narrow their spatial distances, and could explain why the 5S and 45S arrays can display apparent proximity to one another in a fraction of the cells as observed in cytological preparations. However, the lack of direct Hi-C 5S-45S contacts might suggest a model of competitive exclusion for similar anchoring sites, and predicts that a segment is in close proximity to either the 5S or 45S rDNA at each time. In cases of cytological proximity between the 5S and 45S arrays, large protein complexes might be present and prevent the emergence of direct 5S-45S inter-chromosomal contacts in the scale captured by Hi-C technology. Furthermore, the enrichment of rDNA contacts with ribosomal protein coding genes is surprising and might help explain the association between rDNA CN and the expression of these genes [21]. It suggests a structural component to the regulatory role of the rDNA and raises the possibility that the arrays might exert direct modulation of some genes via changes in nuclear organization. The data suggest that models that exclusively consider proximity to the rDNA arrays/nucleolus as a repressive modifier of gene expression might be overly simplistic. Rather, the distal and proximal association of genes with the rDNA arrays appears functionally motivated, as in the case of developmental genes or ribosomal genes. For instance, ribosomal gene proximity to the rDNA arrays could help facilitate coordinated Pol I, Pol II and Pol III responses. Collectively, these structural rDNA-mediated associations might have partially evolved to mitigate the fitness costs of dosage imbalances among highly expressed RNA and protein components of the translational machinery.

The 5S and 45S arrays
The human 5S rDNA along with flanking regions (chr1: 228,765,135-228,767,255) and the human 45S rDNA (GenBank reference number U13369.1, with modifications) were obtained as recently described [21,36,42,43]. The 45S reference comprises the 18S, 5.8S and 28S rRNA encoding segments, external transcribed sequences (ETS) and internal transcribed segments 1 and 2 (ITS1 and ITS2), as well as a~32 Kb non-coding intergenic spacer (IGS). Both 5S and 45S segments contain repetitive elements, such as Alu and Line1; all analysis carried out in this study used 5S and 45S sequences masked for these repeats.

Hi-C data sets
Raw Hi-C reads for LCLs and erythroleukemia K562 (K562) cells were downloaded from the Gene Expression Omnibus (GEO) repository with accession number GSE63525 [44]. Biological replicates with more than 1 technical replicate were included for a total of 6,017,877,658 reads in LCL and 1,366,228,845 reads in K562. In addition, raw Hi-C reads for five cell types were obtained from GEO data with SRA Study number SRP033089 [45]. The five cell types comprised the H1 embryonic stem (ES) cells and four differentiated cell-types derived from H1 [Mesendoderm (ME) cells, Mesenchymal stem (MS) cells, Neuronal Progenitor (NP) cells, trophoblast-like (TB) cells] [45,46]. The number of reads studied and recovery rates for 5S and 45S informative reads was summarized in Table 1.

Data preparation and mapping
All data were downloaded in SRA format and converted into FASTQ files by the NCBI SRA Toolkit's command (fastq-dump). FASTQ files were quality and adapter trimmed with Trim Galore. The trimming criteria required minimal quality score (> 20) and length (>50 bp). Next, we identified Hi-C reads that mapped to the 5S rDNA array or the 45S rDNA array. In this step, both forward and reverse reads were mapped independently to the 5S rDNA and 45S rDNA using Bowtie2 [47]. We used unpaired mapping with 'very-sensitive' mode (combinations of parameters: -D 20 -R 3 -N 0 -L 20 -i S, 1, 0.50). The mapping results were sorted and converted into binary format using SAMtools [48] and bed format using BEDTools [49]. We then extracted reads that mapped to the rDNA array and mapped the opposite end to repeat libraries. Reads for which one end mapped to repeats library were excluded. Finally, in order to identify potential confounders due to rDNA pseudogenes, both rDNA references were blasted against the human genome separately. Putative pseudogenes were identified as significant hits (E-value <1 × 10 −4 ) using BLASTN [50,51]. A segment of 1 MB was excluded from the analysis if an rDNA blast hit is identified within it. Similarly, a gene was removed from analysis if an rDNA blast hit is identified within its boundaries.

Detecting genome-wide physical interaction with rDNA loci at 1MB resolution
To identify spatial variation in genomic contact density along the chromosomes we segmented the human genome GRCh37/hg19 assembly into 3,173 bins of 1MB using BEDTools [49]. Bins with rDNA pseudogenes were excluded. Contact densities were summarized for each bin for each of 5 cell types and 2 cell lines. We calculated the number of Contacts Per Million reads (CPM) to normalize the data and control for different number of reads in each of the seven conditions. This placed all the data in a comparable scale, to enable visualization of contact density along the human genome using heat maps in the 'gplots' R package [52].

Analyses of rDNA-gene contacts
The term of "rDNA-gene contact" refers to reads with one end mapped to rDNA arrays and the other end mapped between the first and the last exon of an annotated gene in the human genome. We extracted coordinates of these reads using BEDTools [49] and the Gene Transfer Format (GTF) file: Homo_sapiens.GRCh37.75.gtf from the Ensembl database. GC content and length were also computed for each gene. To normalize contact densities in genes of different length, we computed the number of contacts per gene length in nucleotides (Contacts reads per gene per nucleotide, CPGN). The web based tool DAVID v6.8 [53] was used to investigate gene ontology enrichments for the top 5% of genes with greater CPGN for 5S rDNA or 45S rDNA genes. This corresponds to 494 out of 9864 genes for 5S-gene contacts, and 480 out of 9595 genes for 45S-gene contacts. The "one proportion" test [54] was also applied to address whether the number of mapped reads per base pair within a gene is significantly different from the genome wide average.

Statistical analysis of differential contact density between LCL and K562 cells
We modeled differential contact density per 1MB and per gene using the edgeR package and statistical approaches adapted from RNA-seq analysis [55,56]. Raw counts for physical contacts with rDNA loci within each bin along the human genome are modeled using generalized linear models (likelihood ratio tests) implemented in the edgeR package [55,56]. These approaches were recently been used to detect differential interaction density (DIs) in Hi-C data [19,57,58]. The models identified statistically significant differences among cell lines/ types in rDNA contacts density per MB and within genes. The Benjamini-Hochberg method was used for multiple testing correction [59], and statistical significance was denoted by FDR < 0.05. We applied the method to ascertain significant differences between LCL and K562 data from a single publication. For statistical comparison, we focused specifically on 11 biological replicates for LCL (collected with the Mbol enzyme) contrasted with two biological replicates for K562 (collected with the Mbol enzyme) and two biological replicates for LCL (collected with the DpnII enzyme). Each biological replicate consists of multiple technical replicates. We also evaluated variation among the five ES derived cell types, each with two biological replicates.

Functional annotation of rDNA contacts
We cross-referenced the rDNA contact map with several sources of functional annotation. First, Hi-C studies proposed the partition of the genome into A and B compartments that are widely interpreted as open and closed chromatin, respectively [60]. A/B coordinates were downloaded for LCL cells and 12 cancer types [60]. Second, coordinates of 15 functional regions identified in hESC using ChromHMM [61] were downloaded. Third, information on replication timing along the genome was downloaded from the Replication Domain Database (www.replicationdomain.org). Finally, CTCF binding coordinates were obtained from the CTCFBSDB database [62]. We extracted the coordinates for all the segments in each annotation and addressed its density of rDNA contacts. BEDTools was used to assess the number of mapped reads that overlapped with each annotated segment for each dataset. The percentage of mapped reads was calculated by dividing the number of reads mapped to the segment by the total number of mapped reads. The genome wide average read per base pair was used to compute the expected number of reads in the functional segment. Statistical significance was assessed with Chi-square tests. In addition, we applied the "one proportion" statistical test [54] to address whether the numbers of mapped reads per base pair within a functional segment (e.g., CTCF binding) is significantly different from the genome-wide average per nucleotide contact rate. and expected (white) rDNA contacts with each functional annotation for two sets of biological replicates in LCL. Expected numbers are calculated with the genome-wide per nucleotide contact rate. Shown is data for three selected annotations from 15-label genomic segmentation of hESC using ChromHMM [61]. (TIF) S1 Table. Summary table of Table. Number of 1MB segments displaying differential density of interactions with the 45S rDNA array in LCL and K562. Statistically significant segments ascertained with the EdgeR package. The number of segments with significant differences is shown for each chromosome. (TIF) S4 Table. Number of 1MB segments displaying differential density of interactions with the 45S rDNA array among five ESC-related cell types. Statistically significant segments ascertained with the EdgeR package. The number of segments with significant differences is shown for each chromosome. (TIF) S5 Table. Average number of contacts per nucleotide (CPN) obtained with three subsets of contacts. CPN obtained from (i) all rDNA-genome contacts (without controlling for pseudogenes), (ii) rDNA-genome contacts retrieved from 1MB bins without rDNA pseudogenes, and (iii) rDNA-gene contacts retrieved from genes without rDNA pseudogenes in them. Genomewide contacts were divided by the genome size or the total size of the bins without pseudogenes. rDNA-gene contacts were divided by the total length of the genome with genic sequences (protein-coding genes only). Shown are estimates for both 5S and 45S rDNA arrays. (TIF) S6