Copy Number Variation Is a Fundamental Aspect of the Placental Genome

Discovery of lineage-specific somatic copy number variation (CNV) in mammals has led to debate over whether CNVs are mutations that propagate disease or whether they are a normal, and even essential, aspect of cell biology. We show that 1,000N polyploid trophoblast giant cells (TGCs) of the mouse placenta contain 47 regions, totaling 138 Megabases, where genomic copies are underrepresented (UR). UR domains originate from a subset of late-replicating heterochromatic regions containing gene deserts and genes involved in cell adhesion and neurogenesis. While lineage-specific CNVs have been identified in mammalian cells, classically in the immune system where V(D)J recombination occurs, we demonstrate that CNVs form during gestation in the placenta by an underreplication mechanism, not by recombination or deletion. Our results reveal that large-scale CNVs are a normal feature of the mammalian placental genome, are regulated systematically during embryogenesis, and are propagated by a mechanism of underreplication.


Introduction
While the accumulation of somatic copy number variations (CNVs) has been proposed to be a result of the aging process, predisposing cell types to cancer progression and neurological diseases, an alternate hypothesis is that they are a normal, or even essential, part of cell biology [1,2]. In support of the latter, lymphocyte-specific CNVs in immunologically important genes generate the genetic diversity of receptor molecules critical to their function [3]. Although V(D)J recombination is found only in the immune system, recent reports hint that lineage-specific somatic CNVs may be essential for healthy cellular differentiation and function in a number of organs such as the liver, pancreas and skin [4,5]. It is unknown how these lineage-specific mammalian CNVs are formed, whether by a process similar to V(D)J recombination or by an alternative mechanism.
Although the role of many cell-type specific CNVs in mammals is unclear, lineage-specific CNVs are a normal aspect of cellular development in the fruit fly Drosophila melanogaster [6]. Lineage-specific CNVs form during Drosophila egg and larval development in polyploid cells via cycles involving DNA replication in the absence of cell division (endoreplication) [6]. In egg formation, somatic CNVs form by selective amplification of genomic regions containing chorion (eggshell) genes, which facilitates secretion of chorion proteins by the ovarian follicle cells [7,8]. Drosophila somatic CNVs can also arise due to underreplication of certain genomic regions in the salivary glands, fat body and midgut of the larva [9-13]. While CNVs in Drosophila polyploid cells have been observed for more than 70 years [14], it is not known whether a similar mechanism is present in mammalian cells. However, the recent observation of human tissue-specific CNVs [1-5] suggests that somatic CNVs are as essential in mammalian cells as they are in Drosophila.
Mammals absolutely require polyploid placental cells, counterparts to Drosophila follicle cells, for pregnancy maintenance [15]. In the placenta, polyploidy is restricted to specialized trophoblast cells that invade and remodel the uterus to promote vascularization and other maternal adaptations to pregnancy [15]. In rodents, these cells, termed trophoblast giant cells (TGCs), have 50-1,000 copies of the genome per cell. While proper TGC function depends on their degree of polyploidy [16,17], it is not known what aspect of polyploidy is necessary for fetal survival. As TGCs are a class of critical polyploid support cells analogous to Drosophila follicle cells, they may similarly use differential replication of the genome to achieve highly specialized function.
Previous studies have addressed possible CNVs in rodent TGCs. Ohgane et al. [18] used restriction landmark genomic scanning (RLGS) to analyze CpG islands in rat junctional zone TGCs during late gestation (days 18 and 20). They reported that ≥97% of the spots detected by RLGS were similar to diploid controls and therefore concluded that there are no TGC CNVs. Sher et al. [19] also argued against the existence of CNVs based on array comparative genomic hybridization (aCGH) and quantitative real-time PCR experiments on mouse e9.5 implantation site TGCs. However, as there are several subtypes of TGCs, which vary in ploidy and functional significance during gestation [15,20], CNVs could be present in a subset of cell types or only at certain developmental time points. Of particular interest are parietal TGCs, which have the highest degree of polyploidy [15] and are therefore an excellent candidate for differential replication of the polyploid genome. Genetic mouse mutants affecting the parietal TGCs predominantly die before e12.5 [15-17], suggesting that this is when developmentally important CNVs would be required.
Here we report that somatic CNVs are a normal part of placental cell biology. We utilized whole genome sequencing (WGS) and aCGH to identify 47 reproducibly underrepresented (UR) domains in mouse e9.5 parietal TGCs, totaling 6% of the genome. Employing a variety of genomic techniques, we demonstrate that UR domains are marked in chromatin prior to endoreplication in TGC progenitor cells and gradually form during the first half of gestation. UR domains are highly enriched for genes involved in cell adhesion and neurogenesis, as well as for gene deserts. Furthermore, we specifically show that UR domains are due to underreplication rather than somatic deletions. Together, these data reveal that lineage-specific CNVs are

Author Summary
Generally, every mammalian cell has the same complement of each part of its genome. However, copy number variation (CNV) can occur, where, compared to the rest of its genome, a cell has either more or less of a specific genomic region. It is unknown whether CNVs cause disease, or whether they are a normal aspect of cell biology. We investigated CNVs in polyploid trophoblast giant cells (TGCs) of the mouse placenta, which have up to 1,000 copies of the genome in each cell. We found that there are 47 regions with decreased copy number in TGCs, which we call underrepresented (UR) domains. These domains are marked in the TGC progenitor cells and we suggest that they gradually form during gestation due to slow replication versus fast replication of the rest of the genome. While UR domains contain cell adhesion and neuronal genes, they also contain significantly fewer genes than other genomic regions. Our results demonstrate that CNVs are a normal feature of the mammalian placental genome, which are regulated systematically during pregnancy.

Polyploid TGCs have recurrent and reproducible CNVs
To investigate whether the 50-1,000 genomic copies in polyploid TGCs are uniformly replicated or contain CNVs, we used aCGH to compare genomic regions of mouse parietal TGCs (TGCs) and 2N embryos at e9.5 (Figure 1A, Figure S1A). We dissected four embryos and associated TGCs from one litter, representing pairs of genetically identical tissues, performed aCGH using the Agilent SurePrint G3 Mouse CGH Microarray Kit (two embryos/TGCs pooled per biological replicate), and analyzed the data using the R/Bioconductor package cghFLasso [21]. We identified 45 regions, reproducible between biological replicates, that were underrepresented within the TGC genome compared to the embryonic genome at a false discovery rate (FDR) of 0.0001, which we termed underrepresented (UR) domains (Figure 1B, Table S1). UR domains range in size from 1,037 kb to 9,429 kb (Table S1). In addition to the 45 UR domains common to both replicates, we found 30 domains specific to only one replicate (Figure 1B). However, when we relaxed the FDR threshold (to 0.01), 19/30 of these domains were found in both replicates, suggesting that while the degree of underrepresentation varies, UR domains form in specific regions of the genome. Importantly, we did not observe any overrepresented regions in TGCs (FDR = 0.0001).
We next asked whether UR domains were specific to TGCs, or whether they existed in diploid trophoblast cells or other endocycling polyploid cells. We used aCGH to compare the DNA of megakaryocytes (up to 64N) to embryos, placental disk cells (mostly 2N) to embryos, and cultured trophoblast stem cells (TS cells; 2N) to embryonic stem cells (ES cells; Figure 1C, Figure S1B, Figure S2). Megakaryocytes have no detectable underrepresented regions and display one region of overrepresentation common to both replicates, indicating that TGC UR domains are not simply explained by endocycling (FDR = 0.0001; Table S2). Placental disk cells lack any over- or underrepresentation (FDR = 0.0001; Table S3), although greatly relaxing the FDR threshold (to ≥0.05) revealed a weak trend towards UR domains in the same locations as in TGCs, likely explained by the normal presence of a small number of TGCs within this population (Figure 1C, Figure S2). Finally, we identified several TS- and ES-specific CNVs, but these were different from the TGC UR domains and presumably represent adaptations to cell culture (Tables S2 & S3) [22]. These data suggest that UR domains are important genomic features unique to TGCs.
As Sher et al. [19] have argued against the existence of CNVs in e9.5 TGCs, we compared our aCGH data to theirs. Consistent with Sher et al., we did not find any CNVs in their data using the R/Bioconductor package cghFLasso and an FDR of 0.0001 [21]. However, greatly relaxing the FDR threshold (to >0.05) revealed a trend towards UR domains in the same locations as in our TGC data (Figure S3), similar to the report by Sher et al. of finding reduced copy number using a less stringent threshold. Moreover, the Sher et al. data bear a striking resemblance to our placental disk data (Figure S3), suggesting that their study, on implantation site TGCs, examined a population of trophoblast cells more akin to the placental disk than to the parietal TGCs of the mural trophectoderm described in our study. In support of this, while parietal TGCs surround the entire conceptus, TGCs over the central region of the placental disk are smaller and less polyploid than those at the periphery [20]. Together, these data suggest that the parietal TGCs of the mural trophectoderm not only have a higher degree of ploidy, but also have specific CNVs compared to the rest of the placenta.

Whole genome sequencing reveals UR domains in individuals
To quantitatively examine the extent of underrepresentation in TGCs, we performed paired-end WGS [23]. We sequenced (at 10× coverage) six individual e9.5 TGCs and their genetically matched embryos from three separate litters (2 individuals per litter; Table S4). To identify CNVs, we used a custom R/Bioconductor program based on CNVnator [24], which identifies CNVs at a p-value of 0.01. We found 47 reproducible UR domains on the autosomes in e9.5 TGCs in all samples (Table S5). UR domains range from 75 kb to 8,965 kb and cover 6% of the genome (138 Mbs of 2,717 total Mbs; Table 1). We next calculated the fold depletion of each UR domain from the normalized log2 ratio of sequence coverage of TGC/embryo [25] and found an average reduction between 27% and 51%, with a median between 28% and 54% (Table 1). Further, the size and degree of depletion of UR domains correlate (Figure 2A).
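The relationship between a log2 coverage ratio and a percent depletion can be illustrated with a short calculation. The sketch below is not the authors' pipeline; it assumes the standard conversion, depletion = 1 - 2^(log2 ratio), and the function names and coverage values are hypothetical.

```python
import math


def log2_ratio(tgc_coverage: float, embryo_coverage: float) -> float:
    """Log2 ratio of normalized TGC coverage to matched embryo coverage."""
    return math.log2(tgc_coverage / embryo_coverage)


def percent_depletion(ratio_log2: float) -> float:
    """Convert a log2(TGC/embryo) ratio into percent depletion.

    Assumes depletion = 1 - 2**ratio, so a ratio of 0 means no depletion
    and a ratio of -1 means half the expected copies are present.
    """
    return (1.0 - 2.0 ** ratio_log2) * 100.0


# Hypothetical normalized coverage values for one domain.
ratio = log2_ratio(tgc_coverage=5.0, embryo_coverage=10.0)
print(ratio, percent_depletion(ratio))
```

Under this conversion, the reported reductions of roughly 27% to 54% would correspond to log2 ratios between about -0.45 and -1.12.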
Next, we examined how much variation existed between individuals. First, we compared aCGH and WGS data, and found 43 UR domains common to both platforms (Figure 2B, Table 1, Figure S4, Table S1). Of the domains that differ, five additional domains in the WGS data are likely due to the greater sensitivity of WGS, as these domains can also be found in the aCGH data if the FDR threshold is relaxed (to 0.01). Three additional domains in the aCGH data are found in a majority of the WGS samples (present in four to five out of the six samples), suggesting a small amount of variability in UR domain formation (Tables S1 & S5). To examine this variability in more depth, we examined the six individual WGS samples. Besides the 47 UR domains common to all six samples, we also found underrepresented regions present in only a subset (Figure 2C, Figure S5, Table S5). In general, samples with the fewest UR domains have a subset of the domains found in the samples with the most (Figure 2C, Figure S5, Table S5). In addition, the size of a particular UR domain is generally smaller in samples with fewer UR domains (Figure 2D, Table S5). As the samples vary slightly in age, this suggests that UR domains amass over time, such that slightly younger placentas have fewer and smaller UR domains.

Figure 3 (legend, continued). ... domains present at late gestation. Plot comparing position along chromosome 14 to the NLog2 Ratio of array intensity of TGC vs. embryo. Red: e9.5; blue: e11.5; green: e13.5; orange: e16.5. Two biological replicates are plotted for each stage. Dashed line: FDR = 0.0001. All autosomes shown in Figure S7. E. Location of UR domains during the second half of gestation. Summary of results from both biological replicates of aCGH of TGCs from e9.5, e11.5, e13.5, and e16.5, all versus embryos (FDR = 0.0001). Darker green/longer bars indicate UR domains present in more replicates. Asterisks indicate the location of UR domains present at e9.5. F. Depletion of UR domains does not significantly change between e9.5 and e16.5; however, depletion of UR domains significantly differs between biological replicates at e13.5 and e16.5. Box plot compares percent median depletion of each biological replicate at stages e9.5, e11.5, e13.5, and e16.5. To compare with (C), aCGH data was normalized to WGS depletion levels (e9.5). Asterisks mark comparisons that are statistically significant (p<0.01). doi:10.1371/journal.pgen.1004290.g003

The number, size and degree of depletion of UR domains expand during early gestation
To test our hypothesis that UR domains develop over time, we performed WGS on e8.0 TGCs/embryos (one litter per replicate) and compared these results to e9.5. We found 24 domains common to both biological replicates at e8.0, versus 47 domains common to all samples at e9.5 (Figure 3A & 3B, Figure S6). All e9.5 individuals have 23 of these domains, with 5/6 individuals containing the remaining domain (Figure 3B). We also found 10 domains unique to one of the two biological replicates at e8.0; 10/10 of these domains are contained in all e9.5 individuals (Figure 3B). Finally, we found that both size and degree of depletion of UR domains significantly increase between e8.0 and e9.5 (Figure 3C). Overall, as the UR domains found at e8.0 are also present at e9.5, and UR domains at e9.5 are more numerous, larger and more depleted, we propose that they are gradually established during early gestation.

New small and stochastic CNVs form in later gestation
We next asked whether the number and degree of depletion of UR domains continue to increase throughout development. We performed aCGH on TGCs/embryos collected from the second half of gestation (e11.5, e13.5 and e16.5) and compared them to e9.5. Out of 45 UR domains present in both biological replicates at e9.5 (FDR = 0.0001), 22 are present in all biological replicates at e11.5, e13.5 and e16.5, and an additional 10 (32/45) are present in all samples except for one of the e16.5 replicates (Figure 3D & 3E, Figure S7). We next examined size, and found that the 32 common domains are significantly larger than UR domains that arise later in development (the 147 not present at e9.5; Figure 3D & 3E, Figure S7). However, unlike between e8.0 and e9.5, where the degree of depletion increased, we found no significant change from e9.5 to e16.5 (Figure 3F), although UR domains trend slightly towards becoming less depleted over time (Figure 3D & 3F, Figure S7). There is also more intrinsic variability later in gestation, as the median degree of depletion between biological replicates at both e13.5 and e16.5 is significantly different (Figure 3F). The differences between UR domains in early (e8.0-e9.5) and later (e11.5-e16.5) gestation correlate with previous data showing that TGC polyploidy drastically increases until e10.5, and endocycling ends by e13.5 [20]. These data suggest that the increase in UR domain size and degree of underrepresentation from e8.0 to e9.5 is linked to the robust endocycles of early gestation. Furthermore, the termination of endocycles in later development may free cellular machinery to increase representation levels in UR domains.
We also found 33 overrepresented regions at e11.5-e16.5 that are not present at e9.5 (Figure 3D & 3E, Figure S7). We examined gene content of overrepresented regions common to at least two staged biological replicates (10/33), but did not find any annotated genes. Thus, while new CNV regions form during late gestation, they are more stochastic, less reproducible, and significantly smaller than those conserved between all stages.

UR domains form during in vitro differentiation
We next examined whether UR domains are also generated in vitro when differentiating TS cells into TGCs. To this end, we performed aCGH on purified TGCs harvested at 3, 5 and 7 days after differentiation [26][27][28] (Figure S8). Similar to in vivo, in vitro cells generate the same UR domains and also develop these over time (FDR = 0.0001, Figure 4A & 4B, Figure S8). At day 3, only one biological replicate has any of the UR domains found in vivo at e9.5 (3/45 Figure 4A & 4B, Figure S8), strongly suggesting that the formation of these UR domains is a fundamental feature of TGC development.

UR domains are highly enriched for genes involved in cell adhesion and neurogenesis
Next, we asked whether genes contained within e9.5 TGC UR domains were enriched for certain biological functions. We found that UR domains are significantly depleted of both protein-coding and non-coding genes compared with what would be expected by chance (386 observed vs. 617 expected, 0.63× enrichment, p<0.001) and when compared to the rest of the genome (Figure 4C). Further, these domains are significantly enriched for 1 Mb gene deserts (regions without any Ensembl annotations; 47 observed vs. 9 expected, 4.96× enrichment, p<0.001). In total, 386 genes are present within UR domains, 106 of which are functionally annotated. When we examined these 106 genes for function using GOTERMFINDER [29], the top enrichment categories are biological adhesion (p = 2.31 × 10^-9) and related categories, followed by neuron projection development (p = 4.23 × 10^-8) and related neurogenesis categories. These categories were not enriched when we performed the same analyses on a list of genes found in a random set of regions that have the same length and chromosome distribution. Finally, using 3′ RNA-Seq (3SEQ) [30] from both in vivo and in vitro TGCs, we compared expression of the genes to the degree of representation and found that genes in UR domains are either not expressed or have much lower levels of transcription than genes in regularly represented regions (Figure 4D & 4E). Overall, our data show that there are specific classes of genes enriched within the UR domains and these genes are generally not expressed, raising the possibility that UR domains function to limit the expression of a particular subset of genes in TGCs.
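The observed-versus-expected comparison above can be illustrated with a small sketch. The exact randomization procedure is not described in this section, so the code assumes a generic shuffle-based null in which gene counts are tallied for many random region sets; the function names and the null counts are hypothetical, and only the 386 and 617 figures come from the text.

```python
def fold_enrichment(observed: float, expected: float) -> float:
    """Ratio of the observed count to the count expected by chance."""
    return observed / expected


def empirical_p_depletion(observed: int, null_counts: list) -> float:
    """One-sided empirical p-value for depletion.

    Counts how many shuffled region sets contain as few or fewer genes
    than observed; the +1 terms avoid reporting a p-value of exactly zero.
    """
    at_least_as_extreme = sum(1 for count in null_counts if count <= observed)
    return (at_least_as_extreme + 1) / (len(null_counts) + 1)


# Gene counts reported in the text: 386 observed vs. 617 expected.
print(round(fold_enrichment(386, 617), 2))

# Hypothetical gene counts from four shuffled region sets (illustration only).
print(empirical_p_depletion(386, [600, 620, 610, 590]))
```

With only four shuffles the smallest attainable p-value is 0.2; reaching p<0.001 under this scheme would need at least 999 shuffled sets.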

UR domains are heterochromatic
To test whether UR domains are characterized by a specific chromatin state, we performed ChIP-Seq using anti-H3K27ac, anti-H3K4me1, anti-H3K4me3, anti-H3K9me3, and anti-H3K27me3 in both in vitro TS cells and derived TGCs [31]. We used MACS2 to determine the normalized fold change for histone occupancy [32] and then used the Pearson correlation (R) to determine how the degree of representation (normalized log2 of e9.5 WGS) correlates with signals from histone marks. In both TGCs and TS cells, we find that UR domains tend to co-localize with the repressive marks H3K9me3 and H3K27me3 (Figure 5). Conversely, UR domains are depleted of the active chromatin marks H3K4me3, H3K4me1 and H3K27ac (Figure 5). These results demonstrate that UR domains do not occur in active regions of the genome and that they are marked in the 2N progenitor cells (TS cells). Interestingly, UR domains are only a fraction of genomic heterochromatin (Figure 5B & 5C). All UR domains have increased signals for repressive histone marks and only weak signals for active histone marks. However, not all regions of the genome that have repressive marks but lack active marks are associated with a UR domain. Overall, this demonstrates that UR domains have a heterochromatic signature, but represent only a subset of heterochromatin.
We further examined the relationship between UR domains and heterochromatin using an alternative statistical method. We asked whether the histone marks are significantly enriched or depleted in our defined list of UR domains compared to what would be expected by chance [31]. Similar to our correlation analysis, marks associated with transcriptional activation (H3K4me3, H3K4me1 and H3K27ac) are significantly depleted in UR domains (p<0.001; Table 2). Conversely, the repressive mark H3K9me3 is enriched within UR domains (p<0.001; Table 2). Interestingly, while the repressive mark H3K27me3 is also enriched within UR domains in TS cells, it is depleted within UR domains in TGCs (p<0.001; Table 2). This observation agrees with previous data where extraembryonic cells have lower levels of H3K27me3 methylation than embryonic cells [33], and suggests that H3K27me3 is not critical for UR domain maintenance. Together, our data show that UR domains have a heterochromatic signature, both in TGCs and in their 2N progenitors.
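The correlation step described in this section can be sketched as follows. This is a minimal illustration, assuming the genome has been divided into bins with one representation value (normalized log2 TGC/embryo ratio) and one histone-mark fold change per bin; the binning and all numbers are hypothetical and do not come from the study.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical per-bin values: lower representation paired with a higher
# repressive-mark signal gives a negative R.
representation = [0.0, -0.2, -0.6, -1.0]
repressive_signal = [1.0, 1.4, 2.2, 3.0]
print(pearson_r(representation, repressive_signal))
```

In this convention a repressive mark that co-localizes with UR domains yields a negative R against representation, while an active mark depleted from UR domains yields a positive R.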

UR domains are not caused by deletions
To examine whether UR domains are caused by genomic deletions, we carried out somatic structural variant analysis using paired-end sequencing data from the six TGC and matched embryo samples with the program SMASH [34]. If UR domains are caused by acquired genomic deletions, we would expect to find multiple library inserts that fully span the deleted regions ("discordant" paired-end reads; Figure S9). While we did detect sample-specific CNVs, we did not detect somatic deletions common to all six TGCs but absent from the embryos. Moreover, the probability of not detecting a given deletion in each of the six samples is extremely low (p = 2N610 25 ). These data show that UR domains are not a result of somatic chromosomal deletions.

Figure 6 (legend, continued). C. Box plot analysis shows that UR domains are smaller than the late-replicating regions that contain them. Asterisk marks the comparison that is statistically significant (p<0.01). D. UR domains form from a subset of late-replicating regions. Diagram depicting late-replicating regions that contain UR domains versus ones that do not contain UR domains. E. Box plot analysis shows that the late-replicating regions that contain UR domains are significantly larger, but not significantly more late-replicating, than those that do not. Asterisk marks the comparison that is statistically significant (p<0.01). F. Box plot analysis shows that the late-replicating regions that contain UR domains have significantly fewer genes than those that do not (double asterisks). "Shuffled" refers to a random set of regions that have the same length and chromosome distribution. Asterisks mark comparisons that are statistically significant (p<0.01). doi:10.1371/journal.pgen.1004290.g006

UR domains are late-replicating chromosomal segments
Since our WGS data do not support genomic deletions as the source of UR domains, we investigated whether they may be due to underreplication (Figure S9B). In 2N cells, replication timing is precisely regulated such that specific regions of the genome are replicated early in S phase while others are replicated late in S phase [35]. To test whether UR domain formation is caused by incomplete replication of regions that are normally replicated late in 2N TS cells, we first generated a replication timing profile of TS cells. To this end, we captured early- and late-replicating regions in TS cells by pulsing an asynchronous cell culture with BrdU to label replicating DNA followed by FACS, and then used aCGH to compare early and late BrdU-containing DNA [36]. Next, we compared late-replicating regions in TS cells to UR domains. Using the Pearson correlation (R), we found that UR domains correlate with late replication (Figure 6A). Also, 47/47 TGC UR domains reside within late-replicating regions in TS cells (Figure 6B, Table S6). UR domains are significantly smaller than the late-replicating regions that they are nested in (Figure 6C; Table S6), suggesting that they are a subset of these larger regions.
Finally, as only 45 of the 211 late-replicating regions contain a UR domain (Figure 6D, Table S6), we asked what distinguishes the late-replicating regions that form UR domains from those that do not. While there is no significant difference in the degree of late replication between these classes, late-replicating regions that contain UR domains are significantly larger (Figure 6E). However, size is not the sole characteristic determining where UR domains form, as not all regions greater than a certain size contain a UR domain. We next investigated gene content and found that late-replicating regions that contain UR domains also contain significantly fewer genes than those that do not (Figure 6F). These regions are also preferentially enriched for 1 Mb gene deserts (58 observed vs. 18 expected, 3.16× enrichment, p<0.001). Together, our data show that UR domains form from a specific class of late-replicating, heterochromatic regions with low gene content, suggesting that UR domains are not simply a byproduct of late-replicating heterochromatin, but are a precisely regulated subset.
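The nesting relationship described above (UR domains residing within late-replicating regions, and only some late-replicating regions containing a UR domain) amounts to an interval containment check. The sketch below is an illustration only: the tuple layout, function names and coordinates are hypothetical, and the study's own comparison may have been performed differently.

```python
def is_nested(domain: tuple, region: tuple) -> bool:
    """True if `domain` lies entirely inside `region` on the same chromosome.

    Both arguments are (chromosome, start, end) tuples.
    """
    dom_chrom, dom_start, dom_end = domain
    reg_chrom, reg_start, reg_end = region
    return dom_chrom == reg_chrom and reg_start <= dom_start and dom_end <= reg_end


def regions_containing_domains(domains: list, regions: list) -> list:
    """Late-replicating regions that contain at least one UR domain."""
    return [r for r in regions if any(is_nested(d, r) for d in domains)]


# Hypothetical coordinates (bp), for illustration only.
ur_domains = [("chr14", 2_000_000, 3_500_000), ("chr14", 4_000_000, 4_500_000)]
late_regions = [
    ("chr14", 1_000_000, 5_000_000),   # contains both toy UR domains
    ("chr14", 9_000_000, 12_000_000),  # contains none
    ("chr2", 2_000_000, 3_500_000),    # same coordinates, other chromosome
]
print(regions_containing_domains(ur_domains, late_regions))
```

Because one region can hold more than one domain, the number of regions returned can be smaller than the number of domains, which is how 47 UR domains can reside in 45 late-replicating regions.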

Discussion
We report here the first mammalian example, outside of the immune system, of lineage-specific CNVs being an integral part of normal cell biology and development. Notably, we show that CNVs in placental cells form via a novel mechanism unrelated to V(D)J recombination. Using both aCGH and high-throughput WGS, we identified 47 reproducible underrepresented domains in mouse parietal TGCs totaling 138 Mbs, or 6% of the genome. We found that UR domains are highly enriched for genes involved in cell adhesion and neurogenesis, as well as for gene deserts. Furthermore, we specifically show that UR domains are due to underreplication of a specialized type of heterochromatin, rather than acquired genomic deletions. Our data reveal that lineage-specific CNVs are a normal aspect of the TGC genome that are established and regulated during gestation.

Establishment of UR domains may involve a novel chromatin remodeler
Only a subset of heterochromatic, late-replicating regions form UR domains, suggesting that UR domains are not simply a byproduct of late-replicating heterochromatin, but are precisely regulated. We propose that either this is dictated by genomic structure or that there are specific DNA binding proteins that define UR domains. We favor the latter model based on parallels found in Drosophila, whereby mutants for Suppressor of Underreplication (SuUR) have underreplicated domains that become replicated to normal levels [12,13,37]. However, SUUR protein does not appear to be present in species outside the Drosophilids, and we have not found any SuUR homologs in mice via BLAST, raising the possibility that presently unknown proteins in mammals may be regulating this process.

Lineage-specific CNVs in mammalian development
Lineage-specific CNVs are an overlooked aspect of the mammalian genome. Although recent data suggest that they are widespread [1-5], their identification and functional study have not been carried out systematically. CNVs may be particularly difficult to identify in primary tissues, due to a high background of cells lacking CNVs. In support of this, Abyzov et al. [4] found a low frequency of somatic CNV in human fibroblasts. Further, even in more homogeneous populations, relatively small degrees of CNV may mask their presence. Van Heesch et al. [38] found tissue-specific CNVs in rat blood, brain, liver and testis, where the degree of underrepresentation does not exceed 50%. While Van Heesch et al. conclude that their findings were the result of systematic bias in DNA isolation procedures, they could not eliminate these CNVs using any analytical or experimental approach. Moreover, Manukjan et al. [39] suggest that Van Heesch et al. are identifying the signature of replication timing in their CNV analyses due to the use of proliferating cells. Intriguingly, this suggests that, analogous to polyploid TGCs in the placenta, underreplication may be crucial in organs containing a highly proliferative population of 2N cells.

Convergent evolution of CNVs in flies and mice suggests function
While CNVs in Drosophila polyploid cells have been characterized for more than 70 years [14], our work demonstrates for the first time that CNVs are a normal aspect of mammalian development. The rarity of endoreplicating polyploid cells in animals suggests that CNVs in mouse and Drosophila arose independently [6], and therefore may have species-specific differences. While Drosophila CNVs are typically 90% underrepresented, mouse CNVs are never more than 50% underrepresented. We strongly suggest that there are UR domains in both mouse and Drosophila polyploid cells, and that the presence of these domains in both taxa is an example of convergent evolution due to similar selective pressures, indicative of functional importance. As both mice and flies have a fast rate of early development compared to related species, formation of UR domains could be an integral part of accelerating the cell cycle, and therefore be a key mechanism behind their rapid life cycles.

UR domains as a mechanism to drive TGC function
UR domains are a unique feature of the TGC genome, suggesting that they play a central role in placental function and pregnancy. Consistent with this, UR domains are enriched for specific classes of genes involved in cell adhesion and neurogenesis. Intriguingly, there is evidence that downregulation of both classes of proteins is crucial for placental function. Downregulation of cell adhesion genes is necessary for trophoblast invasion in both mice and humans [40,41]. Further, and quite remarkably, Liao et al. [42] found that upregulation of genes in the SLIT/ROBO neuronal guidance system in the human placenta is associated with the pregnancy disease pre-eclampsia. UR domain formation could also enable TGCs to simply save materials and time, a hypothesis that has been proposed for polyploidy in general [43]. TGCs are essential during the first half of gestation, when it is absolutely critical for the rapidly growing embryo to establish a connection with the mother [15,44]. Formation of UR domains could allow for more rapid maturation of TGCs by allowing replication initiation to proceed without waiting for replication of nonessential regions of the genome. In support of this, UR domains represent a significant part of the genome, 6% (138 Mbs of 2,717 total Mbs), and therefore the cell would require considerable resources to fully replicate these regions. Together, functional evidence and convergent evolution suggest that UR domains are a critical element during pregnancy. Regardless, placental UR domains are the first mammalian example, outside of the immune system, of lineage-specific CNVs being an integral part of normal cell biology and development.

USDA License 93-R-0004). Stanford APLAC and institutional guidelines are in compliance with the U.S. Public Health Service Policy on Humane Care and Use of Laboratory Animals. The Stanford APLAC approved the animal protocol associated with the work described in this publication.

Mice
129-Elite, C57BL/6 and pregnant C57BL/6 mice were obtained from Charles River. Copulation was determined by the presence of a vaginal plug the morning after mating, and embryonic day 0.5 (e0.5) was defined as noon of that day. TGCs and embryos were dissected in 1× PBS (1:10 dilution of 10× PBS, pH = 7.4; Gibco) and stored on ice until further processing. After removal of the decidua, parietal TGCs of the mural trophectoderm [15] were dissected away from the placental disk, and, when possible, Reichert's membrane (Figure S1A). TGCs were identified by their extremely large cell size (Figure 1A). Using single-nucleotide polymorphism data from F1 crosses, TGCs were predicted to have, at the most, approximately 5% contamination by maternal cells (Hannibal & Baker, unpublished data). Placental disk tissue was gathered from e13.5 placental disks after the removal of the decidua and obvious parietal TGCs. For gathering 2N genomic DNA, at e8.0, the entire embryo was collected; at e9.5, the embryo body, after removal of obvious organs and head (removed at otic vesicle), was collected; and at later stages, limbs, or a mixture of limbs and the tail, were collected (Figure S1A).

Nuclear staining
For confocal imaging, TGCs/embryos were fixed in 4% paraformaldehyde at 4 °C overnight. Samples were stained with 0.5 µg/mL DAPI (Life Technologies) in 1× PBS overnight, washed in 50% glycerol/1× PBS and stored in 70% glycerol/1× PBS. Confocal images were taken on a Leica DM IRE2 inverted microscope using the Leica SP2 software package, located in the Stanford Cell Sciences Imaging Facility.

Cell culture
Trophoblast stem cells were cultured as described in Chuong et al. [31] following [27]. TS cells were differentiated into parietal TGCs by replacing the FGF, Activin and Heparin in the media with retinoic acid [27,28]. Mature TGCs are seen after 4-6 days of differentiation [26] and were collected on days 3, 5 and 7. TGCs/TS cells were further isolated for aCGH by placing cultured cells over a two-step density gradient (1.5% BSA over 3% BSA in a 15 mL tube; Figure S1B). TGCs sank to the bottom of the tube while the smaller TS cells stayed in the upper fraction.
The embryonic stem cell line CGR8 is a germ-line competent cell line established from the inner cell mass of a 129 e3.5 male pre-implantation embryo [45]. ES cells were cultured feeder-free on 0.1% gelatin coated plates. The ES cell medium was prepared by supplementing knockout DMEM (Invitrogen) with 15% FBS, 1 mM glutamax, 0.1 mM nonessential amino acids, 1 mM sodium pyruvate, 0.1 mM 2-mercaptoethanol, penicillin/streptomycin, and 1000 units of leukemia inhibitory factor (LIF; Millipore). Cell culture was maintained at 37 °C with 5% CO2.
Megakaryocytes were derived and cultured as described in [46]. Briefly, fetal livers were dissected from e13.5 C57BL/6 embryos in Hanks' Balanced Salt Solution and placed in DMEM with 10% FBS supplemented with 100 µg/mL penicillin-streptomycin (Invitrogen). Livers were pooled based on sex of the embryo (males pooled and females pooled separately). To make a single cell solution, livers were aspirated through a progression of 18G, 21G and 23G needles. To promote differentiation into megakaryocytes, cells were cultured for five days in media containing thrombopoietin (TPO; R&D Systems) at 37 °C with 5% CO2. Successful differentiation was identified by 1) the presence of large cells (megakaryocytes) and by 2) FACS to confirm up to 32N ploidy. For FACS, propidium iodide stained samples were run on a Cytek DxP10 modified Facscan (Cytek Technologies, BD Biosciences) using the blue laser. Approximately 10,000 events per sample were collected. Data was analyzed using FlowJo (Treestar, Inc.). Megakaryocytes were isolated for aCGH by placing cultured cells over a two-step density gradient (1.5% BSA over 3% BSA in a 15 mL tube; Figure S1B). Megakaryocytes sank to the bottom of the tube while smaller, undifferentiated cells stayed in the upper fraction.

ArrayCGH and whole genome sequencing
Genomic DNA was extracted from fresh tissue and cultured cells using the DNeasy Blood & Tissue Kit (Qiagen). Before column purification, in vivo and in vitro samples were digested with proteinase-K (600 mAU/ml solution or 40 mAU/mg protein) overnight and for 10 minutes, respectively, at 56 °C, followed by a 4 minute incubation with RNase A (100 mg/mL; Qiagen DNeasy Blood & Tissue Kit). If necessary, DNA was further concentrated via ethanol/sodium acetate precipitation following standard protocols.
For arrays performed on DNA from TGCs, placental disks and embryonic controls, genomic DNA from two individuals in the same litter was pooled for each condition. For megakaryocyte arrays, cells derived from 5-6 livers from a single litter were pooled for each condition. For controls for the megakaryocyte array, three embryos (a subset of the litter from which livers were collected) were pooled for each condition. For arrays performed on DNA from cultured cells, two replicates from different passages were used (5 million cells each). For each condition, approximately 4 µg DNA was sent to the Biomedical Genomics Core at the Research Institute at Nationwide Children's Hospital (Columbus, OH) for processing with the SurePrint G3 Mouse CGH Microarray Kit, 4×180K (Agilent). For all arrays performed on DNA from in vivo tissue, to ensure that the arrays detect copy number variation, duplicates consist of 1) female test versus male control and 2) male test versus female control.
aCGH data was analyzed using the R/Bioconductor package cghFLasso, which utilizes reference arrays in conjunction with a false discovery rate (FDR) [21]. An FDR of 0.0001 was used in order to examine all of the autosomes simultaneously. To determine which array to use as the reference, several analyses were performed. The TS versus ES array exhibited specific CNVs, presumably due to genomic adaptations to culturing [22]. The megakaryocytes displayed only a small region of overrepresentation and the placental disk array did not display any CNVs (FDR = 0.0001). However, as the placental disk has a small amount of underrepresentation in reproducible areas of the genome (FDR ≥ 0.05), the megakaryocyte array was used as the reference for the remainder of the analyses. aCGH data was plotted using cghFLasso [21]. For comparison with data from Sher et al. [19], data was retrieved from Gene Expression Omnibus series: GSE45787. To compare aCGH data from Sher et al. to data presented here, results were plotted using a custom R/Bioconductor program.
For WGS, for stages e9.5 and older, genomic DNA from one individual was used for each replicate, and for stage e8.0, 5-7 individuals from one litter were used for each replicate. Libraries for WGS were prepared from 40-50 ng genomic DNA using the Nextera TruSeq Dual Index Paired End Kit (Illumina) following manufacturer's instructions with the following modification: the Qiagen MinElute Reaction Cleanup Kit (Qiagen) was used to clean up tagmented DNA. Library quality was assessed using Qubit and Bioanalyzer, and libraries were sequenced on the Illumina HiSeq 2000 at approximately 10× coverage (Table S4) at the Stanford Center for Genomics and Personalized Medicine. 101 bp from each of the paired ends were sequenced, and sequencing reads were aligned against the mouse reference genome (mm9) using either the DNAnexus mapper [47] or the Novocraft Novoalign program. Data was analyzed using custom R/Bioconductor programs and SMASH [34]. To compare aCGH versus WGS data, results were plotted using a custom R/Bioconductor program.
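As a rough guide to what approximately 10× coverage implies for this design, the arithmetic can be sketched as follows. This is an illustrative calculation only, not the authors' pipeline: the function names are hypothetical, the 2,717 Mb genome size is the figure quoted in the Discussion, and the sketch assumes that every read maps and that no duplicates are removed. Actual per-sample coverage is reported in Table S4.

```python
import math

# Assumed genome size: the 2,717 Mb total quoted in the Discussion.
GENOME_SIZE_BP = 2_717_000_000
READ_LENGTH_BP = 101  # 101 bp sequenced from each end of a pair


def mean_coverage(n_read_pairs, read_length=READ_LENGTH_BP,
                  genome_size=GENOME_SIZE_BP):
    """Mean depth = total sequenced bases / genome size (idealized)."""
    return n_read_pairs * 2 * read_length / genome_size


def read_pairs_for_coverage(target_coverage, read_length=READ_LENGTH_BP,
                            genome_size=GENOME_SIZE_BP):
    """Smallest number of read pairs that reaches the target mean depth."""
    return math.ceil(target_coverage * genome_size / (2 * read_length))


if __name__ == "__main__":
    pairs = read_pairs_for_coverage(10)
    print(pairs, round(mean_coverage(pairs), 3))
```

Under these assumptions, a 10× library corresponds to roughly 135 million read pairs per sample.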
The final UR domain list was generated using e9.5 WGS data and a custom R/Bioconductor program with the following criterion: neighboring data points with a normalized log2 ratio of TGCs/embryo ≤ −0.3. This criterion was chosen after evaluating the program CNVnator [24], which identified UR domains with both large and small degrees of underrepresentation at a p-value of 0.01 but systematically missed UR domains that are closely spaced together, a limitation that our program rectifies.
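The thresholding step described above can be illustrated with a minimal sketch. This is not the authors' R/Bioconductor program: the function name `call_ur_domains` is hypothetical, the requirement of at least two consecutive points (`min_points=2`) is an assumed reading of "neighboring data points", and the input is assumed to be one chromosome's bins sorted by position with already normalized log2(TGC/embryo) ratios.

```python
def call_ur_domains(positions, log2_ratios, threshold=-0.3, min_points=2):
    """Group consecutive bins with log2(TGC/embryo) <= threshold into domains.

    positions   : bin coordinates on one chromosome, sorted ascending
    log2_ratios : normalized log2 ratio for each bin (same length)
    min_points  : assumed minimum run length for a call (not from the paper)
    Returns a list of (first_position, last_position) tuples.
    """
    domains = []
    run_start = None  # index where the current below-threshold run began
    for i, ratio in enumerate(log2_ratios):
        if ratio <= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_points:
                domains.append((positions[run_start], positions[i - 1]))
            run_start = None
    # Close a run that extends to the end of the chromosome.
    if run_start is not None and len(log2_ratios) - run_start >= min_points:
        domains.append((positions[run_start], positions[-1]))
    return domains


if __name__ == "__main__":
    pos = [0, 10, 20, 30, 40, 50, 60]
    ratios = [0.0, -0.4, -0.5, -0.1, -0.35, 0.1, -0.6]
    print(call_ur_domains(pos, ratios))
```

Because each run is closed as soon as a single bin rises above the threshold, closely spaced domains are reported separately rather than merged or dropped.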

Enrichment statistics
To calculate the significance of overlap between datasets, a binomial test was used to determine whether the observed overlap for the datasets was significantly greater than an expected overlap based on the average of 1,000 randomized datasets [31]. To randomize each dataset, regions were shuffled within bins according to their chromosomal distribution and distance from gene transcriptional start sites (including 1 kb, 10 kb, 100 kb, 1,000 kb, and >1,000 kb bins).
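The final step of this test can be sketched as a one-sided binomial tail probability. The sketch below is an illustration under stated assumptions, not the published implementation of [31]: the function names are hypothetical, the overlap counts from the randomized datasets are taken as given (the bin-stratified shuffling is not implemented here), and the expected per-region overlap probability is assumed to be the mean shuffled overlap divided by the number of regions.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def enrichment_p_value(observed_overlap, n_regions, shuffled_overlaps):
    """One-sided p-value that the observed overlap exceeds expectation.

    observed_overlap  : number of regions overlapping the other dataset
    n_regions         : total number of regions tested
    shuffled_overlaps : overlap counts from randomized datasets (e.g. 1,000)
    """
    mean_shuffled = sum(shuffled_overlaps) / len(shuffled_overlaps)
    expected_p = mean_shuffled / n_regions  # assumed null success probability
    return binomial_upper_tail(observed_overlap, n_regions, expected_p)


if __name__ == "__main__":
    # Hypothetical numbers: 30 of 47 regions overlap; shuffles average 10.
    print(enrichment_p_value(30, 47, [9, 10, 11, 10]))
```

Direct summation with `math.comb` is adequate for a few dozen regions; for much larger region counts a library routine with better numerical behavior would be preferable.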

3SEQ
Total RNA was extracted from fresh in vivo tissue by homogenizing in TRIzol Reagent (Life Technologies/Ambion) following manufacturer's instructions. Total RNA from three individuals from the same litter was combined to make each library. mRNA was isolated from 10-20 µg of total RNA using Dynabeads Oligo(dT)25 (Life Technologies/Ambion). 3SEQ libraries were prepared from mRNA following [30]. Briefly, mRNA was heat sheared for 7.5 minutes to produce an average fragment size range of 100-400 bp, then used to generate cDNA libraries using a custom oligo dT primer containing Illumina-compatible adapter sequence. cDNA fragments were end-repaired and ligated to standard Illumina adapters. Size-selection was performed using E-gel SizeSelect agarose gels (Invitrogen), and products were PCR amplified for 15 cycles and purified using Ampure XP beads (Beckman Coulter). Library quality was assessed using Qubit and Bioanalyzer, and libraries were sequenced on the Genome Analyzer IIx at the Stanford Center for Genomics and Personalized Medicine.
Total RNA was extracted and 3SEQ libraries were constructed for cultured TGCs as described in Chuong et al. [31]. Two replicates from different passages (10 million cells each) were used. 3SEQ data for TS cells was retrieved from Gene Expression Omnibus series: GSE42207 [31].
Sequences were aligned to the mouse (mm9) genome using the DNAnexus mapper [47] and raw counts for sense reads were analyzed using Unipeak 1.0 [48]. Regions of transcription were associated with the nearest ENSEMBL gene 3′ UTR within 5 kb. Data were normalized and expression levels were analyzed using the R/Bioconductor package DESeq [49].
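The assignment of transcribed regions to genes can be illustrated with a minimal sketch. It is not the Unipeak implementation: the function name `nearest_utr` and the tuple layout are hypothetical, and the sketch assumes that all intervals lie on the same chromosome and strand and that distance is measured between the closest interval edges.

```python
def nearest_utr(region_start, region_end, utrs, max_dist=5000):
    """Return the gene whose 3' UTR is closest to the region, within max_dist.

    utrs : list of (gene_name, utr_start, utr_end) on the same chromosome
           and strand as the region (an assumption of this sketch)
    Returns None when no 3' UTR lies within max_dist base pairs.
    """
    best_gene, best_dist = None, None
    for gene, utr_start, utr_end in utrs:
        if region_end < utr_start:        # region lies before the UTR
            dist = utr_start - region_end
        elif utr_end < region_start:      # region lies after the UTR
            dist = region_start - utr_end
        else:                             # intervals overlap
            dist = 0
        if dist <= max_dist and (best_dist is None or dist < best_dist):
            best_gene, best_dist = gene, dist
    return best_gene


if __name__ == "__main__":
    # Hypothetical annotation used only for illustration.
    utrs = [("GeneA", 1000, 2000), ("GeneB", 9000, 9500)]
    print(nearest_utr(2500, 2600, utrs))
```

Regions that fall farther than 5 kb from every annotated 3′ UTR are left unassigned rather than forced onto a distant gene.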

ChIP-seq
ChIP-seq and ChIP-seq analysis were performed as described in Chuong et al. [31] using the ChIP Assay kit (Millipore) following manufacturer's instructions. Briefly, 20 million cultured TGCs were cross-linked in 2% formaldehyde for 15 minutes, and sonicated for 12 cycles (30 seconds on/off) at 60% amplitude to produce a fragment range of 300-600 bp. Immunoprecipitation was performed with 2-5 µg of antibody (H3K4me3: ActiveMotif, 39159; H3K27me3: ActiveMotif, 39535; H3K27ac: Abcam, ab4729; H3K9me3: Abcam, ab8898; H3K4me1: Abcam, ab8895) conjugated to 50 µl of protein G Dynabeads (Invitrogen) overnight. Following washing and elution of DNA per manufacturer's instructions, libraries were prepared using the Illumina genomic DNA preparation kit using barcoded linker adapters, and sequenced on the Illumina HiSeq 2000 at the Stanford Center for Genomics and Personalized Medicine. ChIP-Seq data for TS cells was retrieved from Gene Expression Omnibus series: GSE42207 [31].
High-quality reads were aligned to the mm9 genome assembly using BWA 0.5.9 [50], retaining only unique alignments. Peaks were called using MACS2 2.0.10 [32]. The "bigwig_correlation" script from the Cistrome package [51] was used to generate genome-wide correlation plots between ChIP profiles and underrepresented profiles.

Replication timing
Cultured TS cells were incubated for two hours at 37 °C in the dark with a final concentration of 100 µM BrdU (Sigma Aldrich B5002). Genome-wide replication timing was analyzed as previously described [36]. Briefly, cells were dissociated into a single-cell suspension and nuclei were isolated. DNA was subsequently stained with propidium iodide and cells were FACS sorted into early and late S-phase fractions based on their DNA content. DNA from early and late S-phase fractions was purified by immunoprecipitation of the BrdU-substituted nascent DNA (BrdU-IP). Three replicates from different passages (two million cells each) were used. Data was normalized following [36]. The R/Bioconductor package DNAcopy was used to define replication timing domains based on the similarity in values (a constant value across a segment defines a domain) [36]. Regions called by DNAcopy were confirmed on the genome browser. The "bigwig_correlation" script from the Cistrome package [51] was used to generate genome-wide correlation plots between replication timing profiles and underrepresented profiles.

Accession codes and data availability
SuperSeries Gene Expression Omnibus (GEO) accession number for aCGH, 3SEQ, ChIP-Seq, and replication timing data: GSE50585.
Smoothed replication timing data can also be found at: http://www.replicationdomain.com/
BioProject accession number for WGS: PRJNA213010

Supporting Information
Figure S1 Collection of polyploid and 2N cells.

Top: A sequencing library is made from a genome containing a deletion between A and B. Some of these reads will span the deleted region (red arrowheads). Paired-end reads (red arrowheads) are 101 bp reads flanking an approximately 500 bp unsequenced region (red line). Bottom: Sequenced reads (red arrowheads) are aligned to the reference genome, which does not contain the deletion between A and B. If the distance between the paired-end reads is greater than the expected insert size ("discordant" paired-end read), then this indicates a deletion in the sequenced genome compared to the reference genome. Here, instead of mapping 500 bp apart, the paired ends map 10,000 bp apart (red dotted line), suggesting a deletion.

There is only one overrepresented region in megakaryocytes (containing the following annotated genes: Pisd-ps1, Sfi1). This region is located at the end of a chromosome (Chr 11), which suggests that it is an artifact. As both cultured TS and ES cells may have underrepresentations/overrepresentations due to culturing [22], underrepresentations/overrepresentations in TS cells could also be overrepresentations/underrepresentations in ES cells.
Putative underreplicated regions in TS cells generally do not correspond to UR domains in e9.5 TGCs. (XLSX)