Islands of retroelements are major components of Drosophila centromeres

Centromeres are essential chromosomal regions that mediate kinetochore assembly and spindle attachments during cell division. Despite their functional conservation, centromeres are among the most rapidly evolving genomic regions and can shape karyotype evolution and speciation across taxa. Although significant progress has been made in identifying centromere-associated proteins, the highly repetitive centromeres of metazoans have been refractory to DNA sequencing and assembly, leaving large gaps in our understanding of their functional organization and evolution. Here, we identify the sequence composition and organization of the centromeres of Drosophila melanogaster by combining long-read sequencing, chromatin immunoprecipitation for the centromeric histone CENP-A, and high-resolution chromatin fiber imaging. Contrary to previous models that heralded satellite repeats as the major functional components, we demonstrate that functional centromeres form on islands of complex DNA sequences enriched in retroelements that are flanked by large arrays of satellite repeats. Each centromere displays distinct size and arrangement of its DNA elements but is similar in composition overall. We discover that a specific retroelement, G2/Jockey-3, is the most highly enriched sequence in CENP-A chromatin and is the only element shared among all centromeres. G2/Jockey-3 is also associated with CENP-A in the sister species D. simulans, revealing an unexpected conservation despite the reported turnover of centromeric satellite DNA. Our work reveals the DNA sequence identity of the active centromeres of a premier model organism and implicates retroelements as conserved features of centromeric DNA.


Introduction
Centromeres are marked by the histone H3 variant centromere protein A (CENP-A; also called centromere identifier [Cid] in Drosophila and centromeric histone H3 [CenH3] in plants), which is necessary and sufficient for kinetochore activity [1,2]. Although epigenetic mechanisms play a major role in centromere identity and propagation [3], centromeric DNA sequences can initiate centromere assembly in fission yeast [4] and humans [5], and centromeric transcripts play a role in centromere propagation in human cells [6], suggesting that centromeric DNA-encoded properties may contribute to centromere specification [7]. However, our current understanding of most centromeres remains at the cytological level, as metazoan centromeres are embedded in highly repetitive, satellite-rich pericentric heterochromatin and thus are largely missing from even the most complete genome assemblies. Only recently, long-read single molecule sequencing technologies have made it possible to obtain linear assemblies of highly repetitive parts of multicellular genomes such as the human Y chromosome centromere [8] and maize centromere 10 [9].
Drosophila melanogaster provides an ideal model to investigate centromere genomic organization, as it has a relatively small genome (roughly 180 Mb), organized in just three autosomes (chromosomes 2, 3, and 4) and two sex chromosomes (X and Y) [10]. The estimated centromere sizes in Drosophila cultured cells range between approximately 200 and 500 kb [11] and map to regions within large blocks of tandem repeats [12-15]. While CENP-A associates with simple satellites in chromatin immunoprecipitation sequencing (ChIP-seq) data [16], it may bind to additional undiscovered sequences. The linear organization at the sequence level of any of the centromeres is unknown in this species. Early efforts to determine the structural organization of centromeres in D. melanogaster combined deletion analyses and sequencing of an X-derived minichromosome, Dp1187. These studies mapped the minimal DNA sequences sufficient for centromere function to a 420-kb region containing the AAGAG and AATAT satellites interspersed with "islands" of complex sequences [14,15]. However, it is unclear which parts of this minimal region comprise the active centromere, whether it corresponds to the native X chromosome centromere, and if other centromeres have a similar organization. By and large, satellites have been regarded as the major structural elements of Drosophila, human, and mouse centromeres [2,3,17].
In this study, we reveal the detailed organization of all functional centromeres in D. melanogaster. By mapping CENP-A on single chromatin fibers at high resolution, we discover that CENP-A primarily occupies islands of complex DNA enriched in retroelements, which are flanked by large blocks of simple satellites. Our genomic analyses show that all centromeres have a unique sequence organization, even though many of the centromeric elements are shared among centromeres. In particular, all centromeres are enriched for a non-long terminal repeat (non-LTR) retroelement in the Jockey family, G2/Jockey-3. Although none of these elements are specific to centromeres, they are significantly enriched within these regions. We also find G2/Jockey-3 enriched at the centromeres of D. simulans, which has centromeric satellite arrays highly divergent from those of D. melanogaster [16]. Collectively, these data are consistent with the model that retroelements may have a conserved role in centromere specification and function, as proposed for other species (for review, see [18]).

Identification of candidate centromeres by long-read sequencing and ChIP-seq
To identify the centromeric DNA sequences of D. melanogaster, we combined a long-read genome assembly approach [19] with four replicate CENP-A ChIPs on chromatin from D. melanogaster embryos, followed by paired-end Illumina sequencing (ChIP-seq). We also performed ChIP-seq in D. melanogaster Schneider 2 (S2) cells, a widely used model for cell division studies. We took four complementary approaches to discover regions of the genome enriched for CENP-A: (1) identifying simple repeats enriched for CENP-A based on kmers, (2) mapping reads to a comprehensive repeat library to summarize enriched transposable elements (TEs) and complex repeats, (3) using de novo assembly methods to assemble contigs from the ChIP reads and calculating enrichment relative to input post hoc, and (4) mapping reads to a heterochromatin-enriched assembly [19] and calling ChIP peaks (Fig 1A).
In our ChIP experiments, CENP-A pulls down simple satellites, consistent with a previous study [16]. Among the kmers most enriched in CENP-A ChIP relative to input are the dodeca satellite and its variants and complex kmers that include tandem (AATAG)n and (AATAT)n repeats (Fig 1B, S1 Fig and S1 Table). Prodsat (Prod satellite; also known as the 10-bp satellite) is abundant in the CENP-A ChIP but is not enriched relative to input (Fig 1B). In addition to satellites, we found that CENP-A is also strongly associated with retroelements, particularly non-LTR long interspersed nuclear element (LINE)-like elements in the Jockey family and with the intergenic spacer of the ribosomal genes (IGS). Among the Jockey elements, the most highly enriched in CENP-A ChIPs are annotated as G2 and Jockey-3 (Fig 1C and S2 Table). Our phylogenetic analysis suggests that G2 and Jockey-3 correspond to the same type of element, as genomic copies of these two elements are interleaved across the tree and not monophyletic (S2 Fig). Thus, we hereafter collectively refer to these elements as G2/Jockey-3.
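The kmer-based comparison of ChIP and input can be illustrated with a minimal sketch. It assumes plain read strings and a fixed kmer length, and it does not collapse rotations or reverse complements of a repeat unit, which a real satellite analysis would likely need to do. The function names, the pseudocount, and the toy reads are hypothetical and are not taken from the pipeline used in this study.

```python
from collections import Counter
from math import log2


def kmer_fractions(reads, k):
    """Return each kmer's share of all kmers observed in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def log2_enrichment(chip_reads, input_reads, k, pseudo=1e-6):
    """log2 of (ChIP fraction / input fraction) for every observed kmer.

    The pseudocount (an assumption of this sketch) avoids division by
    zero for kmers that are absent from one of the two libraries.
    """
    chip = kmer_fractions(chip_reads, k)
    inp = kmer_fractions(input_reads, k)
    kmers = set(chip) | set(inp)
    return {
        kmer: log2((chip.get(kmer, 0.0) + pseudo) / (inp.get(kmer, 0.0) + pseudo))
        for kmer in kmers
    }


# Toy reads (hypothetical): the AATAG repeat makes up a larger share
# of the ChIP library than of the input library.
chip_reads = ["AATAGAATAGAATAG", "ACGTACGTACGTACG"]
input_reads = ["AATAGAATAGAATAG"] + ["ACGTACGTACGTACG"] * 3

enrichment = log2_enrichment(chip_reads, input_reads, k=5)
```

In this toy example the AATAG kmer has a positive log2 enrichment and the ACGTA kmer a negative one, which mirrors the distinction drawn above between sequences that are enriched relative to input and sequences that are merely abundant in the ChIP.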
To detect CENP-A-enriched sequences independently of known repeats in repeat libraries or of genome assemblies, we de novo assembled CENP-A ChIP reads into contigs (i.e., ChIPtigs [20]) and calculated their CENP-A enrichments. The resulting CENP-A-enriched ChIPtigs primarily contained fragments of TEs, other complex repeats, and some simple satellite repeats (S3 Table).
To determine the genomic location of CENP-A-enriched sequences, we mapped ChIP reads to a new reference genome assembly that we generated using a heterochromatin-enriched assembly method resulting in greater representation of heterochromatin-associated regions [19] (S4 Table and S1 Text). Five contigs were consistently the most CENP-A enriched in the assembly, with highly reproducible ChIP peaks across technical and biological replicates (irreproducible discovery rate [IDR] < 0.05; S3 Fig and S5 Table). These CENP-A-enriched contigs have a similar organization: they contain islands of complex DNA (e.g., TEs) flanked by simple tandem satellite repeats with known centromeric locations (Fig 2, S4 Fig and Table 1). The candidate centromeric contig for the X chromosome (Contig79) is 70 kb and contains a 44-kb island of complex DNA (called Maupiti [15]), flanked by a short stretch of AAGAT satellite on one side and embedded in AAGAG satellite (Fig 2A). This region has an organization that is nearly identical to that of the Dp1187 minichromosome putative centromere [14,15], suggesting that this contig may contain at least part of the endogenous X centromere (CenX). The candidate centromeric contig for chromosome 4 (Contig119) contains a 42.8-kb island (we named Lampedusa) flanked by the AAGAT satellite (Fig 2B). This contig is consistent with the cytological location of the AAGAT satellite on chromosome 4 and with a recent report on the centromere of a B chromosome derived from chromosome 4 [21]. The candidate centromeric contig for chromosome Y (Y_Contig26) consists of a 138-kb island (we named Lipari; Fig 2C). The candidate centromeric contig for chromosome 3 (Contig 3R_5) contains a 68.5-kb island (we named Giglio) flanked by Prodsat and the dodeca satellite, which map to this centromere cytologically [12,22,23] (Fig 2D).
Finally, the candidate contig for chromosome 2 (tig00057289) contains a small 1.8-kb complex island (we named Capri) flanked by the AATAG and AAGAG satellites (Fig 2E). The majority of the top enriched de novo ChIPtigs (88/100 for R1, 19/30 for R2, 26/30 for R3, and 82/100 for R4) map uniquely to these five contigs (S3 Table), providing independent support for the assembly and further substantiating our hypothesis that these contigs correspond to the centromeres.
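The proportions behind "the majority of the top enriched de novo ChIPtigs" can be worked out directly from the counts quoted above. The following is only an arithmetic sketch; the dictionary and function names are hypothetical.

```python
# Counts quoted in the text: (ChIPtigs mapping uniquely to the five
# candidate contigs, top enriched ChIPtigs examined) per replicate.
chiptig_hits = {
    "R1": (88, 100),
    "R2": (19, 30),
    "R3": (26, 30),
    "R4": (82, 100),
}


def mapped_fraction(hits):
    """Fraction of top ChIPtigs that map to the candidate contigs."""
    return {rep: mapped / total for rep, (mapped, total) in hits.items()}


fractions = mapped_fraction(chiptig_hits)
for rep, frac in sorted(fractions.items()):
    print(f"{rep}: {frac:.1%}")
```

Every replicate exceeds one half, with R2 the lowest at roughly 63% and R1 the highest at 88%.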

Genomic distribution of CENP-A in embryos and S2 cells
Our ChIP-seq experiments and their analyses provide evidence that CENP-A is specifically associated with the island DNA sequences for Contig79 (X Maupiti), Contig119 (4 Lampedusa), Y_Contig26 (Y Lipari), and 3R_5 (3 Giglio) and with a single interspersed G2/Jockey-3 fragment within tig00057289 (2 Capri; Fig 2 and S4 Fig). A previous study that used a D. melanogaster native ChIP-seq dataset (using anti-green fluorescent protein [GFP] antibodies and CENP-A-GFP-expressing embryos) focused exclusively on the quantification of simple repeats and did not identify any complex DNA associated with CENP-A [16]. However, our reanalysis of this dataset showed association of CENP-A-GFP with the centromere islands (S3B Fig and S4 and S6 Tables). We validated individual elements for which we could design contig-specific quantitative PCR (qPCR) primers in additional independent CENP-A ChIP experiments and confirmed that the CENP-A peaks in these regions are not a result of library amplification bias from ChIP-seq (S5 Fig) [31].
Fig 2 caption (partial): Different repeat families are color coded (see legend; note that Jockey elements are shown in one color even though they are distinct elements). Shown are the normalized CENP-A enrichment over input (plotted on a log scale) from one replicate (replicate 2, other replicates are in S4 Fig) colored in gray for simple repeats and black for complex island sequences. Although the mapping quality scores are high in simple repeat regions, we do not use these data to make inferences about CENP-A distribution (see text for details). The coordinates of the significantly CENP-A-enriched ChIPtigs mapped to these contigs (black), and the predicted ChIP peaks (orange) are shown below each plot. See also S4 Fig and S3 Table.
Having shown that CENP-A is associated with the complex islands, we next analyzed if the centromere extends to the surrounding satellite DNA. Simple sequences flanking the islands also show CENP-A ChIP signal (Fig 2 and S1 Fig). However, it is difficult to quantify the enrichment of CENP-A on simple satellite repeats for several reasons: (1) simple satellite sequences may be over- or underrepresented as an artifact of library preparation [31], particularly for ChIP-seq experiments that rely on PCR amplification to construct libraries; (2) satellites are abundant genomic sequences that are largely missing from whole genome assemblies [10], making it difficult to precisely quantitate how much of these sequences exist in genomes (and therefore how much to expect in the input); (3) highly abundant repeats are expected to have a low signal-to-noise ratio if only a small fraction of a simple repeat is enriched in CENP-A relative to the overall abundance of this satellite in the genome; and (4) simple satellite repeats present a challenge for even long read-based genome assembly methods [32].
Whereas we are confident in large-scale structural features of our assembly involving highly repetitive sequences, we observe even PacBio read depth in islands but not on simple satellites (S6 Fig), giving us less confidence in the base pair resolution of the assembly at simple repeats. Because of these limitations, we caution against using strictly assembly-based approaches in regions with simple repeats. Nonetheless, we report the ChIP peaks on simple satellites (shaded in gray in Fig 2). To confirm satellite localization near each centromere, we employed immunofluorescence (IF) with anti-centromere protein C (CENP-C; an inner kinetochore protein that colocalizes with CENP-A), followed by fluorescence in situ hybridization (FISH) with probes for the satellites dodeca, AAGAG, AATAT, AAGAT, AATAG, and Prodsat on metaphase chromosome spreads from third instar larval brains (S7 Fig); a summary of the colocalization data is shown in Table 1.
Although CENP-A localizes exclusively to the centromeres at the cytological level, it is possible that low levels of CENP-A occupy noncentromeric DNA. We found a low but consistent CENP-A enrichment at genomic regions outside of the centromere islands, including some telomere-associated elements (e.g., TART-A), rDNA genes from the rDNA clusters, and the LINE-like retroelements DMRT1B and R2 (Fig 1C, S4 Table and S1 Text). Many of these associations likely represent nonspecific peaks [33], as they were not highly enriched in CENP-A ChIP-qPCR (S5 Fig). However, previous studies found evidence for an association of some centromeric proteins with the nucleolus [34], perhaps relating to the possible association between CENP-A and rDNA or rDNA-associated retroelements (e.g., R2) that we detect. We also noted that noncentromeric copies of G2/Jockey-3 were not consistently enriched in CENP-A (S8 Table). CENP-A ChIP-seq reads from S2 cells showed a similar enrichment profile of sequences represented in the embryo ChIP-seq data (e.g., IGS and G2/Jockey-3) but were much more enriched for additional retroelements that were not represented within our centromere contigs (e.g., LTR elements Dm1731, HMSBeagle, and Max-I; S2 Table). We also observed a similar pattern of CENP-A enrichment on simple satellite repeats in S2 cells (AATAT, AATAG, AAGAG, Prodsat, and dodeca; S1 Table), and we confirmed that these satellites are near centromeres cytologically using IF-FISH in S2 cells (S8 Fig). However, complex satellites that are pericentric in embryos, including complex satellites in the 1.688 family and Responder (Rsp), are CENP-A-enriched in S2 cells (S2 Table). This suggests that the centromeres of S2 cells may have expanded into regions that are pericentromeric in flies; the additional retroelements enriched in CENP-A may be pericentric, or they may represent new retroelement insertions that occurred in this cell line. Our findings are consistent with the extensive structural rearrangements and aneuploidy reported for these cells [35].
Table 1. Location of centromeric and centromere-proximal satellites in D. melanogaster. Locations of satellites on chromosomes X, Y, 2, 3, and 4 according to previous reports and our observations in this report by IF-FISH in the D. melanogaster sequenced strain iso-1. Each satellite location is characterized as being centromeric (overlaps with CENP-C), pericentric (juxtaposed to CENP-C), or heterochromatic (more distal than pericentric). Note that the dodeca satellite includes its variants and that Prodsat is also known as the 10-bp satellite.
Table 1 columns: Satellite; Sequence; Previous Reports (Cen, Peri, Het); This Study (Cen, Peri, Het).

Centromeres are unique but are composed of similar non-LTR retrotransposons
Although each island has a distinct arrangement of AT-rich sequences, repeats, and TEs, their composition is overall similar. In particular, non-LTR retroelements in the Jockey family such as G2/Jockey-3, Doc, and Doc-2 are especially abundant within CenX, Cen4, and CenY (Figs 2 and 3A). G2/Jockey-3 is the only element present in all five of our centromere contigs, suggesting a potential role in centromere function or specification. In our phylogenetic analysis of genomic G2/Jockey-3 repeats in D. melanogaster, we cannot distinguish G2/Jockey-3 elements at centromeres from those across the genome, suggesting that centromeric copies do not have a single origin (Fig 3B and S1 Text). Although G2/Jockey-3 is not unique to centromeres and thus cannot be sufficient for centromere identity, it is significantly enriched at centromeres: approximately 63% of all genomic copies of G2/Jockey-3 are found within our candidate centromere contigs (Fig 4 and S9 Table). G2/Jockey-3 elements show signs of recent or ongoing activity based on their insertion polymorphism [36], pattern of 5′ truncation (see S1 Text and Dryad repository file 13: https://doi.org/10.5061/dryad.rb1bt3j [37]), and expression (S9A Fig). At least some of this expression comes from the centromeres: we analyzed total embryo RNA extracts by reverse-transcription qPCR (RT-qPCR) using primers targeting centromere-associated copies and found evidence for low levels of G2/Jockey-3 transcription from copies in CenX, Cen4, and Cen3. We found no or negligible expression from the G2/Jockey-3 copies that we measured on CenY and Cen2 (S9B Fig).
In addition to G2/Jockey-3, the 3 Giglio island has 240 copies of a centromere-enriched variant of the ribosomal IGS (S1 Text and S10 Fig). Among the islands, 2 Capri differs the most, being the smallest and harboring only a single fragment of G2/Jockey-3 (Fig 2E). As was previously reported for the X-derived Dp1187 centromere [14,15], none of the sequences contained within these islands are exclusive to centromeres. However, in addition to G2/Jockey-3, several of these elements are enriched in these regions compared to the rest of the genome.

Validation of centromeric contigs
To verify the association of our contigs with the centromeres, we performed IF with anti-CENP-C antibodies, followed by FISH with satellite probes and custom-designed Oligopaints libraries [38] (see Materials and methods) for X Maupiti, 4 Lampedusa, Y Lipari, and 3 Giglio (Fig 5, S12 Fig, and S1 Text). The X Maupiti Oligopaints hybridized to CenX as well as CenY on third instar male larval brain metaphase spreads (Fig 5A and S12A Fig). Similarly, the Oligopaints for 4 Lampedusa hybridized to Cen4 as well as to CenY (Fig 5B and S12B Fig), suggesting that Oligopaints for X Maupiti and 4 Lampedusa have homology to sequences at or near CenY. In contrast, the Oligopaints for Y Lipari (Fig 5C and S12C Fig) and 3 Giglio were specific for their respective centromeres (Fig 5D and S12D Fig). We could not use Oligopaints to validate 2 Capri because of its small size, but its organization, with the AATAG and AAGAG satellites flanking a small CENP-A-enriched island (Fig 2E), is consistent with our FISH analyses (Fig 5E). In line with the CENP-A ChIP-seq data, we observed significant differences between S2 cells and embryo centromeres by Oligopaint FISH. With the exception of 3 Giglio, centromeric island organization in S2 cells is dramatically different from larval brain metaphase spreads (S13 Fig and S1 Text), in contrast to the conservation of the centromeric distribution of simple satellites (S8 Fig).
D. melanogaster centromeres tend to cluster in the interphase nucleus cytologically [39,40]. We found independent support for the complex islands being centromeric by analyzing previously published Hi-C data from D. melanogaster embryos [41]. Island-island interactions were among the most frequent interchromosomal interactions, followed by interactions between islands and their own proximal pericentric heterochromatin and lastly by interactions between islands and distal pericentric heterochromatin or euchromatin (S14 Fig and S1 Text). This analysis also shows that indeed native centromeres interact with one another physically in the 3D nucleus.
Fig 3 caption (partial): ... [37]. The underlying data can be found in S1 Data. (B) Maximum-likelihood phylogenetic tree based on the entire sequence of all G2/Jockey-3 copies in D. melanogaster inside (squares) and outside (circles) of centromeric contigs and on the consensus repeat in its sister species D. sechellia and D. simulans and a more distantly related species (D. yakuba). The tree shows that centromeric G2/Jockey-3 elements do not have a single origin (see Dryad repository files 13 and 15: https://doi.org/10.5061/dryad.rb1bt3j [37]). Cen2, centromere 2; Cen3, centromere 3; Cen4, centromere 4; CenX, X centromere; CenY, Y centromere; ETS, external transcribed spacer; IGS, intergenic spacer of the ribosomal genes; ITS, internal transcribed spacer; LTR, long terminal repeat; Prodsat, Prod satellite; TART, Telomere-associated retrotransposon.

Analysis of extended chromatin fibers reveals that CENP-A primarily occupies the islands
Based on the enrichment of CENP-A with island-associated repeats, we hypothesized that the TE-enriched islands are major centromere components in D. melanogaster. To test this, we investigated CENP-A occupancy, a direct reflection of centromere activity, and estimated the size of each centromere by visualizing extended chromatin fibers [11,42]. This method has two major advantages: it does not rely on mapping low complexity ChIP-seq reads, thus providing more information than can be inferred from ChIP-seq, and it affords single-chromosome, rather than population, information on CENP-A localization. We carried out IF with anti-CENP-A antibodies and FISH with Oligopaint and satellite probes on cells from third instar larval brains, selecting females to ensure specificity for our X Maupiti and 4 Lampedusa Oligopaints (Fig 5). First, we calibrated our fiber stretching using three FISH probes spanning 100 kb: two heterochromatic (one for the Rsp locus [43] and one Oligopaint targeting the pericentromere of chromosome 3L; see Materials and methods for coordinates) and one euchromatic (an Oligopaint targeting a region approximately 600 kb from the telomere of chromosome 3L; see Materials and methods). The estimated stretching for these fibers is approximately 10.1 kb/μm for all three locations, with no significant difference among them (P = 0.085; S15 Fig). We next determined the sizes of the CENP-A domain and corresponding island of each centromere (Fig 5 and S16, S17, S18, S19, S20 and S21 Figs). The size of the CENP-A domain varies between centromeres, ranging in mean size between 101 and 171 kb (about 11-17 μm), smaller than previous estimates that relied on the measuring of a mixture of centromeres in Drosophila Kc and S2 cells [11]. This is consistent with our ChIP-seq analysis suggesting that S2 cells may have expanded centromeres. X, Y, and 2 are the largest centromeres, whereas 3 and 4 are the smallest (Fig 5F and 5G). CENP-A primarily occupies the centromeric islands X Maupiti, 4 Lampedusa, Y Lipari, and 3 Giglio (about 70% of the CENP-A domain overlaps with the Oligopaint FISH signal; Fig 5G and S16, S17, S18, S19, S20 and S21 Figs). In some fibers, the X Maupiti Oligopaint FISH signal showed interspersion with FISH signal for the AAGAG satellite (S16 Fig); this could be due to nonspecific binding of the AAGAG probe during FISH, which is optimized for Oligopaint specificity, or to a possible collapse of AAGAG repeats in our assembly, including within Maupiti. We also noticed that the estimated length of the Oligopaint-stained region was larger than the size of Maupiti in our CenX contig (100.5 ± 49 kb versus 44 kb; Figs 2A and 5G), a discrepancy that we attribute to variability in Maupiti Oligopaint probe hybridization. Alternatively, there could be additional sequences with similarity to Maupiti interspersed in the flanking satellites near the contig (and not included in our assembly).
Fig 5 caption (partial): Percent overlap corresponds to CENP-A domain length/Oligopaint FISH length. The difference between the sizes of the CENP-A domain and the corresponding islands is significant (unpaired t test). Additional fibers are shown in S16, S17, S18, S19, S20 and S21 Figs. Cen2, centromere 2; Cen3, centromere 3; Cen4, centromere 4; CENP-A, centromere protein A; CENP-C, centromere protein C; CenX, X centromere; CenY, Y centromere; FISH, fluorescence in situ hybridization; IF, immunofluorescence; IGS, intergenic spacer of the ribosomal genes; n/a, not applicable; Prodsat, Prod satellite.
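The fiber calibration lends itself to a short worked conversion between measured fiber length and DNA length. The sketch below assumes a single uniform stretching factor of 10.1 kb/μm, as estimated above. The function names and the example fiber length are hypothetical and are not measurements from this study.

```python
# Stretching factor estimated in the text from three 100-kb FISH probes.
KB_PER_UM = 10.1


def um_to_kb(length_um, kb_per_um=KB_PER_UM):
    """Convert a measured fiber length (micrometers) to kilobases."""
    return length_um * kb_per_um


def kb_to_um(length_kb, kb_per_um=KB_PER_UM):
    """Convert a DNA length (kilobases) to the expected fiber length."""
    return length_kb / kb_per_um


# Hypothetical example: a CENP-A domain measured at 12.0 um on a fiber.
example_domain_kb = um_to_kb(12.0)

# Expected fiber length of the 138-kb Lipari island under the same factor.
lipari_um = kb_to_um(138.0)

print(f"12.0 um corresponds to about {example_domain_kb:.1f} kb")
print(f"138 kb corresponds to about {lipari_um:.1f} um")
```

Because the factor is applied uniformly, any local variation in fiber stretching carries directly into the size estimate, which is one reason the calibration was checked at both heterochromatic and euchromatic loci.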
Analysis of Cen4 shows that the CENP-A domain overlaps primarily with 4 Lampedusa and partially with the flanking AAGAT satellite. Fibers for 3 Giglio show colocalization between CENP-A and the island as well as a short, variable region of colocalization with flanking dodeca satellite (Fig 5D, S19 Fig and S20 Fig). We did not observe CENP-A signal on the opposite side of Giglio, where Prodsat is located according to our assembly (Fig 5D). The Cen3 satellite dodeca colocalizes with CENP-A on fibers in S2 cells [12] and is highly enriched in our CENP-A ChIP-seq (Fig 1B and S1 Fig). When we tracked longer fibers from 3 Giglio along dodeca, we observed a second CENP-A domain in which dodeca is interrupted by short fragments of Oligopaint FISH signal (S20 Fig), suggesting the existence of DNA sequences with homology to Giglio interspersed within dodeca that are not included in our assembly. A previous study identified sequences with homology to IGS within the dodeca satellite in one bacterial artificial chromosome (BAC) [12]. It is possible that the dodeca-associated Oligopaint FISH signal in our extended fibers corresponds to these additional IGS sequences. These data indicate that Cen3 has two CENP-A domains, a major one on 3 Giglio and a minor one on dodeca, although these appear as a single domain in standard metaphase spread IF. Unlike Cen3, all other centromeres display a single CENP-A domain by fiber analysis (e.g., see S21 Fig for Cen2). Our conclusions differ from those of the Talbert et al. study [16], which concluded that dodeca was not associated with CENP-A. As recognized by the authors, it is possible that different chromatin preparations, such as the MNase digestion, may introduce biases, leading to an underrepresentation of sequences like dodeca in ChIPs [16].
Lastly, we analyzed the organization of 2 Capri using FISH with a satellite combination unique to this chromosome (AATAG, AAGAG, and Prodsat) and found that the CENP-A domain overlapped with all three satellites (Fig 5E and S21 Fig). Thus, we speculate that the Prodsat sequences pulled down by CENP-A as seen in our kmer analysis (Fig 1B) and reported previously [16] are coming from Cen2, not Cen3. We therefore conclude that D. melanogaster CENP-A is primarily associated with the centromeric islands of chromosomes X, 4, Y, and 3 and less predominantly with the flanking satellites (Fig 5G).

G2/Jockey-3 is centromere-associated in D. simulans
The G2/Jockey-3 retroelement is a recently active transposon [36] shared among all D. melanogaster centromeres (Fig 3A). To determine if G2/Jockey-3 is enriched at the centromeres outside of D. melanogaster, we investigated its centromeric distribution in its sister species, D. simulans, which diverged from D. melanogaster only about 2 million years ago [44] and yet displays major differences in satellite composition and distribution [25,45]. These differences are especially apparent in centromeric regions, where D. melanogaster displays simple satellite repeats whereas D. simulans contains complex satellite repeats with larger repeat units [16]. We reanalyzed published D. simulans cell line CENP-A ChIP-seq data [16] (see S1 Text) and found that G2/Jockey-3 elements are also highly enriched in CENP-A in this species, as in D. melanogaster. The pileups of CENP-A ChIP-seq reads on G2/Jockey-3 show that CENP-A is associated with the entire length of the retroelement in both D. simulans and D. melanogaster, with no apparent affinity for any particular sequence (Fig 6A and 6B).
To validate the association of G2/Jockey-3 with D. simulans centromeres, we designed a FISH probe that targets about 1.6 kb at the 3′ end of the D. melanogaster G2/Jockey-3 consensus sequence (see Materials and methods; approximately 94% identical to D. simulans G2/Jockey-3 consensus sequence) and performed IF-FISH on male larval brain metaphase spreads with anti-CENP-C antibodies, which recognize CENP-C in both species [46]. We observed colocalization between CENP-C and G2/Jockey-3 at all D. simulans centromeres (Fig 6C; note that chromosomes 2 and 3 of D. simulans cannot be distinguished morphologically [25]). The same probe showed colocalization of CENP-C and G2/Jockey-3 at all D. melanogaster centromeres, except at Cen2, which is consistent with our model for this centromere showing only one copy of G2/Jockey-3 (Figs 6D and 2E). Based on these observations, we infer that G2/Jockey-3 is a conserved centromere-associated retroelement in these species.

Discussion
Our study shows that combining long-read sequencing with ChIP-seq and chromatin fiber FISH is a powerful approach to discover centromeric DNA sequences and their organization. We reveal that for all but one chromosome (chromosome 2, which has a single G2/Jockey-3 element), approximately 70% of the functional centromeric DNA of D. melanogaster is composed of complex DNA islands. The islands are rich in non-LTR retroelements and are buried within large blocks of tandem repeats (Fig 7A). They likely went undetected in previous studies of centromere organization (e.g., [12]) because three of the five islands are either missing or incomplete in the published reference D. melanogaster genome [10]. A recent study reported that satellite DNA repeats make up the majority of centromeric DNA in D. melanogaster embryos and S2 cells, by counting the relative number of motifs matching simple repeats in CENP-A ChIP relative to input [16]. Our reanalysis of those data showed CENP-A enrichment on the islands, suggesting that having an improved reference genome assembly [19] is crucial for identifying centromeric DNA sequences. To our knowledge, this is the first detailed report on the linear sequence of all centromeres in a multicellular organism. Our overall strategy therefore provides a blueprint for determining the composition and organization of centromeric DNA in other species.
To date, satellite DNAs have been regarded as the main sequence components of the centromeres of primary animal model systems: humans, mice, and Drosophila [2,3,17]. However, retroelements are abundant and widespread at the centromeres of plants such as maize [48] and rice [49,50]. Retroelements are also found at the centromeres of fungi [53], humans [56], marsupials [63], bats [54], and gibbons [55], suggesting that they may be common centromeric features (Fig 7B). Our study shows that retroelements, particularly G2/Jockey-3, are not merely present near centromeres but are components of the active centromere cores through their association with CENP-A. Our BLAST search for G2/Jockey-3 retroelements suggests that they are restricted to the melanogaster subgroup; therefore, we hypothesize that different non-LTR retroelements may be present at the centromeres of other Drosophila species. Why retroelements are such ubiquitous components of centromeres and whether they play an active role in centromere function remain open questions. In maize, centromeric retroelements invade neocentromeres following their inception [64], suggesting a preference for DNA sequences associated with CENP-A chromatin for retroelement insertions [18]. On the other hand, evolutionarily new centromeres in Equus asinus lie in LINE-rich regions [65], and a LINE element was found to be an integral component of a human neocentromere [57,66], raising the possibility that it is CENP-A that may bind preferentially to retroelement-associated genomic regions [18]. Other models have proposed that retroelements could produce noncoding RNAs that affect centromere specification [18,66] and that retroelement activity could help maintain centromere size through retrotransposition or by giving rise to tandem repeats via recombination-mediated mechanisms (e.g., [67,68]; reviewed in [69]).
Centromeric transcription contributes to centromere homeostasis in several organisms, including fission yeast [70,71], wallaby [72], human [6,73], and Drosophila cells [74,75]. Our preliminary analysis with quantitative RT-PCR using centromere-specific G2/Jockey-3 primer sets shows some evidence for low levels of centromere expression.

In addition to retroelements, the centromeres of D. melanogaster display a diverse assortment of repeats, none of which are exclusive to centromeres, with the exception of IGS, for which we identified a centromere-enriched variant. The identification of the IGS tandem repeat within 3 Giglio is intriguing, as IGS sequences are dynamic in the potato [76], where they are located near the centromere, as well as in the tobacco [77], the tomato [78], and the common bean [79], where they show a dispersed pattern over several chromosomes. The origin of novel tandem repeats is still elusive, but one way it has been proposed to occur for the IGS repeat in plants is through the initial insertion of a retroelement within rDNA, followed by IGS duplication, amplification, and transposition to a new locus [78].

Fig 6. ... [16] showing that G2/Jockey-3 is enriched in CENP-A in D. simulans. The labels "15m" and "5m" indicate minutes of MNase digestion, and IP and IP2 are technical replicates. Note that the first 487 bp of D. simulans G2/Jockey-3 consensus sequence, which are homologous to the D. simulans 500-bp satellite, are not included in this figure; the 500-bp satellite was previously reported as enriched in CENP-A in D. simulans [16]. (B) Plot of the normalized CENP-A enrichment over input across the D. melanogaster G2/Jockey-3 consensus sequence using our CENP-A ChIP-seq replicates (R1-R4) and ChIP-seq from CENP-A-GFP transgenic flies from Talbert and colleagues [16]. The underlying data for (A-B) can be found in S1 Data. IF-FISH on (C) D. simulans (w501) and (D) D. melanogaster (iso-1) mitotic chromosomes from male larval brains using an antibody for CENP-C (magenta) and FISH with a G2/Jockey-3 DIG-labeled FISH probe (yellow). DAPI is shown in gray. Bar, 5 μm. CENP-A, centromere protein A; CENP-C, centromere protein C; ChIP, chromatin immunoprecipitation; ChIP-seq, ChIP sequencing; DIG, digoxigenin; FISH, fluorescence in situ hybridization; GFP, green fluorescent protein; IF, immunofluorescence; IP, immunoprecipitation. https://doi.org/10.1371/journal.pbio.3000241.g006

Fig 7. ... For at least CenX, Cen4, and Cen3, the bulk of CENP-A chromatin is associated with the centromere islands, whereas the remaining CENP-A is on the flanking satellites. The sequences flanking the Y centromere are not in our assembly, so whether CENP-A is also on satellites is unknown. Although the complexity of island DNA allowed us to identify centromere contigs by long-read sequencing, the flanking satellites remain largely missing from our genome assembly because of their highly repetitive nature. The approximate satellite size estimates are based on Jagannathan and colleagues' work [25]. (B) Phylogenetic tree showing that centromere-associated retroelements are common across highly diverged lineages: Gossypium hirsutum (cotton) [47], Zea mays mays (maize) [9,48], Oryza sativa (rice) [49-51], Triticum boeoticum (wild wheat) [52], Cryptococcus [53], Phyllostomid (bat) [54], Hoolock leuconedys (gibbon) [55], Homo sapiens (human) [56] (and a human neocentromere [57]), Macropus eugenii (tammar wallaby) [58-60], Phascolarctos cinereus (koala) [61], and D. melanogaster (this study for endogenous centromeres; also in an X-derived minichromosome [14,15]). The phylogeny was constructed using TimeTree [62]. Indicated are the retroelement type and the clade that the element belongs to with element types as follows: LTR and non-LTR. The circles indicate the experimental evidence for centromere association of retroelements: FISH, CENP-A ChIP-seq (ChIP), and genome or BAC sequencing (Seq). BAC, bacterial artificial chromosome; CENP-A, centromere protein A; CenX, X centromere; ChIP-seq, chromatin immunoprecipitation sequencing; CRM, centromeric retrotransposons of maize; CRR, centromeric retrotransposons of rice; CRW, centromeric retrotransposons of wheat; FISH, fluorescence in situ hybridization; LAVA, LINE-Alu-VNTR-Alu-like; LINE, long interspersed nuclear element; LTR, long terminal repeat; Mya, million years ago.
Defining the span of the CENP-A domain is important to understand precisely which sequences are associated with centromere activity and which are part of pericentric heterochromatin. Although we are able to confidently map our ChIP-seq reads to the islands to determine CENP-A occupancy, the same cannot be done for simple satellites, because of the limitations of mapping to highly repetitive DNA. We therefore infer the organization of the centromere from analyzing extended chromatin fibers by IF-FISH. Blocks of simple satellite sequences flank the islands on each of our contigs, with the exception of the CenY contig. However, these regions represent only a fraction of the estimated abundance of those repeats in the genome. For example, the dodeca satellite occupies approximately 1 Mb of the genome [27], yet only about 570 kb of dodeca sequence are included in the assembly, with just roughly 35 kb of dodeca on the Cen3 contig. Therefore, for many satellite sequences, inferences based on read mapping, even uniquely mapped reads, are confounded by the underrepresentation of satellites in the assembly. Our analysis of chromatin fibers suggests that CENP-A spans beyond the islands into the simple satellites, although the precise boundaries remain unknown (Fig 7A).
The finding that CENP-A can bind to several different sequences that are not uniquely associated with centromere regions is consistent with the epigenetic model of centromere specification, which proposes that specific sequences alone do not govern centromere activity [3]. Yet it is possible that the diverse sequence arrangements observed at each centromere somehow contribute to centromere activity or specification [18,48]. Possible mechanisms include the promotion of unusual types of transcription, as reported for fission yeast [80], or the formation of non-B DNA structures (e.g., stem loops, hairpins, and triplexes) that may promote CENP-A deposition [7,12,81]. Knowing the identity of D. melanogaster centromeric DNA will enable the functional interrogation of these elements in this powerhouse model organism.

ChIP-seq
CENP-A ChIPs were performed using an affinity-purified rabbit anti-CENP-A antibody (gift of Gary Karpen). We previously verified that this antibody works well for ChIP using S2 cells that contain LacI/lacO inducible ectopic centromeres, in which CENP-A ChIP pulled down lacO plasmid DNA sequences [82].
ChIP in embryos. Embryo (wild type line Oregon-R) collection, fixation, and chromatin isolation were performed as described in [83]. We carried out four ChIP replicates as follows. From one embryo collection, we generated chromatin used in R1; from a second independent embryo collection, we generated chromatin used for replicates R2-4. We used formaldehyde-crosslinked overnight collections of Oregon-R embryos (about 1.5 g per collection). Chromatin was sheared to 200-500 bp using a Bioruptor sonicator (Diagenode), aliquoted, and flash frozen. The first biological replicate (R1) was performed following the protocol in [83] using 165 μg of chromatin (in 500 μl volume and 30 μl of protein A agarose beads) and 2 μl of anti-CENP-A antibody. For R2, 3, and 4, we used the MAGnify kit, with 15 μl of dynabeads, approximately 60 μg of chromatin in 200 μl volume, and 3 μl of anti-CENP-A antibody. Libraries were made from eluted DNA using the TruSeq ChIP kit (Illumina) for R1 and R4, whereas the Accel-NGS 2S Plus DNA Library (Swift Biosciences) was used for R2 and R3. Note that R2-3 were performed in parallel and sequenced the same way and are thus technical replicates. The libraries were paired-end sequenced on the NextSeq platform using Reagent v.2. Chromatin extracted from the second embryo collection was also used for ChIP-qPCR experiments.
For both chromatin preparations, the quality of the chromatin was confirmed by control ChIPs with 15 μg of chromatin in 200 μl volume and 2 μl of rabbit anti-H3K27Ac (ThermoFisher). The eluted DNA was analyzed by qPCR confirming enrichment of the RpL32 promoter (F-TTGTTGTGTCCTTCCAGCTTCA and R-TTGTTGTGTCCTTCCAGCTTCA) and lack of enrichment of RpL32 5′ region (F-GGCACGGCGCCAAAATTAATCA and R-CCGATGCCACTGCCTCTTTGGT) [84,85].
ChIP in S2 cells. Chromatin from 10^6 fixed Drosophila S2 cells (approximately 90 μg) was used for each IP, and chromatin was sheared to 100-300 bp using a Covaris sonicator. ChIPs were performed using the MAGnify kit (ThermoFisher). The anti-CENP-A antibody (1 μl) was coupled to 10 μl of beads for 2 h followed by incubation with chromatin overnight at 4˚C. DNA was eluted in 50 μl of elution buffer. Libraries were generated using the TruSeq kit (Illumina) and paired-end sequenced using the Reagent kit v.3 (Illumina) on the NextSeq platform.
ChIP-seq quality control analyses. We estimated read quality of each replicate ChIP-seq experiment using two metrics estimated in phantompeakqualtools [86]: the normalized strand coefficient (NSC) and the relative strand correlation (RSC) (S6 Table). These statistics report the cross correlation between Watson and Crick strands, as ChIP reads from a true positive are expected to be highly clustered and accumulate on either side of the binding site on both strands, with a shift between the peaks on the Watson and Crick strands that is determined by read length and fragment length distribution [87]. This shift should not occur in the input. NSC is the fragment-length cross-correlation peak divided by the background cross correlation and RSC is the fragment-length cross-correlation peak divided by the read-length peak [86].
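As a rough illustration of the two ratios described above, the following sketch computes NSC and RSC from a strand cross-correlation profile. The function name, the dictionary input, and the toy values are hypothetical and are not taken from phantompeakqualtools. One assumption is worth flagging: the text describes RSC as a ratio of the two peaks, whereas the sketch first subtracts the background (minimum) cross-correlation from both peaks, which is how this statistic is commonly defined.

```python
def strand_qc(cc_by_shift, read_len, frag_len):
    """Return (NSC, RSC) from a {strand shift: cross-correlation} profile.

    Hypothetical helper for illustration only; not part of phantompeakqualtools.
    """
    background = min(cc_by_shift.values())   # background cross-correlation
    frag_peak = cc_by_shift[frag_len]        # peak at the fragment length
    read_peak = cc_by_shift[read_len]        # "phantom" peak at the read length
    nsc = frag_peak / background
    # Assumption: background is subtracted from both peaks before taking the ratio.
    rsc = (frag_peak - background) / (read_peak - background)
    return nsc, rsc


# Toy profile (made-up values): shift in bp -> cross-correlation.
profile = {0: 0.20, 75: 0.24, 150: 0.26, 200: 0.30, 400: 0.21}
nsc, rsc = strand_qc(profile, read_len=75, frag_len=200)
```

In this toy profile the fragment-length peak stands well above both the background and the read-length peak, which is the pattern expected for a ChIP sample but not for input.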

Analysis of repeat enrichment in ChIP-seq replicates
To determine the CENP-A enrichment in simple tandem repeats, we summarized repeat composition in the trimmed reads and identified overrepresented kmers using k-seek (https://github.com/weikevinhc/k-seek; [31]). The CENP-A/input ratio is normalized by the number of mapped reads to the genome assembly to remove possible read contamination. We consider a class of repeats to be enriched for CENP-A if the minimum number of kmers in the input is ≥10 in each replicate and the median normalized CENP-A/input ratio is >1 across all four replicate ChIP experiments (S1 Fig). Simple tandem repeats may be overrepresented or underrepresented because of Illumina library preparation and the effects of PCR amplification on sequence library complexity. To determine CENP-A enrichment on complex repeats, we used a mapping approach. We annotated repeats in our assembly [19] using a custom Drosophila-specific consensus repeat library [43] modified from Repbase to include complex satellite DNAs (Repbase version 20150807; [88]; Dryad repository file 1: https://doi.org/10.5061/dryad.rb1bt3j [37]). Using these RepeatMasker annotations, we generated a comprehensive library of all individual repetitive elements in the genome to capture sequence variation among repeats. We mapped ChIP and input reads to this comprehensive repeat library using bowtie2 (default settings) and summarized read counts for each type of complex repeat (e.g., TEs, complex satellite DNAs with repeat units > 100 bp) using custom python scripts. The CENP-A/input ratio is normalized by the number of mapped reads to the genome assembly. We consider a class of repeats to be enriched for CENP-A if it is in the top 20th percentile of normalized CENP-A/input in all four replicate ChIP experiments.
To address whether any motif in G2/Jockey-3 is particularly enriched for CENP-A, we constructed a consensus sequence of G2/Jockey-3 in D. melanogaster and D. simulans. We mapped ChIP and input reads to the comprehensive repeat library containing only one version of G2/Jockey-3 (either D. melanogaster or D. simulans) using bwa (default settings). We then called the depth of reads with samtools depth (v1.7) using "-Q 10" (mapping quality ≥ 10) and calculated the ChIP/input ratio at each site after normalization by the number of mapped reads to the genome assembly.
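The two enrichment criteria for repeat classes can be sketched as follows. This is a minimal illustration, not the custom scripts used in the study: all function names, the class labels (`TE_A` to `TE_E`), and the numbers are hypothetical, and "top 20th percentile" is interpreted here as the top 20% of classes ranked by normalized CENP-A/input within each replicate.

```python
from statistics import median


def simple_repeat_enriched(ratios, input_counts, min_input=10):
    """Simple-repeat criterion: >= min_input kmers in the input of every
    replicate and a median normalized CENP-A/input ratio > 1."""
    return min(input_counts) >= min_input and median(ratios) > 1


def top_fraction(ratios_by_class, fraction=0.20):
    """Classes ranked in the top `fraction` of one replicate (assumed reading
    of 'top 20th percentile')."""
    ranked = sorted(ratios_by_class, key=ratios_by_class.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])


def complex_repeat_enriched(replicates, fraction=0.20):
    """Complex-repeat criterion: in the top fraction in all replicates."""
    return set.intersection(*(top_fraction(r, fraction) for r in replicates))


# Toy data: four replicates, five hypothetical repeat classes.
replicates = [
    {"TE_A": 40.0, "TE_B": 3.0, "TE_C": 1.2, "TE_D": 0.8, "TE_E": 0.5},
    {"TE_A": 35.0, "TE_B": 2.0, "TE_C": 1.5, "TE_D": 0.9, "TE_E": 0.4},
    {"TE_A": 50.0, "TE_B": 4.0, "TE_C": 1.1, "TE_D": 0.7, "TE_E": 0.6},
    {"TE_A": 28.0, "TE_B": 2.5, "TE_C": 1.3, "TE_D": 1.0, "TE_E": 0.3},
]
enriched_complex = complex_repeat_enriched(replicates)
enriched_simple = simple_repeat_enriched([1.5, 2.0, 0.9, 3.1], [12, 30, 11, 25])
```

Requiring a class to pass in every replicate, rather than on average, keeps a single outlier replicate from driving the call.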

ChIP-qPCR
qPCR was performed using SYBR-green (Bio-Rad) on a CFX96 Real-Time System (Bio-Rad). Input or ChIP eluted DNA (1 μl) was used in each qPCR reaction. Melting curves were analyzed to ensure primer specificity. Only primers with reaction efficiencies within a linear dynamic range were used. The fold enrichment of centromeric DNA after immunoprecipitation of CENP-A chromatin compared to its level in the bulk input chromatin was calculated with the equation $100 \times E^{(Ct_{input} - Ct_{IP})}$, where $E$ is the efficiency of the primer set. Enrichment values were normalized by the enrichment value of RpL32 as a noncentromeric control. qPCR primer sets are listed in S7 Table.
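A worked example of this calculation is sketched below. The function names and Ct values are hypothetical, and the sketch assumes that E is expressed as an amplification factor per cycle (2.0 for perfect doubling).

```python
def chip_enrichment(efficiency, ct_input, ct_ip):
    """100 x E^(Ct_input - Ct_IP) for one primer set."""
    return 100 * efficiency ** (ct_input - ct_ip)


def normalized_enrichment(target, control):
    """Target enrichment divided by the enrichment of the noncentromeric control.

    Each argument is a (efficiency, ct_input, ct_ip) tuple.
    """
    return chip_enrichment(*target) / chip_enrichment(*control)


# Made-up values: a centromeric primer set and an RpL32-like control.
centromeric = (2.0, 25.0, 22.0)   # IP amplifies 3 cycles earlier than input
control = (2.0, 24.0, 25.0)       # IP amplifies 1 cycle later than input
fold = normalized_enrichment(centromeric, control)
```

With these toy numbers the centromeric primer set gives 800 and the control gives 50, so the normalized enrichment is 16.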

Transcription of centromeric G2/Jockey-3 elements
Total RNA was extracted from three independent overnight collections of embryos (iso-1). Briefly, embryos were scooped from apple juice plates and rinsed with water in a mesh basket, dechorionated in 50% bleach for 3.5 min with gentle shaking, rinsed thoroughly with water, moved to a 1.5-ml microfuge tube, and resuspended in 300 μl of Trizol reagent (Sigma-Aldrich). Embryos were homogenized using a motorized pestle until the solution became clear (30-40 s). The homogenized solution was centrifuged at 13,000 rpm for 10 min at 4˚C, and the clear supernatant was transferred to a new RNAse-free tube. RNA was isolated using the Direct-Zol RNA miniprep plus kit (Zymo Research) according to the manufacturer's protocol. The RNA was eluted in 30 μl of RNAse-free water and quantified with a Nanodrop. A total of three consecutive Turbo DNase (Invitrogen) treatments, each followed by RNeasy Cleanup (Qiagen), were performed to remove DNA contamination.
Reverse transcription was performed using the iScript Select cDNA Synthesis kit (Bio-Rad) according to the manufacturer's instructions. Briefly, 75 ng of total embryo RNA was used to make cDNA libraries using random priming in a 30 μl reaction. For the no-RT control, the reverse transcriptase was omitted from the reaction. qPCR was performed as described for ChIP-qPCR using 1 μl of cDNA in each reaction and primer sets targeting G2/Jockey-3 copies from each centromere (X-G2, 4-G2, Y-G2, 3-G2, 2-G2). Primers for Actin5C were used as a positive control for a highly expressed gene, whereas primers for the testis-specific gene, Mst84Da, were used as a control for a nonexpressed gene. The no-RT samples produced Ct values comparable to the negative nonexpressed control, showing successful removal of DNA.
Gene expression was analyzed as done by Schmittgen et al. [105] by determining the mean $2^{-\Delta Ct}$, where $\Delta Ct$ is ($Ct_{G2/Jockey-3} - Ct_{Mst84Da}$), from three biological replicates. Primer sets are listed in S7 Table.

IF and FISH
S2 mitotic chromosome preparation. Preparation of mitotic chromosomes from Drosophila S2 cells was performed as described in [82]. Cells (2 × 10^5) were treated with 0.5 μg/mL demecolcine solution (Sigma-Aldrich) and incubated at 25˚C for 1 h to induce a mitotic arrest. Cells were pelleted (600g for 5 min) and resuspended in 250 μL 0.5% (w/v) sodium citrate for 8 min. Cells were loaded into cytofunnels and spun onto Superfrost Plus slides (VWR) at 1,200 rpm for 5 min using a Shandon Cytospin 4 (ThermoFisher). Cells were fixed for 10 min with 3.7% formaldehyde in PBS, 0.1% Triton X-100 (PBS-T). Slides were washed three times in PBS-T for 5 min and stored at 4˚C until ready for use.
D. melanogaster and D. simulans mitotic chromosome preparation. Preparation of mitotic spreads was carried out from iso-1 D. melanogaster flies (Bloomington Drosophila Stock Center stock no. 2057: y[1]; Gr22b[iso-1] Gr22d[iso-1] cn[1] CG33964[iso-1] bw[1] sp[1]; MstProx[iso-1] GstD5[iso-1] Rh6[1]) and D. simulans (w501, gift of Andy Clark) in larvae following the method in [106] with minor modifications. Third instar larval brains from male larvae were dissected in PBS and immersed in 0.5% (w/v) sodium citrate for 8 min. Individual brains were fixed for 6 min in 6 μL of 45% acetic acid, 2% formalin (Sigma-Aldrich) on siliconized coverslips. Whole brains were applied to clean poly-L-lysine slides (ThermoFisher) and were manually squashed between coverslip and slide by pressing with the thumb. Slides were immersed in liquid nitrogen. Once bubbling stopped, the slides were removed from liquid nitrogen and the coverslip was immediately removed using a razor blade.
Slides were immediately immersed in PBS and were either washed for 5 min before proceeding to IF or stored at 4˚C in PBS until ready for use.
IF staining. For IF, slides were washed in PBS-T for 5 min. S2 cell slides were blocked in 5% milk in PBS-T for 30 min. Larval squashes were blocked in 1% BSA, PBS, 0.02% sodium azide for 30 min. Primary antibodies anti-CENP-A (larval brain slides: rabbit, 1:500, Active Motif; S2 cell slides: chicken, 1:1,000, [30]) and anti-CENP-C (larval brain slides: guinea pig, 1:500 [34]) were diluted in blocking solution and incubated on slides overnight at 4˚C. Slides were washed three times for 5 min in PBS-T and incubated with secondary antibodies (Life Technologies Alexa-488, 546, or 647 conjugated, 1:500) diluted in blocking solution and incubated at room temperature for 1 h or overnight at 4˚C. Slides were washed three times for 5 min in PBS-T.
Satellite FISH. Satellite FISH was performed following the protocol described in [107] with a few modifications. Slides were postfixed in 3.7% formaldehyde and PBS for 10 min, followed by a rinse in PBS and two 5-min washes in 2xSSC, 0.1% Tween-20 (2xSSC-T). Slides were washed once for 5 min in 50% formamide (Sigma-Aldrich), 2xSSC-T at room temperature, once for 20 min in 50% formamide (Sigma-Aldrich), 2xSSC-T at 60˚C, and then cooled to room temperature. For FISH, 25 μL of hybridization mix containing 40 pmol of each probe (S11 and S12 Tables), 2xSSC-T, 10% dextran sulfate (Merck), 50% formamide, and 1 μL of RNase Cocktail (ThermoFisher) was applied to a 22 × 22-mm hybrislip (Electron Microscopy Sciences), mounted on the slide and sealed with paper cement. Slides were denatured at 92˚C for 2.5 min and then incubated overnight at 37˚C. Slides were washed in 2xSSC-T at 60˚C for 20 min, followed by two 5-min washes in 2xSSC-T at room temperature and one 5-min wash in PBS. Slides were mounted in Slowfade Gold Reagent (Invitrogen) containing 1 μg/mL DAPI and sealed with nail polish.
Oligopaint FISH. Oligopaint FISH was performed as described above with the following modifications. Hybridization mix (25 μL) containing 10 pmol of Oligopaint, 2xSSC-T, 10% dextran sulfate (Merck), 60%-68% formamide (Sigma-Aldrich), and 1 μL RNase cocktail (ThermoFisher) was applied to a 22 × 22-mm hybrislip (Electron Microscopy Sciences), mounted on the slide, and sealed with paper cement. Slides were denatured at 92˚C for 2.5 min in a thermocycler (Eppendorf) and incubated overnight at either 37˚C or 42˚C (see S10 Table for the percent of formamide and hybridization temperatures used). For fluorescence detection, 10 pmol of Alexa-488-labeled secondary oligos (see S13 Table) were applied either during the overnight hybridization or following posthybridization washes, in which 25 μL of 2xSSC, 30% formamide, 10 pmol of probe was applied to each slide and incubated at room temperature for 30 min. Slides were washed twice in 2xSSC, 40% formamide for 20 min, once in 2xSSC-T for 15 min, and once in PBS for 5 min. Slides were mounted as described above, and successful hybridization was checked under fluorescence microscope. Satellite probes were added after imaging by removing the coverslip with a razor blade; washing slides three times in 2xSSC-T for 5 min; applying 25 μL of 2xSSC, 30% formamide, 40 pmol of satellite probe to each slide; and incubating at 37˚C for 1 h. Slides were washed once in 2xSSC-T at 60˚C for 20 min, twice in 2xSSC-T for 15 min, and once in PBS for 5 min and mounted as described above.
G2/Jockey-3 FISH. FISH for G2/Jockey-3 was performed as described by Dimitri et al. [108]. Slides were dehydrated in an ethanol row (successive 3-min washes in 70%, 90%, and 100% ethanol) and allowed to air-dry completely. Probe mix (20 μL) containing 2xSSC, 50% formamide (Sigma-Aldrich), 10% dextran sulfate (Merck), 1 μL RNase cocktail (ThermoFisher), and 100 ng of DIG-labeled G2 probe was boiled at 80˚C for 8 min, incubated on ice for 5 min, and then applied to slides, covered with a glass coverslip, and sealed with paper cement. Sealed slides were denatured on a slide thermocycler for 5 min at 95˚C and incubated at 37˚C overnight. Slides were then washed three times for 5 min in 2xSSC, 50% formamide at 42˚C, three times for 5 min in 0.1xSSC at 60˚C, and then blocked in block buffer (1% BSA, 4xSSC, 0.1% Tween-20) at 37˚C for 45 min. Slides were incubated with 50 μL of block buffer containing a fluorescein-labeled anti-DIG antibody (sheep, 1:100, Roche) for 60 min at 37˚C. Slides were then washed three times for 5 min in 4xSSC, 0.1% Tween-20 at 42˚C, and mounted as described above.
Preparation of extended chromatin fibers and IF-FISH. Extended chromatin fibers were prepared as described by Sullivan [42], with a few modifications. Three to four brains from third instar iso-1 wandering larvae (females were selected to avoid cross-centromere hybridization of our X Maupiti and 4 Lampedusa Oligopaints with CenY, whereas males were used for Y Lipari) were dissected in 0.7% NaCl and dissociated in 250 μl 0.5% (w/v) sodium citrate containing 40 μg collagenase/dispase (Sigma-Aldrich) by incubating at 37˚C for 10 min. This mixture was briefly vortexed, spun, and loaded into a single-chamber Shandon cytofunnel for centrifugation in a Shandon Cytospin 4 at 1,200 rpm for 5 min onto a clean polysine slide (ThermoFisher). After centrifugation, the slides were immediately immersed in a glass coplin jar containing lysis buffer (500 mM NaCl, 250 mM urea, 25 mM Tris-HCl [pH 7.4], 1% Triton X-100) for 13-15 min, following which the slides were gently removed at a steady speed of about 25-30 s per slide. Fibers were fixed in 4% formaldehyde solution and washed in PBS for 5 min. After washing, the slides were processed for IF-FISH.
Fibers were extracted in PBS-T for 10 min then incubated in a 1.5% BSA, PBS blocking solution for 30 min. Slides were incubated with an anti-CENP-A antibody (rabbit, 1:100, Active Motif) diluted in blocking solution overnight at 4˚C in a humidified chamber. Slides were washed three times in PBS for 5 min and then incubated for 45 min with secondary antibodies (Cy5-conjugated donkey anti-rabbit, 1:500, Life Technologies) diluted in blocking buffer at room temperature, followed by three 5-min washes in PBS. Slides were postfixed in 3.7% formaldehyde, PBS for 10 min followed by one quick rinse and two 5-min washes in PBS. FISH was performed as described for 3D-FISH [107] with a few modifications. Slides were washed twice in 2xSSC-T at room temperature for 5 min, followed by denaturation in 50% formamide, 2xSSC-T at room temperature for 5 min, transferred to 60˚C for 20 min, and then cooled to room temperature. Primary Oligopaint probes (10 pmol, except X Maupiti, which was 25 pmol) and 40 pmol of satellite DNA probes were each added to the slides in 25 μL of hybridization solution (2xSSC-T, 60% or 68% [v/v] formamide [see S10 Table], 1 μL RNase cocktail [ThermoFisher], 10% dextran sulfate [Merck]) and sealed with a 22 × 22-mm hybrislip (Electron Microscopy Sciences) using rubber cement. Slides were then denatured at 92˚C for 3 min on a slide thermocycler and allowed to hybridize overnight at 37˚C or 42˚C (see S10 Table) in a humidified chamber. Slides were washed once in 2xSSC-T for 15 min at 60˚C, once in 2xSSC-T for 10 min at room temperature, and once in 0.2xSSC for 10 min at room temperature. Following the washes, 25 μL of hybridization mix containing 2xSSC, 30% formamide, 40 pmol fluor-labeled secondary oligo probes (see S13 Table) was added on to each slide and incubated for 45 min at room temperature, except for the X Maupiti slides, in which the satellite probe was also added with the secondary Oligopaint probe.
The slides were then washed once in 2xSSC-T at 60˚C for 15 min, followed by one wash in 2xSSC-T and 0.2xSSC for 10 min at room temperature and mounted as described above.
For FISH with only satellite probes, posthybridization washes consisted of one wash in 2xSSC-T for 20 min at 60˚C, followed by one wash with 2xSSC-T at room temperature for 10 min and two 5-min washes in 0.2xSSC at room temperature. Slides were mounted as described above.
For fiber measurement calibration, FISH using the 61C7 and 80C4 probes was performed using the conditions for Oligopaint FISH (see S10 Table for percent of formamide and hybridization temperatures used), whereas FISH using the Rsp probe was performed using the satellite FISH protocol.
Microscopy and image analysis. Image acquisition was done at 25˚C using an Inverted Deltavision RT restoration Imaging System (GE) equipped with a Cool Snap HQ 2 camera (Photometrics) and 100×/1.40 NA oil immersion lens (Olympus). Image acquisition and processing was performed using softWoRx software (GE). For mitotic chromosomes, 20 z-stacks were taken per image at 0.2 μm per slice. For fibers, 12-15 z-stacks were taken per image at 0.15 μm per slice. Images were deconvolved using the conservative method for 5 cycles. Maximum intensity projections were made using 3-5 z-stacks. Images were saved as Photoshop files and were scaled using Adobe Photoshop. Figure assembly was done using Adobe Illustrator.
Maximum intensity projections of individual fibers were analyzed to measure the length of each signal on the fibers using the "measure distances" tool in softWoRx (GE). Three calibration probes of known length (100 kb; see S10 Table for 80C4 and 61C7 Oligopaints; see Dryad repository file 11: https://doi.org/10.5061/dryad.rb1bt3j [37] for Rsp probe) were used to determine the degree of stretching in our experiments. At least 20 fibers for each probe were measured in all cases. Length measurements were visualized by scatter plot using Prism. These lengths were then used to determine the average stretching in kb/μm, and Student's t test was used for statistical analyses.
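The calibration arithmetic can be sketched as follows. The function names and fiber lengths are hypothetical, only three toy measurements are shown rather than the 20 or more fibers measured per probe, and the sketch assumes the average stretching is the known probe length divided by the mean measured length.

```python
from statistics import mean


def stretch_kb_per_um(probe_kb, measured_um):
    """Average stretching (kb per micrometer) for one calibration probe."""
    return probe_kb / mean(measured_um)


def um_to_kb(signal_um, kb_per_um):
    """Convert a measured fiber signal length to an estimated size in kb."""
    return signal_um * kb_per_um


# Toy measurements (micrometers) for a 100-kb calibration probe.
calibration_um = [9.0, 10.0, 11.0]
factor = stretch_kb_per_um(100.0, calibration_um)   # kb per micrometer
signal_kb = um_to_kb(8.5, factor)                   # size of an 8.5-um signal
```

An alternative would be to average the per-fiber kb/μm ratios instead; the two estimates diverge when fiber lengths are highly variable.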
We noticed that the variation in the measurements for the island Oligopaints was greater than what we observed for the probes used for calibration. We attribute this higher variation to the lower density of island Oligopaint probes (some of the island sequences that are shared among centromeres were not targeted by probes to increase specificity), which causes the signal to be weaker and less consistent than in standard Oligopaint FISH. It is also important to note that we analyzed fibers from a mixed population at different stages of the cell cycle, which could display differences in CENP-A signal. It is also possible that the stretching of the chromatin at the centromere is more variable than at noncentromeric regions.
Oligopaints design. Oligopaint libraries were designed using the OligoMiner pipeline [38,109] with some variations. The genomic regions that showed significant enrichments of CENP-A via MACS and enriched ChIPtigs were targeted for Oligopaint design. The blockparse.py script (v1.3) using overlap mode was used to identify as many candidate probes as possible, with genome-targeting regions 35-41 bp long and a desired melting temperature of 42-47˚C. Unlike standard Oligopaints design, the candidate probes were not aligned to the genome using Bowtie2 [110] or filtered with OutputClean.py, so that probes that align multiple times would not be discarded. Candidate probes with partial alignments of 18 bp-long kmers were filtered out using kmerfilter.py (v1.3) and Jellyfish [111], excluding any that matched 6 or more times to the genome. Probes were filtered further for least secondary structures using StructureCheck.py (v1.3) and NUPACK [112]. Finally, coverage and density of probes across the regions of interest and presence of densely clustered off-target alignments were manually checked by Bowtie2 alignment, filtering for different levels of mismatch to assess the effects of hybridization stringency.
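The k-mer filtering step can be illustrated with a toy counter. This is not the kmerfilter.py or Jellyfish implementation: the function names are hypothetical, the example uses k = 4 on a made-up sequence instead of 18-mers on a genome, and reverse-complement matches are ignored for simplicity.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def passes_kmer_filter(probe, genome_counts, k=18, max_hits=5):
    """Keep a probe only if none of its k-mers occurs 6 or more times
    (that is, more than max_hits times) in the genome."""
    return all(
        genome_counts[probe[i:i + k]] <= max_hits
        for i in range(len(probe) - k + 1)
    )


# Toy genome: a 4-mer repeated six times followed by unique sequence.
toy_genome = "ACGT" * 6 + "TTGACCA"
counts = count_kmers(toy_genome, k=4)

unique_probe_ok = passes_kmer_filter("TGACCA", counts, k=4)      # unique k-mers
repetitive_probe_ok = passes_kmer_filter("ACGTAC", counts, k=4)  # hits the repeat
```

A threshold above one hit tolerates probes that align a few times, which fits the stated aim of not discarding probes that map to more than one location.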
For the design of control regions for length standards in chromatin fiber stretching measurements, in loci 80C4 (3L: 23,047,118..23,147,118) and 61C7 (3L: 626,646..726,646), we used conventional Oligopaint design for nonrepetitive genomic regions. The blockparse.py script (v1.3) was used to identify candidate probes, with genome-targeting regions 35-41 bp long and a desired melting temperature of 42-47˚C. Candidate probes were then aligned to the dm6 reference genome (with NNN masking of repetitive regions) using Bowtie2 and its output filtered using outputClean.py (v1.5.4) to keep only those probes that are predicted to thermodynamically only hybridize on target under the specific conditions used. Finally, candidate probes were then further analyzed through kmerFilter.py (v1.3) to reject any probes containing regions of microhomology to off-target sites and through StructureCheck.py (v1.3) to exclude any probes forming restrictive secondary structures.
Each oligo included universal primers at the 5′ and 3′ ends for PCR amplification and a library-specific barcode for both PCR amplification and FISH detection of each individual centromere set. One library per centromere was synthesized as a single chip by Custom Array.
Library amplification. Raw Oligopaint libraries were amplified in 100-μl reactions containing 10 μl KAPA Buffer A and 1 μl KAPA Taq from the KAPA Taq PCR Kit (Fisher Scientific), 1 μl of library, 0.4 mM dNTPs (Roche), and 2 μM of each universal primer (S14 Table) and amplified using the following cycles: 95˚C for 5 min; 25 cycles of 95˚C for 30 s, 58˚C for 30 s, and 72˚C for 15 s; and a final extension at 72˚C for 5 min. Reactions were purified using DNA Clean & Concentrator-5 (Zymo Research) using the manufacturer's protocol.

Hi-C analysis
We used a publicly available Hi-C dataset from embryos (Gene Expression Omnibus accession number GSE103625) to provide additional support for our candidate centromeric contigs [41]. We mapped Hi-C sequence reads to our assembly and processed the output with the HiC-Pro pipeline [113] to obtain informative valid interaction pairs (default settings). We used a customized python script to count interactions between regions of interest and then normalized to the size of the regions (per 100 kb). To count interactions between different-sized windows, we used BEDTools [114] to create windows of specified sizes across the assembly. We established the euchromatin-heterochromatin boundaries in our assembly based on previous studies. For chromosome 2, 3, X, and Y, we transferred the euchromatin-heterochromatin boundary coordinates previously reported for D. melanogaster [115] to our assembly. For chromosome 4, we assigned the approximately 70 kb closest to the centromere in the assembled chromosome 4 as heterochromatin based on what was previously reported [116] and the rest of it as euchromatin. We then binned the genome into different regions based on their sequence content: centromere, proximal heterochromatin, distal heterochromatin, and euchromatin (S16 Table). We then classified interactions between centromeric contigs and the different categories based on their genomic region (e.g., centromere to proximal heterochromatin, centromere to distal heterochromatin, etc.). We reported the median count for each category and conducted data visualization and statistics in R.
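The counting and size normalization can be sketched as below. This is not the customized script used in the study: the function names, contig names, coordinates, and read pairs are hypothetical, and the sketch assumes that counts are normalized by the length of the region being compared against the centromere.

```python
def in_region(contig, pos, region):
    """True if (contig, pos) falls inside a (contig, start, end) region."""
    r_contig, start, end = region
    return contig == r_contig and start <= pos < end


def interactions_per_100kb(pairs, centromere, target):
    """Count valid pairs linking the centromere region to the target region,
    normalized per 100 kb of target (assumed normalization)."""
    hits = 0
    for c1, p1, c2, p2 in pairs:
        forward = in_region(c1, p1, centromere) and in_region(c2, p2, target)
        reverse = in_region(c2, p2, centromere) and in_region(c1, p1, target)
        if forward or reverse:
            hits += 1
    target_len = target[2] - target[1]
    return hits / (target_len / 100_000)


# Toy regions and valid pairs (contig1, pos1, contig2, pos2).
centromere = ("ctgA", 0, 50_000)
target = ("ctgB", 0, 200_000)
pairs = [
    ("ctgA", 1_000, "ctgB", 150_000),
    ("ctgB", 20_000, "ctgA", 49_999),
    ("ctgA", 30_000, "ctgB", 5_000),
    ("ctgA", 30_000, "ctgC", 5_000),   # links to a different contig, not counted
]
density = interactions_per_100kb(pairs, centromere, target)
```

Normalizing by region length keeps large regions, such as euchromatic arms, from appearing to interact more with centromeres simply because they are longer.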
We calculated the significance between different categories using a Kruskal-Wallis test by ranks with Dunn's test for post hoc analysis and the pairwise Wilcoxon rank sum test with false discovery rate (FDR) correction [117] of type I error rates for multiple comparisons. We deemed a result to be significant only if both tests agree.

G2/Jockey-3 activity
We investigated whether G2/Jockey-3 non-LTR retroelements show evidence of recent activity based on insertion polymorphism and expression. We examined RNA-seq reads from testes for evidence of G2/Jockey-3 expression because of the enrichment of these elements on the Y chromosome. We mapped poly-A [121] and total RNA [122] (S6 Table) transcriptome data to our repeat library using HISAT 2.1.0 [123] and estimated the read depth of uniquely mapped reads using samtools (depth -Q 10; v1.7 [124]).
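As a rough illustration of how per-base output from samtools depth could be summarized for each repeat consensus, consider the sketch below. It is not the procedure used in this study, which does not specify how depth was summarized. It assumes the three-column, tab-separated output for a single alignment file (reference name, position, depth), and that positions with zero coverage are absent from the output, as is the case when samtools depth is run without its -a option. The function name and the element names are hypothetical.

```python
from collections import defaultdict

def mean_depth_per_reference(depth_lines, reference_lengths):
    """Mean depth per reference from `samtools depth`-style text lines.

    Positions missing from the input are treated as zero depth, so the
    summed depth is divided by the full reference length.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, _pos, depth = line.split("\t")[:3]
        totals[name] += int(depth)
    return {
        name: totals.get(name, 0) / length
        for name, length in reference_lengths.items()
    }

# Hypothetical depth records and consensus lengths.
lines = ["elemA\t1\t4", "elemA\t2\t6", "elemB\t5\t10"]
lengths = {"elemA": 4, "elemB": 10, "elemC": 5}
mean_depth = mean_depth_per_reference(lines, lengths)
```

Comparing libraries of different sizes would additionally require scaling by the number of mapped reads in each dataset.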

S1 Fig. Enrichment of simple tandem repeats in CENP-A ChIP-seq across four replicates.
Plot of normalized CENP-A/input for simple tandem repeats for each ChIP-seq replicate, sorted by median (red lines). Shown are only the simple tandem repeats with median CENP-A/input > 1 in all four CENP-A ChIP replicates (see details in S1 Table). The simple tandem repeats with fewer than 10 counts of input reads in any one replicate are not shown.

S4 Table). For the four replicate ChIP experiment in our OreR embryos, we examined the reproducibility of our experiments by first applying the IDR test and only keeping peaks with IDR ≤ 0.05. The number of these peaks is plotted below the diagonal. Between replicates 2 and 3, we found a total of 16,870 overlapping peaks, but 16,833 were weakly enriched relative to the overlapping peaks between other datasets because they are technical repeats with a shared library bias (Accel, see Materials and methods). We therefore only report the 37 strongest peaks (the average peak number of other comparisons between replicates). The IDR dataset comparisons are in S5 Table. We show the correlation between the CENP-A ChIP replicates above the diagonal. Plotted are the signal strength after IDR tests (normalized ChIP over input ratio from 1 to 1,000 on a log10 scale) with Spearman's rho. The five contigs with the most consistent peaks within and among replicates correspond to the five centromeric candidates. (B) Plot of ChIP-seq data from S2 cells (this paper, [16,82]) and an independent embryo CID-GFP (i.e., CENP-A-GFP) ChIP-seq dataset (see details in S4 Table; [16]; "5m" and "15m" represent different MNase treatments). The centromeric contigs are also CENP-A enriched in these independent datasets, with the exception of the X chromosome centromere contig. S2 cells lack a Y and are therefore not expected to have peaks on the Y candidate centromere contig.
CENP-A, centromere protein A; ChIP, chromatin immunoprecipitation; ChIP-seq, ChIP sequencing; CID, centromere identifier; GFP, green fluorescent protein; IDR, irreproducible discovery rate; OreR, Oregon-R; S2, Schneider 2.

Locations of contig-specific qPCR primer binding sites are shown by magenta arrows. (B) Graph showing our ChIP-qPCR results using these primers. The enrichment is calculated relative to the input and is normalized by the RpL32 promoter region as a noncentromeric control. (C) Graph showing our ChIP-qPCR results using primers targeting other regions that showed CENP-A enrichment but that were not in our contigs. Again, the enrichment is calculated relative to the input and is normalized by the RpL32 promoter as a noncentromeric control. We did not observe a robust CENP-A enrichment at these sites. The underlying data can be found in S2 Data. CENP-A, centromere protein A; ChIP, chromatin immunoprecipitation; qPCR, quantitative PCR.

Shown is the plot of the normalized read depth from uniquely mapped reads (mapping quality ≥ 10) across the G2/Jockey-3 consensus element obtained from mapping total and poly-A RNA-seq data from testes [121,122] to our repeat library. (B) Quantitative RT-PCR analysis of total RNA extracted from three independent overnight embryo collections. Expression levels were compared to the negative control gene Mst84Da (testis-specific). The G2/Jockey-3 copies surveyed on centromere ("Cen") X, 4, and 3 but not Y and 2 show low levels of transcription compared to the housekeeping gene Actin. Although the primers (S7 Table) are specific for each centromere, the primer sets could amplify G2/Jockey-3 copies not included in our assembly. Error bars = SD. The underlying data for this figure can be found in S2 Data. Mst84Da, Male-specific RNA 84Da; RNA-seq, RNA sequencing; RT-PCR, reverse-transcription PCR.

[37]).
Node support is only shown for key nodes in the tree (complete tree is in the Dryad repository file 14: https://doi.org/10.5061/dryad.rb1bt3j [37]). All centromeric IGS sequences appear to have a single origin: they duplicated from sex-linked IGS interspersed at the rDNA loci at some time near the divergence of the simulans clade and D. melanogaster. IGS repeats in blue (extra) are similar to the IGS at 3 Giglio but are on small contigs, tig00022795 and id = 102159_0. Contig tig00022795 is also moderately enriched in CENP-A. CENP-A,

Table. CENP-A, centromere protein A; FISH, fluorescence in situ hybridization; IF, immunofluorescence; S2, Schneider 2. (TIF)

S14 Fig. Quantification of interactions between centromeres and different genomic regions by Hi-C.
Plots showing intra- and interchromosomal interactions between regions in Hi-C data from: (A) stage 16 embryos (end of embryogenesis) and (B) embryonic cycles 1-8 (before zygotic genome activation; data from [41]). The different colors indicate interactions with individual centromeres of all chromosomes. Centromere-centromere interactions are significantly more frequent than interactions between centromeres and distal heterochromatin, interdistal heterochromatin, and euchromatin and marginally more significant than centromere-interproximal heterochromatin interactions. **** adjusted P < 0.0001; * adjusted P < 0.02, pairwise Wilcoxon rank sum test with FDR correction; Kruskal-Wallis test by ranks with Dunn's test for post hoc analysis. The underlying data for this figure can be found in S2 Data. FDR, false discovery rate.

DAPI is shown in gray. We did not include satellite FISH because no centromeric satellites are known for the Y. Note that the Oligopaints only target part of Lipari (see Fig 5). CENP-A is observed occupying sequences beyond the Oligopaint region, likely over the remaining part of the island. Arrows show the region of the fiber that was measured.

Table. Enrichment of simple tandem repeats in kseek analyses. We used kseek [125] to estimate read counts for each kmer and normalized these read counts using the total mapped reads for each dataset (ChIP and input). We identified CENP-A-enriched kmers using the ratio of normalized counts for each ChIP experiment and its corresponding input. The enriched kmers reflect simple tandem repeats enriched in CENP-A discussed in the main text and S1 Fig. Fig 1 summarizes kmers with satellite repeats associated with centromeres. CENP-A, centromere protein A; ChIP, chromatin immunoprecipitation. (XLSX)

S2 Table. Raw and normalized counts of reads mapped to the complex repeats. Rows correspond to complex repeat families (TEs and complex satellites), with the counts per family in the ChIP and input reads from every dataset.
We calculated enrichment for each repeat type by normalizing by total mapped reads for each dataset and taking the ratio of normalized values for each ChIP and its corresponding input. ChIP, chromatin immunoprecipitation; TE, transposable element. (XLSX)

S3 Table. ChIPtigs with peaks from MACS. We mapped all ChIP-seq data to the de novo assembled ChIPtigs and called peaks using MACS with high-quality reads (mapping quality ≥ 30 and masked PCR duplicates). We also mapped ChIPtigs to the genome to determine their genomic locations and assigned repeat IDs based on BLAST results. (XLSX)

S4 Table. Peaks called by mapping to the genome assembly and MACS. We mapped the ChIP and input reads to our genome assembly and used the high-quality reads (mapping quality ≥ 30 and masked PCR duplicates) to call ChIP peaks with MACS. We show the peak locations for each dataset. ChIP, chromatin immunoprecipitation. (XLSX)

S5 Table. IDR tests between different replicates from OreR ChIP-seq. We used IDR to compare MACS peaks from different ChIP-seq replicates. We show the statistics for shared peaks from each comparison. ChIP-seq, chromatin immunoprecipitation sequencing; IDR, irreproducible discovery rate; OreR, Oregon-R. (XLSX)

S6 Table. Summary of all sequencing datasets used in this study. We list reads and mapping summaries of all Illumina and long-read datasets generated in this paper or downloaded from NCBI's SRA. (XLSX)

S7 Table. List of qPCR primers. List of primers used for qPCR in this study. The centromere contig that each target is associated with (X, 4, Y, 3, and 2) is designated in the "Centromere" column. Note that in silico PCR for the 3_G2 primers predicted three specific products from centromere 3 as well as two products on contig tig00022795 and additional nonspecific products from the X chromosome when three or more mismatches are allowed, all of the same 145-bp size. qPCR, quantitative PCR. (XLSX)

S8 Table.
Noncentromeric overlapping peaks from MACS in the OreR embryo ChIP replicates. We listed peaks outside canonical centromeres with any agreement between replicate ChIP experiments (IDR ≤ 0.05). We also report any genes or repeat annotations that overlap the MACS peaks. Note that there is no general enrichment in G2/Jockey-3 outside of the centromeric islands. ChIP, chromatin immunoprecipitation; IDR, irreproducible discovery rate; OreR, Oregon-R. (XLSX)

S9 Table. Statistical analysis of TE distributions. We show the copy numbers of TEs in different genomic regions. The total assembled base pairs in centromeres (432,440 bp), pericentromeric heterochromatin (37,089,066 bp), and other regions (118,457,213 bp) were used to compute the distribution statistics of TEs. We created a 2-by-2 contingency table for each TE comparing observed to expected (based on the sum of bp) for each comparison: centromere to heterochromatin ("cen-het") regions or centromeres to whole genome ("cen-genome"). We computed a Fisher's exact test with FDR correction to get adjusted P values. G2/Jockey-3, G, Doc2, and Jockey-1 are significantly enriched in centromeres relative to other heterochromatic regions and to the whole genome. Asterisks indicate that TART and ProtoP are significantly underrepresented in centromeres relative to other heterochromatic regions. FDR, false discovery rate; FISH, fluorescence in situ hybridization; TART, Telomere-associated retrotransposon; TE, transposable element. (XLSX)

S10 Table. Oligopaint hybridization conditions. Hybridization conditions used for FISH with specific Oligopaints. FISH, fluorescence in situ hybridization. (XLSX)

S11 Table. Labeled satellite probes. Information on the fluors used and sequences of satellite FISH probes used in this report. * = "+N" designates the incorporation of an LNA. FISH, fluorescence in situ hybridization; LNA, locked nucleic acid. (XLSX)

S12 Table. Unlabeled satellite probes.
Information on the 5′ secondary oligo adapter site and sequence of satellite probes used in this report. (XLSX)

S13 Table. Secondary Oligo probes. Sequence and fluors of secondary oligo probes used for fluorescence detection of Oligopaints and unlabeled satellite probes. (XLSX)

S14 Table. Universal primers. List of primer sets used for library amplification and G2 probe synthesis. (XLSX)

S15 Table. Sublibrary-specific primers. List of primer sets used for sublibrary amplification and Oligopaint synthesis. (XLSX)

S16 Table. Chromatin status assignments for contigs. We assigned contigs from the assembly to a chromosome and a chromatin status (heterochromatin/euchromatin, etc., based on [115,116]; see Materials and methods). Blank cells indicate that a region could not be assigned. (XLSX)

S17 Table. Overlap between normal and CENP-A overexpression S2 cells. We compared the MACS peaks shared between "normal" S2 (this study) and S2 with CENP-A overexpression using the IDR test. Some noncentromeric regions should have more CENP-A enrichment after CENP-A overexpression; however, only four peaks have IDR ≤ 0.05. None of these peaks have G2/Jockey-3. CENP-A, centromere protein A; IDR, irreproducible discovery rate; S2, Schneider 2. (XLSX)

S18 Table. S2 cell FISH quantification. Percentage of probe signals that overlap with different cytological locations ("C": centromere; "P": pericentromere; "H": heterochromatin, and "N": number of spreads analyzed) in S2 cells. The underlying data can be found in S2 Data. FISH, fluorescence in situ hybridization; S2, Schneider 2. (XLSX)

S19 Table. S2 cell satellite locations. Summary of the locations of satellite repeats determined by IF-FISH on S2 cell chromosomes X, X;4, 2, cf(2R), cf(2L), 3, 4, and 4 s , using an anti-CENP-A antibody to mark the centromere. Locations were designated as centromeric ("Cen"), pericentric ("Peri"), or heterochromatic ("Het"). See also S18 Table. 4

S3 Data.
Oligopaints sequences and information for centromeres X, 3, 4, and Y. The columns indicate the centromere contig ID, start and end coordinates of sequence, followed by the oligo sequence, and the melting temperature (all.oligos.cen.islands). Included are also the same Oligopaint sequences with 5' and 3' extensions containing the universal primer followed by library-specific barcodes (oligos.with.adaptors).