The conserved regulatory basis of mRNA contributions to the early Drosophila embryo differs between the maternal and zygotic genomes

The gene products that drive early development are critical for setting up developmental trajectories in all animals. The earliest stages of development are fueled by maternally provided mRNAs until the zygote can take over transcription of its own genome. In early development, both maternally deposited and zygotically transcribed gene products have been well characterized in model systems. Previously, we demonstrated that across the genus Drosophila, maternal and zygotic mRNAs are largely conserved but also showed a surprising amount of change across species, with more differences evolving at the zygotic stage than the maternal stage. In this study, we use comparative methods to elucidate the regulatory mechanisms underlying maternal deposition and zygotic transcription across species. Through motif analysis, we discovered considerable conservation of regulatory mechanisms associated with maternal transcription, as compared to zygotic transcription. We also found that the regulatory mechanisms active in the maternal and zygotic genomes are quite different. For maternally deposited genes, we uncovered many signals that are consistent with transcriptional regulation at the level of chromatin state through factors enriched in the ovary, rather than precisely controlled gene-specific factors. For genes expressed only by the zygotic genome, we found evidence for previously identified regulators such as Zelda and GAGA-factor, with multiple analyses pointing toward gene-specific regulation. The observed mechanisms of regulation are consistent with what is known about regulation in these two genomes: during oogenesis, the maternal genome is optimized to quickly produce a large volume of transcripts to provide to the oocyte; after zygotic genome activation, mechanisms are employed to activate transcription of specific genes in a spatiotemporally precise manner. Thus the genetic architecture of the maternal and zygotic genomes, and the specific requirements for the transcripts present at each stage of embryogenesis, determine the regulatory mechanisms responsible for transcripts present at these stages.


40
Development is a sequential process, where each step builds on the one before it. The earliest stages of 41 embryonic development are therefore critical, as processes such as cleavage cycles and the beginnings 42 of axial patterning become the basis for all subsequent developmental processes. Regulation of these 43 important tasks is controlled by mRNAs and proteins, and perhaps unsurprisingly then, mRNA levels in 44 Drosophila are found to be precisely controlled during early embryogenesis [1,2]. This precise control of 45 for each species (see Methods). For each gene, we extracted sequences at likely locations for proximal 115 regulatory elements (see Methods). To accommodate the varying annotation quality of the various 116 species, this search encompassed introns, exons, and a 2kb region upstream of the gene. 117 To identify motifs associated with maternally deposited genes, we employed HOMER [29]. For most 118 species, a characteristic pattern emerged where the most enriched motifs were present in the upstream 119 region of the maternally deposited genes, with less enriched motifs appearing in exons (Fig S1). Some 120 motifs, possibly representing repressor binding sites, were enriched in the upstream and intron region 121 of genes that were not maternally deposited as compared to the genes that were maternally deposited 122 ( Fig S1). 123 Analyzing regulatory elements at the post-zygotic genome activation stage (stage 5) presents a 124 challenge, as it is difficult to distinguish newly transcribed zygotic mRNAs from residual maternally 125 deposited mRNAs. At this stage, roughly half of the transcripts present are maternal transcripts that 126 have not yet been degraded [8,[30][31][32]. Therefore, to interrogate regulatory elements associated with 127 zygotic transcription, we restricted our search to genes that do not have transcripts present at stage 2 128 but do have transcripts present by stage 5. Because of these stricter requirements for zygotically 129 transcribed genes, there were far fewer genes in the dataset (66,206 genes in the stage 2 dataset 130 combined from all species, compared to 10,215 total genes in the stage 5 dataset for all species), 131 resulting in a reduction in statistical power. However, without these assumptions, we risk failing to 132 identify signals associated specifically with zygotic transcription amongst the signal of maternal 133 transcription. 134 To determine which proteins are likely to bind to maternal or zygotic motifs, we used Tomtom[33] to 135 evaluate the similarity of the discovered motifs to several motif databases (Table 1) for D. 136 melanogaster. The motifs found in maternally deposited transcripts are similar to those discovered 137 previously in two different contexts: those associated with topologically associated domains (TADs) [34], 138 and those associated with housekeeping promoters [35,36]. This is consistent with existing data showing 139 that functions of maternally deposited genes are enriched for genes with housekeeping 140 activities [36,37]. In order to determine whether the motifs associated with maternal transcripts in our 141 data were simply due to the inclusion of promoter elements from housekeeping genes, we measured 142 the enrichment of these motifs in maternally deposited genes that are not housekeeping genes (see 143 Methods). We found that our motifs are strongly enriched (p < 1e-34) in maternally deposited genes 144 even when excluding housekeeping genes (S6 Figure A). This indicates that these motifs are having a 145 strong effect outside that of those contained in housekeeping genes during this stage. Thus, we 146 hypothesize that the regulatory mechanisms responsible for generating TADs[34] are also responsible 147 for maternal transcripts, and that maternal transcription may be regulated by the establishment of 148 TADs. TADs are genomic regions where the chromatin on one side of the boundary interacts 149 substantially less than expected with the chromatin on the other side, and interactions of DNA elements 150 within the domains can be promoted. While TADs are generally thought to be associated with 151 transcription [34], there is some controversy as to the nature and magnitude of the effect of TADs on 152 gene expression [38], as disruption of TADs has not been found to be sufficient to alter transcription in 153 some cases. 154 The motifs associated with maternally deposited genes are predicted to bind several insulators or 155 architectural proteins. An insulator is a regulatory element that suppresses the interactions of other 156 regulatory elements with genes, or prevents the spread of chromatin state. An architectural protein is a 157 protein that organizes and regulates chromatin structure. The most prominent motif by q-value binds to 158 DNA replication-related element factor (DREF), a known architectural protein and the "master key-like 159 factor for cell proliferation" [39]. It is required for normal progression through the cell cycle. It is known 160 to occur in the promoters of many cell proliferation genes and to interact with chromatin remodeling 161 proteins. Interestingly, DREF binding site overlaps with the binding site for BEAF-32, another well-162 researched protein that acts as an insulator [40,41] that often appears between head-to-head genes 163 (genes with adjacent promoters that get transcribed in opposite directions). Another identified motif is 164 predicted to bind ZIPIC, which is known to bind and recruit CP190, an insulator. A previous study 165 provides evidence for the co-localization of ZIPC and BEAF-32 [42], which likely work together with 166 CP190 to perform insulator functions. Thus of the most enriched motifs in maternal genes (DREF, BEAF-167 32, ZIPC), many have previously identified roles as insulators or in other ways regulating chromatin 168 state. 169 Another maternal motif identified is predicted to bind M1BP (motif-1 binding protein), which causes 170 RNA polymerase II (Pol II) to pause on the gene [43]. Pol II pausing is critical to early zygotic 171 expression[36,44] but its function in producing the maternal transcriptome is unknown. Several 172 functions have been suggested for this Pol II pausing behavior, including maximizing transcription speed 173 once certain conditions are met, synchronizing with RNA processing machinery, reacting to other 174 developmental or environmental signals, keeping chromatin accessible, and acting as an insulator. Given 175 that M1BP is both maternally deposited at high levels and has increased expression in the early embryo, 176 it is possible that M1BP has multiple functions at different time points. During oogenesis, pausing to 177 wait for external signals or RNA processing machinery seems counterproductive to maximizing 178 transcription in the ovary, but the other function of maintaining a state of open chromatin and 179 solidifying TAD boundaries may be very important. In contrast, at stage 5 it may be much more 180 important to maximize expression in response to certain signals. 181 In searching for motifs associated with zygotic expression, we recovered motifs for well-known 182 regulators of the zygotic genome (Table 1). We only identified a small number of highly enriched motifs 183 at this stage, and thus were able to predict a much smaller number of predicted factors binding to these 184 motifs, including Trl (or GAGA factor) and Zelda. Trl is a known early zygotic activator and chromatin 185 remodeler [45][46][47] and Zelda is known as a "master key regulator" to early developmental genes[9,48] 186 and appears to be a pioneer transcription factor that establishes the initial chromatin landscape of the 187 zygotic genome [5] . In addition to these high-quality motifs, we found a large number of motifs with 188 lower quality scores (Table S1). These motifs may regulate spatio-temporal specific genes that we 189 observe in the early embryo, and thus have a lower enrichment score due to our whole-embryo 190 approach being ill-equipped to finding such specific patterns. 191

192
To quantify the conservation of the discovered motifs across the 11 species in our study, we used 193 Tomtom[33] to measure the similarity between the sets of motifs discovered in different species. For a 194 motif to be considered conserved between two species, we required that it be discovered by HOMER in 195 both species and for Tomtom to report a statistically significant alignment score (see Methods). At the 196 maternal stage, we found that high quality (q-value < 1e-100 by HOMER, see Methods) motifs tended to 197 be well-conserved ( Fig 1A) with a large percentage of the total discovered motif content shared across 198 species. We observed that sister species D. pseudoobscura and D. persimilis are unique in that they have 199 the highest number of motifs that are either species-specific or are only shared with each other, and 200 have the fewest number of motifs shared with the rest of the species. This is especially noteworthy 201 considering that this lineage is roughly in the middle of the distribution of divergence times from most 202 of the other species, and thus many more distantly related species comparisons have a higher degree of 203 motif conservation than do any comparisons with these two species. This is consistent with previous 204 results[24] that this lineage has a disproportionately high number of changes in transcript abundance for 205 its phylogenetic position, and suggests that these large number of changes in transcript abundance may 206 be due to the large scale changes in regulation in these species observed here . When comparing the 207 rest of the species, we found a relatively higher number of conserved motifs shared between pairs of 208 species within the Drosophila melanogaster species group (D. melanogaster, D. simulans, D. sechellia, D. 209 yakuba, D. erecta, D. ananassae), and a slightly reduced number of conserved motifs between the D. 210 melanogaster group species and the more distantly related species (D. willistoni, D. mojavensis, D. virilis) 211 ( Fig 1B). At stage 5, we do not observe a high percentage of conserved motifs between species, rather 212 we observe many motifs that are significantly enriched in just one or two species. We also observe little 213 phylogenetic signal in the data, with the only detectable pattern being that the species with the longest 214 divergence time from the rest of the species, D. virilis and D.mojavensis, have slightly fewer shared 215 motifs (Fig 1 C,D). If the unique motifs at either stage indeed represent newly evolved regulatory 216 mechanisms, we expect that these motifs to be rare or to have a smaller frequency difference between 217 transcribed and non-transcribed genes. Either of these effects would raise the false discovery rate as 218 reported by HOMER, which makes the number of species-specific zygotic motifs identified all the more 219 remarkable. Additionally, more highly conserved motifs should require less power to be discovered as 220 they are by definition present across more species, and thus we should have more power to identify 221 them than less-conserved motifs. It is still possible that there are more conserved motifs at the zygotic 222 stage that we do not observe due to the lower number of genes used at this stage. Despite this, 223 however, the dominant signal we find from the motifs we have power to detect is non-conserved. This 224 is underscored by the observation that when we reduce our quality threshold for motifs at stage 5, we 225 still do not observe motifs to generally be conserved across species (Fig S4 B). 226

227
While these results show that some motifs are important to regulation in the genomes of multiple 228 species, do not speak to whether orthologous genes in different species tend to contain similar motifs. 229 To investigate whether regulation was conserved at the level of individual genes, we compared the 230 motif content of each D. melanogaster gene (see Methods) to the motif content of each of its orthologs 231 from other species. We counted motifs as conserved between two species if the motif appeared in both 232 orthologs. For both stage 2 and stage 5, we categorized motifs based on the percent of orthologs for 233 which the motif was conserved (Fig 1E). Motifs have different levels of gene-specific conservation 234 between stages, with maternal stage motifs appearing to have lower conservation across orthologues 235 than zygotic stage motifs, where a larger proportion of orthologues possess the same motif. This is 236 striking, as this seems to imply that while gene expression and regulation are both highly conserved for 237 maternal genes, which genes are regulated by a particular regulator is not.It is possible that the genes 238 that are missing motifs compared to their orthologues are regulated by different motifs, or that the 239 same motifs that are in radically different positions in different species. As many different maternal 240 motifs appear to be regulating transcription at the level of chromatin state, these motifs may be able to 241 function interchangeably. Thus this environment may be more conducive to more motif turnover at this 242 stage but with higher conservation of transcription overall [24], as compared to the zygotic stage. 243

244
While similar binding motifs identified in multiple species implies that regulatory proteins with similar 245 binding domains are acting in these species, we can also verify the similarity in the regulatory machinery 246 by the relative positions of the binding sites relative to the genes they are regulating. To investigate 247 whether the discovered motifs had the same positional relationship with the transcription start site 248 (TSS) across all species, we generated position frequency data for each motif. For each gene, we 249 examined each position starting from 2kb upstream of the TSS to the 3' end of the gene body, and 250 whether there was a motif at that position. Many of the most prominent motifs shared a similar 251 distribution pattern, characterized by a strong peak at -100bp, and sometimes a secondary peak at -252 340bp ( Fig S2). To quantify this similarity, we performed an Anderson-Darling test on each motif for 253 each pair of species, which indicated that 65% (stage 2) and 91% (stage 5) of motif distributions are 254 identical between species (percent of motifs for which p < .05). This suggests conservation of the 255 relationship between binding to these motifs and initiation of transcription. The higher conservation of 256 motif position in stage 5, which has fewer conserved motifs between species than stage 2, may be 257 consistent with this stage having more gene-specific regulation, as discussed further below. 258

259
While some studies focus on finding motifs with a particular orientation relative to their proximal 260 genes[49], there is some evidence that motifs do not behave in a strand-specific manner [50]. To 261 evaluate the importance of the strandness of the discovered motifs, we generated a regression to 262 predict expression level that differentiated between forward and reverse versions of each motif (see 263 Methods). This regression indicated a significant difference between the forward and reverse versions of 264 many motifs. For example, we found the E-box motif affects the log-odds of maternal deposition by .192 265 in the forward orientation but only .115 in the reverse orientation (t-test, p < .001). For almost all 266 motifs, different strands had the same qualitative effect on expression, but with different magnitudes, 267 indicating that while motifs had the same effect regardless of orientation, their efficiency could be 268 increased if the orientation was optimal. 269 While the strandedness of motifs may play a small role in their overall effect, we want to know if 270 strandedness makes a qualitative difference to our motifs effects on transcript level, and if we can use 271 motif strand to improve our model. To determine this, we ran HOMER exclusively on the same strand 272 that the gene appeared on, rather than the default mode of scanning both strands. This resulted in the 273 same set of motifs being discovered. This is consistent with the regression results that show that each 274 motif, whether located on the positive strand or the negative strand relative to the transcription start 275 site, has the same qualitative effect on gene expression, indicating that the direction of each motif had 276 minimal effect on expression. To evaluate whether the strand the motif was located on relative to the 277 gene was predictive in whether a gene was transcribed at a particular stage , we constructed another 278 regression using only the data from the same-strand motifs. This regression performed less well than 279 the regression using motifs from both strands (AIC = 7915.8 for the unstranded regression, AIC = 8612.5 280 for the stranded regression for a representative species D. ananassae). Overall, this suggests that motif 281 binding elements need not bind in a strand specific manner to induce their effects, though the optimal 282 orientation provides measurable increase in their effect on transcription. This result is the same at both 283 stage 2 and stage 5. 284

285
While we have identified a set of motifs that together seem to be responsible for early embryonic 286 RNA content, we next asked if these motifs are likely to be regulating genes with specific types of 287 functions. To this end, we performed gene ontology (GO) analysis on groups of genes, based on their 288 motif content. To simplify this analysis, we chose to focus on the top 8 motifs as reported by HOMER, 289 and for each of the 8 motifs, we performed GO analysis on the transcript pools at each stage as well as 290 on each motif individually [51,52]. We initially performed a GO analysis on both the maternally deposited 291 and zygotically transcribed transcript pools, disregarding motif content. When comparing stages, we 292 observe no overlap between GO terms (Fig 2A), which is consistent with our expectations that the genes 293 that are activated in the zygote have different functionality to those transcripts that are maternally 294 deposited, especially as our definition of zygotically transcribed genes excludes genes present in stage 2. 295 When examining genes containing specific motifs within each stage, we observe that many of the stage 296 2 motifs show a similar pattern in the GO categories they are associated with, with the strongest 297 associations belonging to the DREF motif, which is strongly associated with most identified categories 298 ( Fig 2B). This could be an indication that there is a high degree of homogeneity in terms of the types of 299 genes these motifs may regulate. In contrast, the stage 5 motifs present in zygotic-only genes show 300 more variety in the GO terms of genes they are associated with (Fig 2C), which could be indicative of 301 more specific regulation for these genes at this stage. 302 While the previous GO analysis indicated that the top motifs at stage 2 display significant overlap in 303 associated GO categories, this does not exclude the possibility that specific GO categories are regulated 304 by specific motifs. To search for more specific motifs, we performed motif analysis using HOMER to find 305 overrepresented sequences in the top GO terms within maternally deposited genes, resulting in several 306 motifs which are enriched in specific GO terms (Fig. S5), though very few of them are significantly 307 enriched after multiple test correction. These motifs do not appear in other analyses, and do not have 308 strong matches to proteins expressed in the ovary found in the literature. Because these motifs are 309 associated with a small subset of genes, we hypothesized that these motifs confer specificity to 310 transcription of specific genes with accessible chromatin. To determine whether these motifs are 311 associated with increased expression at stage 2, we used linear models to measure the effect of the 312 presence of these motifs, specifically in genes that already contain motifs that bind to architectural 313 proteins, or whose adjacent genes are highly expressed. We did not find that the presence of these GO 314 term-specific motifs increased the odds of maternal deposition ( Fig S5). It is possible that this result is 315 due to the lack of statistical power surrounding these motifs, as these motifs are somewhat rare. This 316 result could also be the underlying biology, however, and these motifs could be non-functional at stage 317

318
Predicted maternal motif binding proteins are enriched in the ovary 319 Next, we investigated whether the potential motif binding proteins we identified were plausible 320 regulators of maternal deposition. It is unclear whether the motifs we identified as enriched in 321 maternally deposited genes are associated specifically with maternal deposition, given that chromatin 322 regulators are important at all stages in all tissues. To investigate, we used modENCODE [53] transcript 323 abundance data to compare the mRNA transcript levels for proteins predicted to bind our discovered 324 motifs, and found increased expression in ovaries ( Fig 3A)as compared to other tissues sampled. This 325 pattern exists, though to a lesser extent, in the FlyAtlas 2 dataset[54], which is a tissue-specific database 326 of transcript levels that utilizes RNA-seq data rather than microarray analysis. The discrepancy between 327 the two datasets could be due to the differences in gene expression measurement method or in 328 experimental methods. The transcripts for these proteins also show moderately high abundance in our 329 own dataset (File S5). While it has been demonstrated that mRNA levels do not necessarily mirror 330 protein levels[55], the enrichment of mRNA in ovaries compared to other tissues is reasonable evidence 331 that these proteins are important in ovaries. 332 To investigate whether these proteins are acting to affect transcription in the ovaries specifically, we 333 examined the expression profiles of RNA in various tissue types (referenced in Fig 3B) from existing RNA 334 quantification datasets [53,56]. For each instance of a motif of interest, we extracted the transcript level 335 from within a 20kb window surrounding the motif and measured the normalized relative transcript level 336 for each position (an example of this is shown in Fig 3B). While the relative normalized transcript level 337 changes in each of the measured tissues, the effect is strongest in ovaries, indicating that the presence 338 of one of these binding sites is associated with a higher increase in transcript levels in the ovary 339 compared to other tissues. 340 As the motifs associated with maternal transcription also act to some degree in other tissues, we next 341 wanted to ask whether the motifs were more enriched in maternally deposited genes than in genes 342 expressed in other tissues. To determine whether regulation in different tissue types were associated 343 with different motifs, we ran HOMER in the same manner as with the maternal stage data to discover 344 enriched motifs (see Methods) in transcripts present in other tissues, as identified from ModENCODE 345 data [57]. We found that most other tissue types were also enriched in the same motifs discovered in 346 transcripts present in stage 2 embryos. However, examining the frequency of motifs in specific genes 347 revealed that the majority of those motifs were from genes that were shared between those tissue 348 types and stage 2 embryos. When we exclude genes that are expressed in stage 2 embryos, HOMER fails 349 Maternally deposited genes are physically clustered on the genome 358 In addition to motifs, we observed several other effects that were related to early embryonic RNA 359 content. Given that many of our discovered motifs bind architectural proteins, we hypothesize many 360 effects may be linked to the physical location of genes on the chromosome. We examined the positional 361 distribution of transcribed genes in various tissue types ( Fig 4A). As previous papers utilizing the Hi tissue of the previously described tissue types. While all tissues examined showed a strong preference 368 for groupings of transcribed genes, embryonic stage 2 samples were the most highly grouped (Fig 4B). 369 This result was robust to changes in the threshold of what is considered to be expressed (see Methods). 370 This pattern of physical co-expressed gene clustering on the chromosome is consistent with our model 371 of regulation via architectural proteins. 372 While these results speak to the pattern of clustering of expression for maternal genes in terms of 373 adjacent genes being on or off, they do not account for the distance between genes. To answer the 374 question of whether this clustering phenomenon is dependent on distance, we examined the distance to 375 adjacent genes. We observed a trend whereby proximity to an active promoter increases the odds of 376 maternal deposition (Fig 4C). This effect was slightly affected by the strandedness of the two genes 377 whereby genes that have an opposite orientation are more likely to have different expression. This is 378 consistent with observations from previous studies[34] that consecutive genes on the same strand were 379 more likely to show co-expression, while consecutive genes on opposite strands were more likely to 380 have different expression. 381 Many previous studies have observed that zygotic genes tend to be short in length [24,30,64,65]. In 382 addition to affecting transcription speed, shorter gene lengths result in a smaller distance between 383 transcriptional units along the chromosome, especially when considering which strand the gene is on. To 384 explore gene length in maternal genes and the relationship between gene length and the position on 385 the chromosome, we measured the maternal deposition rates with respect to gene length. We observed 386 a trend that in most species, shorter genes are less likely to be maternally deposited. There are 387 differences in the length of maternal genes across species, and this trend could be partly due to the bias 388 for more highly annotated genomes to be enriched in shorter genes ( Fig 4D). Additionally, chromatin 389 context seems to heavily influence this effect: when the adjacent genes are off, gene length is much 390 more important (Fig 4C) and very short genes are very likely to be off. This could be because shorter 391 genes are more likely to be influenced by the regulatory machinery of a nearby gene. Alternatively, 392 longer genes might be long enough to physically isolate themselves more effectively and establish their 393 own unique regulatory environment. 394 Given that a number of motifs found in this study are bound by proteins annotated as insulators, and 395 the motifs are similar to those that are associated with TADs, we asked where the motifs found in our 396 dataset can be found relative to TAD boundaries. Previous results suggest that architectural proteins are Stage-specific genes are isolated on the genome 408 Given that maternally deposited genes are physically clustered together in the genome, we wanted to 409 examine if this pattern held with the set of genes that were stage-specific. To determine if consecutively 410 expressed cluster size is related to stage-specificity of transcript representation, we examined maternal-411 only (transcripts present at stage 2 and entirely degraded by stage 5) and zygotic-only genes (transcripts 412 present at stage 5, not present at stage 2; for both stage-specific categories, see Methods for further 413 definitions) and their frequencies in clusters of different sizes. We determined that for most species, in 414 contrast to all maternally deposited genes, both maternal-only and zygotic-only genes are more likely to 415 be in smaller (1-3 consecutive active genes) groups than in larger groups (more than 3 consecutive 416 active genes) (Fig 5, A and B). For these stage-specific genes, this could be an indication that control of 417 stage-restricted genes is more specific, affecting single genes rather than larger clusters. Results for 418 most other analyses of maternal-only genes were unable to be obtained due to the very low number of 419 genes in this category (see Methods). 420

421
In Drosophila, transcription start sites are frequently associated with a spike in GC content. These spikes 422 in GC content have been suggested to act as "genomic punctuation marks" to delineate functional 423 regions, though their mechanisms of action are not clear [66]. To explore this phenomenon with respect 424 to the two developmental stages we examined, we evaluated the average GC content of upstream 425 regions for genes in stage 2 and stage 5. When comparing the GC-content of putative cis-regulatory 426 sequences in maternally versus non-maternally deposited genes, we observed an increase in GC-content 427 upstream of the TSS (Fig S3), as well as a dip in GC content ~200bp upstream of these genes. In contrast, 428 this modulation does not occur in genes that are off at both stage 2 and stage 5, nor in genes that are 429 off at stage 2 but activated at stage 5. To determine whether this modulation of GC-content was 430 predictive of maternal deposition, we constructed four generalized linear models using the GC-content, 431 the motif data, and both the motif data and GC-content as data sources (see Methods). Adding the GC-432 content to the model that already included motif data improved the model (AIC: 185589 without GC 433 content AIC:183079 with GC content), hence increased GC content upstream of TSS is somewhat 434 predictive of maternal deposition, even when accounting for motif presence in this region. 435 The biological significance of this spike in GC content is unclear. Previous research has shown that the maternal and zygotic mRNA expression profiles of different 455 species of Drosophila are generally conserved, but with some noticeable differences [24]. To investigate 456 the regulatory basis of transcription at these stages, we leveraged a large comparative dataset to 457 identify the transcription factor binding motifs found in the cis-regulatory sequences of these genes. We 458 found that the regulatory basis of both the maternal and zygotic-only transcripts also had significant 459 conservation, which permitted the discovery of common features of gene regulation across Drosophila. 460 Specifically, we identified transcription factor binding motifs that are associated with mRNA expression 461 across species for maternally deposited transcripts and zygotically expressed transcripts. We also 462 investigated the effects of other regulatory mechanisms such as chromatin state on maternal and 463 zygotic expression of mRNAs, as well as the association of transcript levels at these two stages of 464 embryogenesis with gene length, strandedness, and GC content. 465 Generally, we found a number of conserved transcription factor binding motifs associated with 466 transcript abundance for both the maternal and zygotic-only transcripts. At the maternal stage, there 467 were a larger number of more highly conserved motifs than were found for the zygotic-only genes. This 468 is consistent with a previous study that found that maternal transcripts themselves were more highly 469 conserved than transcripts at the zygotic stage [24] . Given this, surprisingly we also found less 470 conservation of particular motifs at conserved genes transcribed at the maternal stage. There is some disagreement on the effect that TADs have on gene expression, however. Ghavi-helm et 499 al [73] demonstrate that the disruption of TADs does not necessarily disrupt the constituent gene 500 expression. Instead, they suggest TAD boundaries acting to prevent interactions between TADs is rare or 501 tissue specific. Others suggest that it is possible that TADs are increasing robustness to other regulatory 502 mechanisms [74]. Because TAD-associated elements appear to be associated with maternal deposition 503 in our dataset, we hypothesize that these elements are regulating maternal deposition via chromatin-504 level control. It is possible that there are other additional mechanisms that we do not detect. 505 To understand the connection between architectural proteins and maternal deposition, we need to 506 examine where these transcripts are produced to understand the cellular context. In the ovary, nurse 507 cells are responsible for the transcription of maternally deposited genes, and there is a considerable 508 body of literature devoted to nurse cell biology. Much study has been directed towards elucidating how 509 nurse cells transport their products into the oocyte and how post translational control mechanisms fine- Given the amount of overlap between the motifs enriched in the cis-regulatory regions of maternally 522 deposited genes and the motifs associated with TAD boundaries, it is possible that these same 523 architectural proteins are functioning to define which genes are maternally transcribed and then 524 deposited into the embryo. We found that the maternally deposited genes are highly clustered on the 525 genome, which is indicative of control via architectural proteins. Additionally, we uncovered that 526 proximity to nearby expressed genes is highly correlated with expression. We also identified a pattern 527 whereby the relative strandedness of adjacent genes is indicative of whether they will be maternally 528 deposited, which is a pattern that has been previously observed with insulators[34]. Each of these 529 results is consistent with known behavior of architectural proteins, suggesting that expression at stage 2 530 is controlled locally on the chromosome by activating TADs rather than specific genes. 531 As architectural proteins are important in determining genome organization and regulating transcription 532 to some degree in all tissues and stages, we investigated whether the regulatory patterns we observed 533 for maternal genes were ovary-specific or shared across all stages and tissues. Many of the motif binding 534 elements discovered in this analysis appear to be enriched in ovaries, although these proteins have 535 important functions in other tissues as well. Some of the proteins predicted to bind our motifs have 536 been noted for being enriched in the regulation of housekeeping genes, and as maternally deposited 537 genes themselves are enriched in housekeeping genes, this result is perhaps unsurprising. A number of 538 studies have suggested that in addition to the common architectural proteins shared across conditions 539 and developmental stages, there may exist tissue-specific architectural proteins that integrate into the 540 canonical protein complex to produce tissue-specific TAD patterns [79][80][81]. Perhaps this is the case with 541 the ovary, and further study will reveal whether there are ovary-specific factors that may interact with 542 the common architectural proteins whose binding sites we find enriched here. For example, the authors 543 of Mataz et al. 2012 [82] suggest that Shep may be a tissue-specific factor interacting with architectural 544 proteins in the central nervous system. The enrichment Shep in the central nervous system is even less 545 extreme than the enrichment we observe of CP190 (a known interaction partner of ZIPC, one of our 546 maternal expression associated motifs) in ovaries, suggesting that CP190 could also qualify as tissue-547 specific. Alternatively, the polyploid nature of nurse cells and the extensive and rapid transcription that 548 occurs in these cells may instead provide an extreme enrichment of the common architectural proteins, 549 without the need for stage or tissue specific architectural proteins.

562
Our examination of motifs that are associated with zygotic mRNA expression revealed several previously 563 discovered motifs, including those that bind Zelda and GAGA factor (Trl). Additionally, several motifs are 564 likely binding sites for other well-characterized developmental proteins (Table S1) which are sometimes 565 highly localized in the embryo. If transcripts are produced in a spatially localized manner, they are 566 necessarily not expressed in the entire embryo, and thus their signal may be more difficult to detect in 567 our data from whole embryos. Overall, we observe few motifs at stage 5 that are conserved across 568 species, in comparison to motifs for maternally deposited genes. However, the motifs that we do find at 569 stage 5 tend to higher conservation within specific genes than the motifs we discover at stage 2. This 570 highlights that it may be more important for specific genes to have precise signals after ZGA. 571 Additionally, in our zygotic analysis, we focused only transcripts that are present at stage 5 and do not 572 have a maternal component, as many maternally deposited transcripts are still present at stage 5 573 (roughly half of maternal transcripts are still present at this stage[8,30-32]). Because many maternal 574 transcripts are still present, analysis of the total stage 5 transcriptome would largely recapitulate the 575 stage 2 results, especially as stage 5 transcripts are much more likely to be expressed in specific spatio-576 temporal patterns, which to our whole-embryo analysis would appear as low or noisy signal. Our 577 decision to remove transcripts with maternal deposition highlights the signals that are unique to stage 5, 578 but comes at the cost of an overall reduction in the number of genes available for analysis, resulting in 579 higher false discovery rates for all motifs. 580

581
In this study, we examined regulatory elements associated with maternal transcripts present at stage 2 582 of embryogenesis and zygotic transcripts present at stage 5 across species of Drosophila. At both stages, 583 we found regulatory motifs that are conserved throughout the ~50 million years of divergence 584 represented by these species, which speaks to a conservation of regulatory mechanisms across the 585 genus. In general, the high degree of conservation in regulatory elements at the maternal stage and the 586 zygotic stage, while different from one another, speaks to the critical nature of the complement of 587 transcripts present to direct early embryogenesis. The differing patterns observed in the obscura group 588 species (D. pseudoobscura and D. persimilis), and the regulatory basis of changes in transcript 589 representation between species are the subject of ongoing study. At the maternal stage, we found many 590 regulators that appear to be defining general regions of the genome to be transcribed via chromatin 591 regulation through architectural proteins and likely at the level of TADs. Given the exceptionally high 592 level of conservation of maternal transcript deposition, the relatively non-specific mechanism of 593 maternal gene regulation appears contradictory. In contrast, we found zygotic regulatory elements to be 594 considerably more gene-specific. The different patterns of regulation for transcripts present at these 595 two stages of embryogenesis is consistent with the specific transcriptional contexts of these two 596 genomes, with the non-specific mechanism active in highly transcriptionally active polyploid nurse cells 597 in oogenesis in the mother, and the gene-specific mechanism acting in the zygote where transcription is 598 often localized in time and space. To determine which genes were orthologues, we used the FlyBase orthology table 615 "gene_orthologs_fb_2014_06_fixed.tsv". 616

617
Preliminary tests were performed to determine which regions were most likely to have regulatory 618 elements. For each gene, several regions were extracted: 10kb upstream,5kb upstream, 2kb upstream, 619 1kb upstream, 500bp upstream, 5' UTR, total introns, total exons, and 3' UTR. For each region, 620 boundaries were obtained from the appropriate GTF and sequences were extracted using BioPython 621 (Version 1.73,[87]). The 2kb upstream region showed the highest quality motifs (Fig S1), and thus were 622 used for matching motifs in external databases, measuring motif overlap between species, analyzing 623 motif position distributions, and GO analysis. For these analyses,featured in figures 1 through 3, UTRs 624 were ignored as not every species had annotated UTRs. 625

626
We used HOMER[29] to discover motifs in test sets using the background sets as control FASTA files, test 627 and background sets are defined below. Deviations from the default settings include the use of the -628 fasta flag to specify a custom background file. For stage 2 queries, the test FASTA files included genes 629 that had a FPKM >= 1 at stage 2 while the control FASTA files included genes that had an FPKM < 1. For 630 the stage 5 queries, the test FASTA files contained genes where the stage 5 FPKM >= 1 and the stage 2 631 FPKM < 1, while the control FASTA files included genes whose stage 5 FPKM < 1 and stage 2 FPKM < 1. 632 Additionally, we used the -p flag to utilize our computational resources more efficiently. We used -633 norevopp flag in the case of strand-specific searches. Motif quality was evaluated based on the HOMER-634 outputted q-values. 635 To validate the HOMER output files we used MEME[33] v4.12.0 and RSAT [88]. MEME was run using-mod 636 zoops -nmotifs 2 -minw 8 -maxw 12 -revcomp. The RSAT analysis uses the purge-sequences tool, 637 followed by oligo-analysis using the following parameters: -lth occ_sig 0 -uth rank 5000 -return 638 occ,proba,rank -2str -noov -quick_if_possible -seqtype dna -l 8, followed by pattern-assembly using the 639 following parameters: -v 1 -subst 1 -toppat 5000 -2str, followed by matrix-from-patterns using the 640 following parameters: -v 1 -logo -min_weight 5 -flanks 2 -max_asmb_nb 10 -uth Pval 0.00025 -bginput -641 markov 0 -o purged_result. 642

643
For analyses of zygotic transcripts, such as the motif analysis, we defined genes as being zygotic-only if 644 they were off at stage 2 (FPKM <1) and on at stage 5 (FPKM >1), for N=10,215 genes across all species. It 645 is necessary to impose such a restriction, as a large percentage (approximately 85%) of genes that are 646 zygotically expressed were also maternally deposited, and analysis of stage 5 regulatory mechanisms 647 would be confounded the signal of stage 2 genes. For analyses of maternal-only transcripts, we define 648 maternal only if they are on at stage 2 (FPKM >1) and off at stage 5 (FPKM <1). As the class of maternal-649 only genes is very small (N=3194 across all species), we were unable to obtain results for some analyses 650 such as the motif content detection and GO analyses for this group of genes. 651

652
To determine weather motifs were shared between species, the HOMER-formatted motifs were 653 converted to meme-formatted motifs using chem2meme from the MEME Suit [33]. Tomtom, also from 654 the MEME Suit, was then used to find matching motifs, using default parameters. For a motif to be 655 considered shared with another species, the Tomtom output threshold of α = .05 was used. this 656 technique was used to calculate the similarity of motifs found in different species, as well as to evaluate 657 the similarity of different motif discovery strategies using MEME, RSAT, or HOMER with alternative 658 parameters. 659 To refine the results of shared motifs, we applied an additional quality cutoff. For stage 2, motifs were 660 first filtered for a q-value of less than 1e-100, and for stage 5, motifs were first filtered for a q-value of 661 1e-10. The difference in the cutoffs used at the two different stages was due to the differences in the 662 overall distribution of q-values for these stages due to a reduced number of zygotic-only genes (see 663 zygotic-only motifs above). 664 Because sharing was calculated on a by-species basis, it is possible that one species has a motif that 665 meets the criteria for being shared among all other species while other species' version of that same 666 motif failing to meet the criteria. This can occur, for example, when a motif is an intermediary version of 667 two motifs that fall just outside the cutoff. 668 To find proteins that bind to the discovered motifs, we used Tomtom to query JASPAR and Combined 669 Drosophila Databases using the default parameters [89]. 670 Zygotic-only motifs section above for definition). This threshold approximates the percent of the 706 genome that we observed to be zygotic-only. We then performed an enrichment analysis using 707 enrichGO's default parameters using a background set of D. melanogaster genes that are not maternally 708 deposited in at least two species. This analysis therefore specifically examines the zygotically activated 709 genes in the context of genes that are "off" at stage 2 ( FPKM<1 at this stage). For our analysis of stage 2 710 motifs, we generated a test set for each motif consisting of genes that contained that motif in at least 711 two species and were maternally deposited (FPKM > 1) in at least two species. We then performed an 712 enrichment analysis using enrichGO's default parameters using a background set of all D. melanogaster 713 genes. For our analysis of stage 5 motifs, we generated a test set for each motif using genes that were 714 represented by transcripts >1 FPKM at stage 5 in at least two species and had the motif of interest in at 715 least two species. We then performed an enrichment analysis using enrichGO's default parameters 716 using a background set of D. melanogaster genes that were represented by transcripts >1 FPKM at stage 717 5. To visualize our results, we employed the dotplot method for enrichGO objects, also from the 718 clusterProfiler package. For each motif, the top 3 GO terms were identified and added to the y-axis 719 labels. Whenever any GO category from another motif was identified as statistically significant (α = .05), 720 that GO category was shaded appropriately. 721

Motif Position and Count
To discover motifs associated with particular GO categories, we generated a list of genes that were both 722 maternally deposited and associated with each GO term of interest, as well as a list of genes that were 723 maternally deposited but not associated with the GO term of interest. For each GO term, we ran 724 HOMER using the same parameters as the initial motif discovery, using the genes associated with the 725 GO term as the test list and the genes not associated with the GO term as the background. We restricted 726 this analysis to the upstream regions of Drosophila melanogaster genes. 727

728
Logistic regression was performed using the "glm" function in R, using the logit link function. As inputs, 729 we used the list of motifs generated from HOMER and their counts as described in the "Motif Position 730 and Count" section above. To avoid redundant motifs in our model, only motifs of size 10 were 731 considered. To evaluate the strand-specificity of motifs, we compared two generalized linear models 732 using the formulas indicated in S1_Model_Generation.pdf. To identify the most important motifs, the R 733 function stepAIC from the MASS library 7.3-51.4[92] was used to find generate an ordered list of motifs. 734 The base model used contained no additional features (chromatin state, etc). StepAIC was run 8 steps to 735 generate a short list of motifs for evaluation. 736

737
To evaluate the effect of gene cluster size on expression, we iterated through each species for both 738 stage 2 and stage 5 and assigned sizes of co-expressed gene clusters on the chromosome, based on how 739 many adjacent genes were coexpressed, resulting in cluster size frequencies for each genome. Errors 740 were calculated using 95% confidence interval for a two-tailed binomial distribution. 741 To compare the clustering of different datasets with varying percents of "on" genes, we employed the 742 Wald-Wolfowitz runs test. 743 extracted, and the number of G and C nucleotides was counted. The result was divided by 50 to get the 762 %GC for this window. To calculate the GC content for the next bin, this process was repeated on the 763 region from -2bp to -51bp. Each bin had its GC content evaluated this way until the final bin of -451bp to 764 -500bp. To evaluate how closely a particular upstream region resembled a maternally deposited-like 765 distribution or a non maternally deposited-like distribution for the purposes of modeling, we calculated 766 the average GC content for each position of maternally deposited, and not maternally deposited genes. 767

Tissue-specific RNA Levels
Then for each gene, we measured the correlation between the GC content and that of both category 768 averages. We used the difference in these correlations as a metric to evaluate similarity in GC Table 1: A summary of the top ranked motifs. HOMER was used to find motifs enriched in the 2kb 984 windows upstream of maternally deposited genes (stage 2) and zygotically transcribed genes (stage 5). 985 Sequence logo shows the consensus motif where the probability of each base is proportional to its 986 representative character. P-value is given by HOMER. %target represents the percent of either 987 maternally deposited or zygotically expressed genes that contain at least one instance of the motif. 988 %background indicates the percent of all genes that contain this motif. Best match indicates protein 989 with a previously identified binding site that mostly closely matches the discovered motif (see Methods). 990