Widespread conservation and lineage-specific diversification of genome-wide DNA methylation patterns across arthropods

Cytosine methylation is an ancient epigenetic modification yet its function and extent within genomes is highly variable across eukaryotes. In mammals, methylation controls transposable elements and regulates the promoters of genes. In insects, DNA methylation is generally restricted to a small subset of transcribed genes, with both intergenic regions and transposable elements (TEs) depleted of methylation. The evolutionary origin and the function of these methylation patterns are poorly understood. Here we characterise the evolution of DNA methylation across the arthropod phylum. While the common ancestor of the arthropods had low levels of TE methylation and did not methylate promoters, both of these functions have evolved independently in centipedes and mealybugs. In contrast, methylation of the exons of a subset of transcribed genes is ancestral and widely conserved across the phylum, but has been lost in specific lineages. A similar set of genes is methylated in all species that retained exon-enriched methylation. We show that these genes have characteristic patterns of expression correlating to broad transcription initiation sites and well-positioned nucleosomes, providing new insights into potential mechanisms driving methylation patterns over hundreds of millions of years.


Introduction 1
In most organisms DNA bases are adorned with a variety of chemical modifications. Amongst the 2 most common of these is methylation at the 5 position of cytosine (C5me), which is present from 3 bacteria to humans (Ponger and Li, 2005;Casadesús and Low, 2006;Jurkowski and Jeltsch, 2011). 4 In eukaryotes, a key property of cytosine DNA methylation is its ability to act epigenetically -that is, 5 once introduced, methylation at specific cytosines can remain in place through cell division (Holliday 6 et al., 1987;Holliday, 2006). This relies on the activity of "maintenance" methyltransferases, DNMT1 1 0 and highly expressed genes lacked methylation suggesting that neither conservation nor expression 1 is sufficient to explain gene body methylation.
2 Nucleosome positioning influences DNA methylation levels across arthropods 3 In order to investigate molecular mechanisms that might be responsible for influencing DNA 4 methylation we examined how the correlation in methylation between pairs of CpGs varied with 5 increasing separation. In many species with exon-enriched methylation the correlation coefficient 6 between methylation levels of individual CpGs oscillated periodically ( Figure 6A,B). Fourier analysis 7 showed that the period of oscillation was ~160 nucleotides, roughly corresponding to the average 8 nucleosome repeat length ( Figure 6A,B; Figure S6-1). We quantified this nucleosome-length 9 periodicity within exons across all species. While the majority of species with exonic methylation 1 0 displayed a nucleosome periodicity signal, its magnitude varied greatly -for example H. melpomene 1 1 has gene methylation but less apparent periodicity ( Figure 6B). Interestingly a clear signal of 1 2 periodicity was also seen for TE methylation in S. maritima and P. citri, both of which have high levels We wondered whether the periodicity in correlation between methylated DNA might reflect an To annotate rRNA, we either used existing annotations or RNAmmer v1.2 (Lagesen et al., 2007). To 1 2 annotate tRNA, we either used existing annotations or tRNAscan-SE v1.3.1 (Lowe and Eddy, 1997).

3
To avoid ambiguous results caused by overlapping features, we screened out any TE annotations 1 4 that overlapped any rRNA, tRNA or exon, and any upstream regions which overlapped any TE, rRNA, 1 5 tRNA or exon.

6
Whole genome bisulphite sequencing 1 7 To measure DNA methylation on a genome-wide scale, we carried out whole-genome bisulphite 1 8 sequencing. We used the DNeasy Blood and Tissue kit (QIAGEN) according to the manufacturer's 1 9 protocol to extract DNA from adult somatic tissues of the following species: L. polyphemus, P. protocol (see Supplementary Table 1 for detailed sample metadata and sequence accession codes).

7
We sequenced these libraries on an Illumina HiSeq 2500 instrument to generate 100bp paired-end 2 8 1 6 reads. We used pre-existing whole-genome bisulphite sequencing datasets for P. hawaiensis 1 (SRR3618947, (Kao et al., 2016)) and A. mellifera (SRR1790690, (Galbraith et al., 2015)). intervals for the mean methylation of genes and TEs within each species using 1000 nonparametric 2 7 bootstrap replicates (i.e. genes or TEs were resampled with replacement 1000 times to generate an 2 8 empirical distribution of the mean). To infer the ancestral levels of genome-wide methylation across 29 species of arthropods with newly-2 produced or publicly-available methylation data (Figure 1), we obtained a time-scaled species tree 3 from TimeTree (www.timetree.org, accessed 12.03.2019). We then used a maximum-likelihood 4 approach to infer the genome-wide methylation level at all internal nodes of this tree based on the 5 levels at the tips, using the fastAnc function within phytools (Revell, 2012).

6
To infer the ancestral levels of gene-body and TE methylation for the 14 focal species, we constructed 7 a Bayesian time-scaled species tree for 14 focal species (Figure 2 &  an existing phylogenetic analysis of arthropods (Misof et al., 2014). We ran the analysis for 10 million 1 8 generations, and used TreeAnnotator (Drummond et al., 2012) to generate a maximum clade 1 9 credibility tree. We then used a maximum-likelihood approach to infer the gene-body and TE 2 0 methylation levels (separately) at all internal nodes of this tree, using the fastAnc function within 2 1 phytools (Revell, 2012).

2
To test whether genome-wide methylation levels differ between species with and without ALKB2, we branch lengths of the time-scaled (ultrametric) species tree (see above) to calculate a genetic 2 6 distance matrix, and included this in the model as a random factor. We ran the analysis for 6 million 2 7 iterations, with a burn-in of 1 million iterations and thinning of 500 generations.  To test whether variation in tissue-specific expression differs between highly-and lowly-methylated To obtain an estimate of how the correlation between the methylation levels of sites varied with 1 9 distance between the sites, we collected all pairs of sites separated by d nucleotides where d could between 140 and 200 base pairs was calculated to give the nucleosome periodicity for each species.  reflects expression levels. Raw tags were normalised using a referent power-law distribution and 2 0 expressed as normalized tags per million (TPMs). Biological replicates were highly correlated (r 2 = 2 1 0.99) and were therefore merged prior to downstream analyses using standard Bioconductor 2 2 packages (http://www.bioconductor.org/) and custom scripts.

3
CTSSs were clustered together into tag clusters, a single functional transcriptional unit, using 2 4 distance-based clustering, with the maximum distance allowed between adjacent CTSSs being 20 bp.

5
For each tag cluster, the interquantile width was calculated as the distance between CTSSs at the Collins, R. Kilner, A. Pinharanda and (   v  o  l  u  t  i  o  n  a  r  y  d  i  v  e  r  s  i  f  i  c  a  t  i  o  n  o  f  D  N  A  m  e  t  h  y  l  t  r  a  n  s  f  e  r  a  s  e  s  i  n   3   e  u  k  a  r  y  o  t  i  c  g  e  n  o  m  e  s  '  ,   M  o  l  e  c  u  l  a  r  B  i  o  l  o  g  y  a  n  d  E  v  o  l  u  t  i    Accession of each sample that was newly-sequenced in this study.