Profiling DNA methylation differences between inbred mouse strains on the Illumina Human Infinium MethylationEPIC microarray

The Illumina Infinium MethylationEPIC provides an efficient platform for profiling DNA methylation in humans at over 850,000 CpGs. Model organisms such as mice do not currently benefit from an equivalent array. Here we used this array to measure DNA methylation in mice. We defined probes targeting conserved regions and performed differential methylation analysis and compared between the array-based assay and affinity-based DNA sequencing of methyl-CpGs (MBD-seq) and reduced representation bisulfite sequencing. Mouse samples consisted of 11 liver DNA from two strains, C57BL/6J (B6) and DBA/2J (D2), that varied widely in age. Linear regression was applied to detect differential methylation. In total, 13,665 probes (1.6% of total probes) aligned to conserved CpGs. Beta-values (β-value) for these probes showed a distribution similar to that in humans. Overall, there was high concordance in methylation signal between the EPIC array and MBD-seq (Pearson correlation r = 0.70, p-value < 0.0001). However, the EPIC probes had higher quantitative sensitivity at CpGs that are hypo- (β-value < 0.3) or hypermethylated (β-value > 0.7). In terms of differential methylation, no EPIC probe detected a significant difference between age groups at a Benjamini-Hochberg threshold of 10%, and the MBD-seq performed better at detecting age-dependent change in methylation. However, the top most significant probe for age (cg13269407; uncorrected p-value = 1.8 x 10−5) is part of the clock CpGs used to estimate the human epigenetic age. For strain, 219 EPIC probes detected significant differential methylation (FDR cutoff 10%) with ~80% CpGs associated with higher methylation in D2. This higher methylation profile in D2 compared to B6 was also replicated by the MBD-seq data. To summarize, we found only a small subset of EPIC probes that target conserved sites. However, for this small subset the array provides a reliable assay of DNA methylation and can be effectively used to measure differential methylation in mice.


Introduction
There has been a surge in large-scale epigenetic studies in recent years. In particular, epigenome-wide association studies (EWAS) of DNA methylation have shown associations with physiological traits [1,2], diseases [3][4][5], environmental exposures [6,7], aging [8], and even socioeconomic [9] and emotional experiences [10]. The development of robust and reliable methylation microarrays has been an important driving force. In particular, the Illumina Human Methylation BeadChips have made it both convenient and cost-effective to incorporate an epigenetic arm to large epidemiological studies [11,12]. The latest version, the Illumina Infinium MethylationEPIC BeadChip (EPIC), provides an efficient high throughput platform to quantify methylation at 866,836 CpG sites on the human genome [13,14]. A remarkable biological insight that has emerged from these array-based studies is the definition of the methylation-based "epigenetic clock," a biomarker of human age and aging (i.e., the epigenetic clock) that is defined using specific probes represented on these arrays [8].
Currently there is no equivalent microarray platform for model organisms and work in experimental species have largely relied on high-throughput sequencing. For instance, while the human DNA methylation age can be calculated from a few hundred probes on the Illumina BeadChips, a similar effort in mice required a more extensive sequencing of the mouse methylome [15]. However, CpG islands (CGIs) are largely conserved between mice and humans and the two species share similar numbers of CGIs and similar proportions of CGIs in promoter regions of genes [16]. Considering that these CpGs and CGIs are highly conserved in gene regulatory regions, it is feasible that probes on the human microarrays that target these sites may have some application in research using rodent models. This was previously evaluated for the two older versions of the Illumina HumanMethylation BeadChips [17]. A more recent study has also evaluated the EPIC array for conserved probes [18]. These studies have shown that a subset of the probes target highly conserved sites and can be used to measure DNA methylation in mice and possibly other mammalian species.
In the present work, we extend the conservation analysis of the EPIC platform by applying a quantitative approach to evaluate the capacity of these probes to detect differential methylation in mice. We begin by defining the conserved probes and the key features of the corresponding CpG sites in the context of the larger mouse and human genomes. We also compare the methylation signal detected by the conserved probes with affinity-based methyl-CpG enriched DNA sequence (MBD-seq) data from the same samples and evaluate if the conserved probes are informative of age and strain differences in mice. Additionally, we perform comparison with a publicly available mouse CpG methylation data generated by reduced representation bisulfite sequencing (RRBS).

Defining conserved EPIC probes
Sequences for the 866,836 CpG probes were obtained from Illumina (http://www.illumina. com/). The probe sequences were aligned to the mouse genome (mm10) using bowtie2 (version 2.2.6) with standard default parameters. A total of 34,981 probes aligned to the mouse genome of varying alignment quality. Conserved probes were then defined based on quality of alignment. For this, we filtered out all sequences with a low mapping quality (MAPQ) of less than 60 (15,717 excluded) and those that contain more than two non-matching base pairs (1,092). To retain only the high-quality probes, we further filtered probes based on confidence in DNA methylation signal and based on this, 4,507 probes with detection p-values > 0.0001 were removed. This generated a list of 13,665 high quality probes that are conserved sequences and provide reliable methylation assays in mice (these are listed in S1 Data). CpG island annotations [19] for the respective genome were downloaded from UCSC Genome Browser (http:// genome.ucsc.edu) and distribution of conserved probes and positions of CGIs were plotted to the human (GRCh37) and mouse (mm10) genomes using CIRCOS [20].
For conserved sequences, there is high correspondence in functional and genomic features between mouse and human genomes and we referred to the human probe annotations provided by Illumina to define the location of conserved probes with respect to gene features and CpG context (i.e., islands, shores, shelves) (S1 Data). To evaluate if the conserved set is enriched in specific features relative to the full background set, we performed a hypergeometric test using the phyper function in R.

Animals and sample preparation
Tissues samples were derived from mice that were part of an aging cohort maintained at the University of Tennessee Health Science Center (PI: Robert W. Williams). Details on animal rearing and sample collection are described in Mozhui and Pandey 2017 [21]. All animal procedures were approved by the Institutional Animal Care and Use Committee (IACUC) at the University of Tennessee Health Science Center.
Liver tissues were collected from mice aged at~4 months (mos; young),~12 mos (mid), and~24 mos (old). The mice were of two different strains-C57BL/6J (B6) and DBA/2J (D2) -and as the colony was set up to study aging in females, the majority of the mice in this study are females ( Table 1). Mice were euthanized by intraperitoneal injection of Avertin (250 to 500 mg/kg of a 20 mg/ml solution), followed by cardiac puncture and exsanguination. All sample collection procedures were done on the same day within a 3-hour timeframe. Liver samples were snap-frozen and stored at -80˚C until use.
DNA was purified from the liver tissue using the Qiagen AllPrep kit (http://www.qiagen. com) on the QIAcube system. Nucleic acid quality was checked using a NanoDrop spectrophotometer (http://www.nanodrop.com). As reference, we also included two human samples. These are DNA derived from the buffy coats from two individuals.

DNA methylation microarray and data processing
DNA methylation assays were performed as per the standard manufacturer's protocol (http:// www.illumina.com/). In brief, 500 ng of DNA extracted from the mouse liver was treated with sodium bisulfite to convert cytosine to uracil. The 5-methyl cytosine remains unreactive to sodium bisulfite. The DNA is then hybridized to the EPIC BeadChip. After washing off unhybridized DNA, a single base extension was recorded to calculate the methylation level at the CpG probe site. DNA methylation assays were performed at the Genomic Services Lab at the HudsonAlpha Institute for Biotechnology (http://hudsonalpha.org). Raw intensity data files (idat files) for both mouse and human samples were processed using the R package Minfi [22]. The full mouse data is available from NCBI NIH Gene Expression Omnibus (GEO accession ID GSE110600). The intensity and β-values were used to evaluate the performance of the EPIC probes in mice and humans. Comparisons were based on the full set of 850K probes and the conserved set of 13,665 probes. We also used the β-values and signal intensity scores for the 13,665 probes to perform hierarchical clustering and principal component analysis for the mouse samples. From initial quality checks, we identified one outlier mouse sample (S1 Fig) that had lower intensity and higher detection p-value compared to the other mouse samples. This sample was excluded from the statistical tests.

Comparison with high-throughput sequencing data
The mouse samples we report here were previously assayed for DNA methylation using MBDseq [21]. This is an affinity-based enrichment of methylated CpGs using the methyl binding domain (MBD) of methyl-CpG-binding protein 2, followed by high throughput sequencing (MBD-seq) [23][24][25]. Sequencing was performed on Life Technologies' Ion Proton platform. Data have been deposited to the NCBI's Gene Expression Omnibus (https://www.ncbi.nlm. nih.gov/geo/; GEO accession ID GSE95361) and Sequence Repository Archive (https://www. ncbi.nlm.nih.gov/sra/; SRA accession ID SRP100703). To compare methylation signal detected by the conserved EPIC arrays, we extracted MBD-seq reads at the corresponding sites. MBDseq does not provide single-base resolution as the resolution is limited to the fragment size, in this case~300 bp. However, since methylation levels at neighboring CpGs are largely correlated [26], we derived quantitative data from the number of read fragments that map to a CpG region. For the sites in the mouse genome targeted by the conserved EPIC probes, we expanded the window to 300 bp bins, and extracted the MBD-seq fragment counts. The CpG density-normalized methylation level was then quantified using the MEDIPS R package [27]. We then used Pearson's correlation to compare the EPIC β-values and the relative methylation score (rms or the CpG density normalized methylation) detected by MBD-seq [28].
For additional comparison, we used a publicly available mouse RRBS data (GEO accession ID GSE93957, sample GSM2465617 at http://www.ncbi.nlm.nih.gov/geo/). This data was generated from liver tissue of mouse strain C57BL/6-BABR and alignment was to the GRCm38/ mm10 mouse genome build [15]. By matching genome coordinates, we identified CpGs that were interrogated by both conserved EPIC probes and the RRBS. From the RRBS data, we used the methylation percentage (counts for methylated/unmethylated) to correlate with the β-values.

Analysis of differential methylation
Statistical analyses were done in R (https://www.r-project.org/) and JMP Statistics (JMP Pro 12). Mice were grouped into three age categories (young, mid, and old; additional sample details are in Table 1). To evaluate differential methylation detected by the 13,655 conserved probes, we applied a regression model with age, strain and sex as predictors (~ageGroups + strain + sex) for each probe using the R glm function and type III anova to calculate test statistics (equations are provided in S1 Data). For the MBD-seq reads, we performed differential methylation analysis of the read counts using the edgeR R package [29]. The same linear regression model was applied (~ageGroups + strain + sex) and equations are provided in S1 Data. We then cross-compared differential methylation detected by the two methods. Treating the EPIC data as a discovery set, we applied the Benjamini-Hochberg (BH) procedure to control the false discovery rate (FDR) [30,31]. We then defined differentially methylated CpGs (DMCpGs) and evaluated the corresponding region in the MBD-seq data to test replication at a lenient uncorrected p-value threshold of 0.05. Likewise, in the reverse comparison, we applied an FDR threshold to identify differentially methylated regions (DMRs) in the MBD-seq data, and tested replication of the corresponding CpG at an uncorrected p-value threshold of 0.05.

Conserved Infinium MethylationEPIC probes
The human EPIC array contains 866,836 50-mer probes. Out of these, we defined a total of 13,665 probes that align to conserved sites in the mouse genome and provide high quality methylation signal (details on mapping quality scores and methylation signal confidence are provided in S1 Data). In the full set of EPIC probes, 71% are located within annotated gene features or within 200-1,500 bp upstream of transcription start sites (TSS). Compared to this background set, a higher percent of the conserved probes (88%; 11,972 probes) target such functionally annotated regions. Probes that target CpGs located in exons, 5' UTR, and within 200 bp upstream of TSS (TSS200) are highly overrepresented among the conserved set ( Table 2). This is expected, since sequences in these functional regions are conserved across species. The upstream regulatory regions and the first exon harbor a large percent of CGIs, and compared to the background set, there is close to a 2.5-fold higher enrichment in CGIs among the conserved probes ( Table 2). In contrast, there is no enrichment in probes that target CpGs that are between 200-1,500 bp upstream of TSS (TSS1500), gene body (mostly intronic), 3' UTRs, and non-genic regions. Locations of the conserved probes and CGI densities in the human and mouse genomes are shown in Fig 1.

Comparison of probe performance in mouse and human samples
We used data generated from two human samples as a reference. Using the full set of 850K probes, the mouse samples showed low overall signal intensity (Fig 2A). The mean signal intensity for the two human samples was 9,118 ± 2,192 ( Table 1). For the mouse samples, the mean signal intensity was 810 ± 114 ( Table 1). The β-value distribution also showed poor performance for mice with a peak β-value at 0.4 that indicates failure for probes. The methylation β-values in human samples showed the expected bimodal distribution that characterizes the Illumina methylation arrays (Fig 2B) [13,14]. The EPIC BeadChip clearly performed poorly in mice when we considered the full set of probes. However, when we considered only the 13,665 conserved probes, the methylation signal became comparable between the mouse and human samples. Total mean signal intensity for the mouse samples ranged from 9,866 to 11,545 (Mouse1, which failed the initial QC, has very low signal intensity compared to the other mouse samples; this was excluded from differential methylation analysis) ( Table 1). Mean signal intensity for the two human samples were 8,711 and 11,761 ( Table 1). The bimodal β distribution was also observed for this set of conserved probes in mouse samples (Fig 2C and 2D).

Comparison with MBD-seq and RRBS
To determine if we could find a concordant methylation signal, we compared the microarray β-values with the CpG density-normalized rms derived from MBD-seq data (average β-values and rms are provided in S1 Data). As in the case of the EPIC, the MBD-seq data also showed   Correlation between MethylationEPIC and high-throughput sequencing data. Two sequencing datasets were considered for comparison. The MBD-seq was generated in-house from the same samples as the microarray data, and the mouse liver reduced representation bisulfite sequencing (RRBS) data was from a public repository. For each of the 13,665 conserved probes, the 300 bp window around the corresponding CpG was determined and the CpG density-normalized relative methylation score (rms) was estimated for that region from the MBD-seq data. (A) While there was overall significant correlation between the β-values and rms (Pearson's correlation of 0.70, p < 0.0001), for CpGs with low β-values, the corresponding regions showed rms that cluster close to 0, and for CpG with high β-values, the corresponding rms tended to cluster close to 0.75. (B) For the RRBS data, 2,548 CpGs intersected with the list of 13,665 CpGs interrogated by the conserved EPIC probes. For most CpGs, methylation levels were concordant between the RRBS and EPIC (R = 0.78, p < 0.0001). However, sequence coverage was low for several CpGs in the RRBS data (average total read counts of <7) and such CpGs were associated with estimated methylation fraction of 0 or 1. the bimodal distribution. Overall, there was concordance between the two technologies, and the β-values and rms were significantly correlated (Pearson's correlation of 0.70, p < 0.0001; Fig 3A). However, several CpGs also showed discrepant signal between the two technologies (i.e., CpGs with low β-values associated with high rms and vice versa). To assess the number of probes associated with concordant or discordant methylation levels, we grouped the CpGs into three categories based on β-values-hypomethylated for β < 0.3, hemimethylated for 0.3 β 0.7, and hypermethylated for β > 0.7-and examined the corresponding rms values. Given the high representation of islands and CpGs in 5' regions of genes, which generally remain hypomethylated [16,32], the majority of the conserved probes fell into the hypomethylated category ( Table 3). For the hypomethylated probes, 82% of the corresponding CpG regions also had rms < 0.3 ( Table 3). For many of the CpGs regions that correspond to the hypomethylated probes, the rms were close to 0, which indicates poor coverage by MBD-seq. For hemimethylated probes, 58% of the corresponding regions had 0.3 rms 0.7 and 31% had rms < 0.3. For hypermethylated probes, only 40% of corresponding regions were associated with rms > 0.7, and 54% had 0.3 rms .7. The corresponding CpG regions for this hypermethylated set tended to have rms close to 0.75. This clustered rms distribution for CpG regions at the lower and upper levels of methylation indicate that the MBD-seq has lower quantitative sensitivity at these regions.
For comparison with a bisulfite-based assay, we obtained RRBS data for mouse liver. We found only 2,548 CpGs in the RRBS data that were also interrogated by the conserved EPIC probes. Using this smaller subset, we compared the array-based β-values with the RRBS-based methylation percent (here represented as fraction methylated; Fig 3B). Overall, there is significant correlation between the two data (R = 0.78, p < 0.0001) and this is particularly true for CpGs with methylation percent that range between 0 and 100 in the RRBS. However, comparison with the RRBS was limited by the poor read coverage for several of the CpGs that resulted in either 0% or 100% methylation values. For these CpGs that clustered at either 0 or 100%, the average total read counts was less than 7. For this 2,548 CpGs, we found a better linear correlation between the β-values and the MBD-seq (R = 0.84, p < 0.0001; Fig 3C). Overall, the significant correlations with both the MBD-seq and RRBS shows that the conserved EPIC probes provide a reliable quantification of methylation in mice for majority of the CpGs. Furthermore, for CpGs that are hypomethylated or hypermethylated, the EPIC technology may have an advantage and provide higher quantitative sensitivity compared to the. MBD-seq.

Differential methylation analysis
We applied linear regression to examine differential methylation by age group and strain, and cross-referenced the DMCpGs detected by the EPIC array with DMRs detected by MBD-seq. For the effect of age, no conserved EPIC probe passed a 10% FDR threshold (full results and p- values are provided in S1 Data). However, we note that the probe that detected the most significant effect of age, cg13269407, is among the 353 CpGs that are used to estimate the human epigenetic age [8]. This CpG is hemimethylated (average β-value of 0.55) and associated with ã 2.4-fold decline in methylation between young and old age (uncorrected p-value = 1.8 x 10 −5 ). In the MBD-seq, the corresponding region is classified as hypomethylated with rms = 0 for most of the samples and no reliable statistics could be carried out for this region due to small number of mapped reads. We then performed a reverse comparison to identify agedependent DMRs (age-DMRs) in the MBD-seq data and evaluated replication by the EPIC probes. At the same FDR threshold of 10%, the MBD-seq detected seven age-DMRs. These strong age-DMRs have rms between 0.3 and 0.7 and are associated with an increase in methylation with age. Most occur in CGIs that have been reported previously [21]. Out of these seven age-DMRs, six corresponding EPIC probes replicated the age-dependent increase in methylation at a nominal p-value cutoff of 0.05 ( Table 4).
For strain effect, 219 conserved EPIC probes detected a significant difference in methylation between B6 and D2 at an FDR threshold of 10% (strain-DMCpGs). Close to 80% of these CpGs (175 out of 219) are associated with higher methylation in D2 relative to B6. In the MBD-seq data, only 29 of the 219 corresponding regions replicated strain effect at an uncorrected p-value cutoff 0.05 ( Table 5). Of these, 9 were associated with higher methylation in B6, and 20 were associated with higher methylation in D2. In the reverse comparison, we identified only 37 strain-dependent DMRs (strain-DMRs) at an FDR cutoff of 10%. Consistent with the EPIC data, the majority of these regions (21 of the 37) showed higher methylation in D2 relative to B6. Of these, 16 strain differences were replicated at the corresponding CpG in the EPIC data (6 with higher methylation in B6 and 10 with higher methylation in D2) ( Table 5).

Discussion
We used the recently released Illumina EPIC microarray to assay DNA methylation at conserved CpGs in the mouse genome. We evaluated both the qualitative features as well as the quantitative performance and compared it with MBD-seq data that was generated on the same DNA samples from mice. Such a cross-species approach has been previously used to examine gene expression and perform comparative genomics studies [33][34][35][36]. The two older versions Cross-species utility of human DNA methylation microarray These are strain-dependent differentially methylated CpGs (EPIC microarray) and CpG regions (MBD-seq) based on a "false discovery threshold" (FDR) cutoff of 10% and replication at an uncorrected p-value threshold of 0.05. 2 Coef. is the linear regression coefficient (i.e., difference in methylation relative to C57BL/6J; negative is lower methylation in DBA/2J; and positive is higher methylation in DBA/2J compared to C57BL/6J). LogFC is log 2 fold difference in methylation (i.e., difference in methylation relative to DBA/2J; negative is lower methylation in DBA/2J; and positive is higher methylation in DBA/2J compared to C57BL/6J). Cross-species utility of human DNA methylation microarray of this Illumina methylation microarrays, the Infinium HumanMethylation 27K (HM27) and HumanMethylation 450K (HM450), have been carefully evaluated for use in mice [17]. The number of probes that map to the mouse genome can vary somewhat depending on the alignment algorithm. In the work by Wong et al. [17], alignment to the bisulfite-converted mouse genome resulted in the highest number of conserved probes. Using a stringent parameter of 100% sequence identity to the bisulfite genome, Wong et al. identified a total of 1,308 (4.7% of total) uniquely aligned probes in the 27K array, and 13,715 (2.8% of total) uniquely aligned probes in the 450K array that can be used to interrogate conserved CpGs in the mouse. In our present work, we performed alignment in a non-bisulfite space. While we required unique alignment, we tolerated up to two non-matching base pairs and added detection confidence as another parameter to identify probes that we can use for reliable quantitative assays. With these parameters, we identified 1.6% of total probes (13,665 in the EPIC array) that aligned uniquely to the mouse genome and associated with high confidence in signal detection. While alignment to the bisulfite-converted genome may have yielded a higher number of probes for measuring DNA methylation in mouse, the degenerate nature of bisulfite conversion would capture probes with off-target degenerate alignments. Indeed, a recent study did find this with a number of uniquely aligned probes ranging from 4,984 to 19,420 depending on the mapping stringency [18]. When we compared our list of probes to the 19,420 mouse EPIC probes reported by Needhamsen et al. [18], we found a high overlap of 77%. For our purposes, the probes we have identified here provide a representative subset with high confidence in sequence specificity and conservation in mouse and we have used these to assess quantitative performance in mouse samples and utility in detecting methylation variation.
In the set of 13,665 conserved probes, 9,429 (69%) were CpG loci carried over from the HM450 array and 7,483 of these were also in Wong's list of conserved HM450 probes [17]. Only 4,234 of the 13,665 probes (31%) were new content that are unique to the EPIC array (i.e., not ported over from the HM450). A similar proportion of probes in the set reported by Needhamsen et al. was also ported over from the HM450 (13,005 out of 19,420) [18]. This low proportion of conserved probes is likely due to the design of the EPIC array. In the case of the HM450, the emphasis was on CGIs and flanking regions (i.e., shores and shelves) [11,37]. These CGIs generally overlap proximal regulatory sites and are highly conserved across mammalian species with humans and mice having very similar complement of CGIs [16,32,38]. In contrast to the HM450, the emphasis of the EPIC array was on enhancers and CpGs outside of islands, and these are sequences that generally have lower conservation across mammalian species [13,39]. Based on Illumina probe annotations, only 22% of the newly added content unique to the EPIC cover CGI associated regions. Out of the 4,234 conserved probes we identified that are unique to EPIC, 1,767 target CGI associated regions and 1,573 target 5' regions such as TSS, 5' UTR and exon 1. This is consistent with the overall higher enrichment in proximal gene regulatory sites among the conserved probes. Our observations show that despite the higher probe content in the EPIC compared to the older HM450, the number of probes with utility in cross-species studies is not proportionally increased.
In terms of quantitative variation in methylation, CGIs and promoter region CpGs show significant population variation [40]. However, compared to intergenic CpGs, the extent of inter-individual variability in methylation is reported to be much lower in these conserved sites [41,42]. Hence, an obvious limitation in using the conserved EPIC probes is that we attain only a narrow perspective of the mouse methylome and we may be sampling the portion of CpGs that shows the least quantitative variability in a population. Nonetheless, CpGs in regulatory regions and CGIs play crucial roles in development and cell differentiation, and are implicated in tumor development and aging [16,32,38,43,44]. While narrow in perspective, the conserved probes likely represent a subset of CpGs with high functional relevance and application in cross-species study of DNA methylation.
We compared the methylation signal detected by the EPIC probes to two datasets generated by DNA sequencing-MBD-seq that was measured using the same samples as the EPIC, and a publicly available mouse liver RRBS data. While the sequencing approach theoretically provides more comprehensive coverage than microarrays, both RRBS and MBD-seq come with their own characteristic biases and have limitations in the types of CpGs that are most effectively interrogated [26,45,46]. The RRBS data we used in this comparison had poor read coverage for a large proportion of CpGs. Nonetheless, an overall significant correlation between the β-values and methylation percent measured by RRBS was observed. Unlike the bisulfite based RRBS and EPIC, the MBD-seq relies on affinity capture of DNA fragments by the methyl-CpG binding domain protein [23][24][25]. The methylation level is indirectly estimated based on the counts of sequenced reads that map to that region and the MBD-seq provides information on the methylation level of correlated CpGs in a region rather than one CpG [47][48][49]. We found a stronger concordance between the EPIC and MBD-seq, likely because these were generated from matched samples. However, for CpGs that are hypomethylated and hypermethylated, the rms for the corresponding regions showed a more clustered distribution and indicated a limited quantitative sensitivity for MBD-seq and limited capacity in discerning quantitative variation at such CpG regions. Our observations agree with a previous study that compared HM450 and MBD-seq data generated using the same commercial kit we used [50].
For a direct comparison between the EPIC probes and MBD-seq, we applied the same regression model and crosschecked the DMCpGs and DMRs detected by the two technologies. While we expected a higher quantitative sensitivity for the EPIC probes, the EPIC probes did not detect significant differential methylation between age groups at an FDR threshold of 10%. However, the topmost significant probe, cg13269407, is part of the 353 clock CpGs that are used to estimate the human DNA methylation age [8]. Consistent with the negative correlation with age in humans, this age-informative CpG was associated with a~2.4-fold reduction in methylation in the old mice relative to the young mice. Aside from cg13269407, only 10 other human clock CpG probes were in the conserved set and none of these are associated with age in mice. Overall, the effect of age was weak when we considered individual CpGs. When we examined the corresponding CpG regions, the MBD-seq was more effective at detecting agedependent methylation. At an FDR cutoff of 10%, we identified seven CpG regions that are classified as age-DMRs. These age-DMRs have been previously reported and show increases in methylation with age in mice [21]. For these age-DMRs identified by MBD-seq, we then checked whether the EPIC probes could verify the age effect. For this cross-verification, we used a less stringent statistical threshold of 0.05 for uncorrected p-values and found that six of the targeted CpGs are also associated with a significant age-dependent increases in β-values. Our observations suggest that age-dependent changes in methylation at these conserved sites may be more pronounced if we consider the correlated change of neighboring CpGs rather than methylation status of a single CpG. Despite the low overall quantitative sensitivity, the MBD-seq provides a complementary approach that may perform better for detecting methylation changes in regions harboring multiple correlated CpGs.
DNA methylation can vary substantially between mouse strains and a large fraction of this is likely due to underlying sequence differences between strains [21,51,52]. Strain variation in methylation has been shown to associate with complex phenotypes in mice such as insulin resistance, adiposity, and blood cell counts [53]. In our analysis, we detected 219 CpGs (i.e., 1.6% of the 13,365 interrogated CpGs) with a significant difference between strains at an FDR cutoff of 10%. A large majority (175 out of 219 CpGs) was associated with higher methylation in D2 compared to B6. While the overall lower methylation in B6 is intriguing, such variation between strains must be cautiously interpreted. It is well known that SNPs in probe sequences can have a strong confounding effect. This is particularly pernicious for mouse specific microarrays in which probe sequences are usually based on the B6 mouse reference, and as a result, there is more efficient hybridization for B6-derived samples, which results in a positive bias for this canonical mouse strain [54][55][56]. In the present work, since the EPIC array is based on the human sequence, we do not expect a systematic bias for one strain over the other. For replication, we referred to the MBD-seq data and only 29 out of the 219 corresponding CpG regions had consistent differential methylation between B6 and D2 in the MBD-seq.
Unlike using a human array that should not bias one mouse strain over another, the MBDseq data is more vulnerable to technical artifacts caused by sequence differences. As is the general practice, we performed the alignment of the MBD-seq reads to the mouse reference genome. This means the alignment will be more efficient for sequences from B6, while sequences from D2 will have more mismatches. Since methylation quantification is estimated from the relative number of aligned reads, this may result in a systematic negative bias for D2, and methylation levels in regions with sequence differences will tend to have lower methylation due to poorer alignment. As a result, a higher fraction of strain-DMR will have lower methylation in D2 compared to B6 [21]. In the case that these conserved CpGs have higher methylation in D2 compared to B6, then the negative bias will lessen the quantitative difference between the strains. This may explain why the effect of strain is less pronounced in the MBD-seq data. In the MBD-seq, there were only 37 DMRs between B6 and D2 at an FDR threshold of 10%, and the EPIC probes replicated 16 of these. Out of the 37 strain-DMRs, the majority (21 of the 37) was associated with higher methylation in D2. Both the EPIC and MBD-seq therefore show an overall lower methylation profile in B6 compared to D2 that warrants further investigation and verification. Such strain differences in overall methylation have been previously reported for A/J and WSB/EiJ, with the A/J strain exhibiting higher methylation of CGIs in normal liver tissue compared to WSB/EiJ. This difference in the methylome was suggested to contribute to differential susceptibility for nonalcoholic fatty liver disease that characterizes the two strains [51]. In the case of B6 and D2, the two strains are highly divergent in a number of complex phenotypes ranging from behavioral and physiological to aging traits. The panel of recombinant inbred progeny derived from B6 and D2 (the BXD panel) has been used extensively in genetic research [57][58][59][60][61]. If there is indeed a distinct profile in DNA methylation between B6 and D2, then it will be of interest to evaluate if it segregates in the BXDs and how the methylome contributes to some of the phenotypic differences. The BXD panel could be an extremely rich and as yet untapped resource for methylome-wide analysis of complex traits that can then be integrated with the extensive systems genetics work that has already been done with this mouse family [62,63]. No doubt, large-scale analysis of genome-wide DNA methylation in mouse genetic reference panels will be greatly accelerated with the development of a mouse version of the Infinium methylation arrays. And as is the case with other types of arrays, it will be crucial that the probes are designed against a more diverse panel of strains so that investigators can derive a more unbiased readout of methylation [64].
To conclude, we have catalogued a small subset of EPIC probes that target conserved CpGs in the mouse genome and that provide reliable quantification of DNA methylation in mouse samples. While detection for age-dependent methylation was weaker for the EPIC probes compared to MBD-seq, we have identified significant strain variation in methylation at the conserved CpGs. Our results indicate lower methylation for B6 compared to D2 at sites that have significant strain effect. It is unclear how much of the strain variation results from underlying sequence differences between B6 and D2, and this strain-specific profile needs to be further evaluated and verified Supporting information S1 Data. Information on 13665 conserved EPIC probes and results on differential methylation analysis (EPIC vs MBD-seq). (XLSX) S1 Fig. Quality check for samples using β-values. (A) Hierarchical clustering of β-values for the 13665 conserve probes on the Illumina Infinium MethylationEPIC shows a clear separation between the mouse and human samples. For the mouse samples, DBA/2J and C57BL/6J samples group separately. Mouse sample 1 (M1) is an outlier and has low average signal intensity compared to the other mouse samples. (B) Principal component analysis was performed for the 11 mouse samples using β-values for the 13665 conserve probes. A scatter plot of the first two principal components, PC1 and PC2, clearly demonstrate the outlier status of mouse sample 1 (arrow) and this sample was excluded in the differential methylation analyses. (DOCX)