DNA methylation is vital for many essential biological processes and human diseases. Illumina Infinium HumanMethylation450 Beadchip is a recently developed platform studying genome-wide DNA methylation state on more than 480,000 CpG sites and a few CHG sites with high data quality. To analyze the data of this promising platform, we developed FastDMA which can be used to identify significantly differentially methylated probes. Besides single probe analysis, FastDMA can also do region-based analysis for identifying the differentially methylated region (DMRs). A uniformed statistical model, analysis of covariance (ANCOVA), is used to achieve all the analyses in FastDMA. We apply FastDMA on three large-scale DNA methylation datasets from The Cancer Genome Atlas (TCGA) and find many differentially methylated genomic sites in different types of cancer. On the testing datasets, FastDMA shows much higher computational efficiency than current tools. FastDMA can benefit the data analyses of large-scale DNA methylation studies with an integrative pipeline and a high computational efficiency. The software is freely available via http://bioinfo.au.tsinghua.edu.cn/software/fastdma/.
Citation: Wu D, Gu J, Zhang MQ (2013) FastDMA: An Infinium HumanMethylation450 Beadchip Analyzer. PLoS ONE 8(9): e74275. doi:10.1371/journal.pone.0074275
Editor: Kai Wang, University of Southern California, United States of America
Received: March 29, 2013; Accepted: July 30, 2013; Published: September 5, 2013
Copyright: © 2013 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work is supported by National Basic Research Program of China 2012CB316503 and National Natural Science Foundation of China Nos. 91019016, 31061160497 and 61005040. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
DNA methylation is an important type of epigenetic modification in the human genome. The function of DNA methylation is traditionally described as silencing mark . Recent studies suggest DNA methylation plays more sophisticated roles in chromatin structure and transcriptional regulation . DNA methylation is highly dynamic but under strict control during development , . While in human diseases, especially in cancer, the DNA methylation states are usually significantly disrupted and those changes are strongly associated with cancer hallmarks –.
There are four popular techniques for detecting genome-wide DNA methylation state: whole genome bisulphite sequencing , methylation array , reduced representation bisulfite sequencing  and enrichment based method . Infinium HumanMethylation450 Beadchip is a recently developed methylation array platform which detects more than 480,000 cytosine sites along the entire human genome. It covers the majority of reference genes and shows high data reproducibility between technical replicates. Because of the lower cost and easier experimental protocol, this platform is suggested to be suitable for large-scale studies . The Cancer Genome Atlas (TCGA) program, which aims to providing a comprehensive molecular portraits of all types of cancer, has already used this platform to profile the DNA methylation states of hundreds of clinical samples , .
Here we propose our recently developed software, FastDMA, to help researchers to analyze the data generated by this platform, especially for large or clinical datasets. FastDMA uses a unified statistical model, analysis of covariance (ANCOVA), to do both single probe analysis and differentially methylated region (DMR) scanning. Technically, FastDMA is implemented as a standalone software in C++ which can be easily distributed and further developed. Besides, by using parallel computing technique, it can deal with large-scale datasets with very high computational efficiency.
This article is organized as follows: in the Methods section, we described the workflow, the computational model and the software implementation of FastDMA. In the Results section, we first applied FastDMA on three large-scale datasets from TCGA of breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). And then, we compared FastDMA with a recently published software IMA  for both the correctness and the computational performance. Finally, we discussed the major advantages of the software and its limitations waiting for further developments.
The workflow of FastDMA is shown in Figure 1. FastDMA can do data normalization and support both single probe and region based data analyses. The input of FastDMA is usually a data matrix, processed from the original fluorescence signal, containing the beta value (a value indicating the DNA methylation level, between 0~1) and the detection pvalue (a value indicating whether the signal is believable or not) of each probe on the beadchip. For the outputs, FastDMA generated human-readable table-limited text files and the formatted BED files compatible for UCSC Genome Browser visualization.
Except for the case-control comparison, multiple-group comparison is usually required. For example, it is required to compare the DNA methylation levels in three groups if we want to identify the differentially methylated sites among normal, primary tumor, and metastasis samples. Besides, other clinical co-variables (such as sex, age, etc.) rather than the group labels should also be considered. To deal with these issues in a unified framework, FastDMA used ANCOVA as the main statistical model, a generalized linear model combining ANOVA and regression.
For clearly presenting the statistical model, we denoted some variables as follows. The dataset contains n groups, Gi, (1≤i≤n), to be compared. For i-th group Gi, there are mi samples, Sij (1≤j≤mi). The k-th probe of sample Sij has a methylation level (as beta-value) βijk. Clinical information (such as sex, age, etc.) is treated as the co-variable: Cijl stands for the value of the l-th co-variable of sample Sij. Gaussian noise is assumed and denoted as e. FastDMA rejects the data points with detection pvalues greater than 0.05 (unreliable data points).
The normalization task assumes that all the studied samples have the same overall beta value distribution. FastDMA uses quantile normalization : first, calculate the quantiles of beta values (logit transformed) in all the samples; then, calculate the averages of the quantiles; and finally, piecewise linear transformation is applied on each beta value such that every sample has the same quantiles as the average. Users can do normalization with any other methods by themselves and skip this step.
Single Probe Analysis
The first task is to identify the significantly differentially methylated probes. The null hypothesis is that if a probe is not differentially methylated, the beta values of that probe have the same mean values among all the groups. Otherwise, different means among the groups are required. ANCOVA is used in this task to compare two linear regression models that one assumes an overall mean across all the studied groups while the other assumes different means. For the k-th probe of sample Sij belonging to group Gi, we have:
Cijl stands for the value of the l-th co-variable of sample Sij. The middle terms in both equations are the effect of the co-variables. It is assumed to be linear.
ANCOVA compares these two linear regression models and gives a pvalue indicating to what extent one can reject the null hypothesis. Then, false discovery rate (FDR) is calculated according to Benjamini–Hochberg (BH) procedure. The pvalue and FDR thresholds to call significantly differentially methylated probes can be manually adjusted by users. To facilitate the following functional analysis, the probes are annotated (such as nearby genes) using a well-curated database IlluminaHumanMethylation450k.db (Tim Triche and Jr. R package version 1.4.6.).
DMR Analysis in Predefined Regions
Single differentially methylated probe may be hard to be interpreted biologically. FastDMA provides region-based analysis, which can provide additional information. For a predefined region (such as a promoter or a CpG island) containing several probes, if the region is uniformly methylated among all the groups, then it is assumed that the beta values of every probe in that region are distributed with the same means independent with their group labels. Otherwise, the region is called to be differentially methylated. For the n-th region, there are r probes located in it denoted as P1, P2,…, Pr. The beta value of the probe Pk of the sample Sij is denoted as to emphasize that the probe belongs to the n-th region.
Similar to the comparison of the two models in the single probe analysis, ANCOVA can compare the above two series of models and calculate the pvalues to evaluate their differences fitting the same data. And also, FDR is calculated using BH procedure.
DMR Analysis in Arbitrary Regions
In the second task, users should predefine the regions. FastDMA can also scan all the regions covered by probes in the whole genome. The idea to find arbitrary DMRs is to use a sliding window across the genome and test whether each sliding window is a DMR using the above statistical model. Then the overlapped DMRs are merged to give the final results. The default window size is set as 200 bp and only the windows with more than 2 probes are used in the analysis.
It is time-consuming to fit the models in ANCOVA of all the sliding windows across the whole genome. FastDMA provides another intuitive way to deal with this task: it directly uses the FDRs of the probes in a sliding window from the single probe analysis and simply calculates their geometric mean as the score of that region. Windows with this score less than 0.05 are treated as DMRs and the overlapped DMRs are combined as the final results.
The computerized procedures are coded in C++. An additional script for preprocessing the format of the data is coded in Python. The software follows the GNU general public license version 3 for academic use and can be downloaded via the site http://bioinfo.au.tsinghua.edu.cn/software/fastdma. Both a simplified procedure and the advanced user manual to install and run FastDMA are provided on the website.
Datasets and Experiments
We applied FastDMA on three large-scale Infinium HumanMethylation450 Beadchip datasets from TCGA of BRCA, LUAD, and PRAD. The numbers of samples for the datasets are 120, 69, and 225 respectively.
Many Probes are Found Differentially Methylated in Tumor
FastDMA was used to calculate the pvalue and the absolute difference of the differential DNA methylation level of each probe. The results of the overall statistics are shown as volcano plots in figure 2. The top 10 tumor-hypermethylated probes are summarized in table 1. As we can see in figure 2, many probes are significantly differentially methylated. We set the threshold as pvalue<0.01 and FDR<0.05. FastDMA identifies about half of the probes as differentially methylated ones in BRCA and PRAD datasets. Although less than those two, the LUAD dataset still has more than one third of the probes identified as differentially methylated ones (detailed numbers are shown in table 2). These numbers agree with a recently published study on liver hepatocellular carcinoma . These results imply that the cancer genome is globally changed in the consideration of DNA methylation state.
Panel A, B, and C shows the result on BRCA, LUAD and PRAD, respectively. Each dot in the plot represents a beadchip probe. The vertical axis indicates the log-transformed pvalue of that probe calculated by ANCOVA and the horizontal axis indicates the mean difference between the methylation levels in tumor and normal samples (tumor - normal).
To illustrate the support for multiple-group and clinical co-variable, we constructed a small-scale testing dataset including the array data of four BRCA metastatic tumor samples and randomly chosen size-matched primary tumor samples and normal samples from TCGA. The age at initial pathologic diagnosis is treated as the co-variable. We ran FastDMA to find the differentially methylated sites across normal, tumor and metastatic groups. With the threshold pvalue<0.01 and FDR<0.05, 29 probes are identified as differentially methylated (this small number is mainly due to too few available samples).
DMR Analysis Helps to Find Genomic Regions for Further Study
The DMR scanning can be done either in predefined regions or in arbitrary regions across the whole genome using the same computational model. Here, we only reported the results by scanning arbitrary regions. The statistical significance thresholds were set the same as the single probe analysis (pvalue <0.01 and FDR <0.05). Under such settings, FastDMA founds 60877, 48930, and 56179 DMR candidates (mainly spanning several hundred base pairs) in the BRCA, LUAD, PRAD datasets respectively. Table 3 shows the state (tumor-hypermethylated or tumor-hypomethylated) and location (CpG island or non-CGI) of these DMR candidates. Figure 3 is a collection of IGV  screenshots showing several cases. All these cases are hypermethylated in cancer and lie in the 5′ region of a nearby gene (promoter region). These regions contain dense distribution of many differentially methylated probes and therefore can be interesting candidates for further functional or biomarker analysis.
All genomic regions shown in this figure are hypermethylated in cancer and lie in the 5′ region of a nearby gene. The green (normal) and red (tumor) vertical bars represent the methylation levels of the probes located in each region. The cyan horizontal bars show the differentially methylated regions identified by FastDMA. A, B are from BRCA dataset; C, D are from LUAD dataset; E, F are from PRAD dataset.
To validate our results, we randomly selected 20 candidates in the LUAD datasets and then perform validation according to another statistical framework suggested by Jaffe et al . All the 20 candidates have strong statistical significance (with pvalue less than 0.001).
FastDMA Finds Essentially the Same Differentially Methylated Probes as IMA does but Performs with much Higher Computational Efficiency
We compared FastDMA and IMA (under Student’s t-test mode) on the three cancer DNA methylation with single probe analysis. The thresholds for identifying differentially probes were set as pvalue <0.01 and FDR <0.05. Table 2 shows the numbers of identified probes by IMA and FastDMA under different contexts. It is shown that more than 98% identified probes are overlapped in FastDMA and IMA, which indicates the correctness.
To compare the computational efficiency, we ran them on a set of testing datasets (generated from real data of TCGA) with different data sizes varying from 40 to 220. FastDMA can work in parallel manner, we ran FastDMA twice using either single thread or four parallel threads. To measure the time consumption exactly, we repeatedly ran the software three times on every testing dataset and calculated the average time consumption with standard deviation. As indicated in figure 4, FastDMA works with much higher efficiency.
FastDMA was tested under both the single-thread and four-parallel-thread mode. A) The comparison on the average time consumption (calculated by repeatedly running the software three times). B) The comparison on the memory consumption.
As a promising platform to detect genome-wide DNA methylation state, Infinium HumanMethylation450 Beadchip begins to be widely used with relatively lower cost. FastDMA provides an integrative pipeline and a high computational efficiency solution for analyzing this kind of data. This software can benefit the community of DNA methylation, especially for the large-scale clinical study. FastDMA uses general input format. Except 450 K Beadchip, it can also be used to analyze the data generated from other platforms, such as the previous Infinium HumanMethylation27 Beadchip and the possibly future Infinium beadchips with the same technique.
It should be noted that FastDMA is not elaborately designed for the data normalization. Users can use any other sophisticated normalization procedure (such as in ref. ) rather than the simply quantile normalization if this step is crucial for the following data analysis. Serious concerns about DNA methylation array are raised that the platform may contain potential technical artifacts . To solve this problem, we design a command line parameter to filter out probes with genomic variants or co-hybridizing according to a recent work . Also, any biological interpretations for the differential methylated probes should be made carefully.
FastDMA can smoothly run under major Linux/Unix systems using command line. For easier visualization of the results, the software formats the outputs as widely-used BED files that can be directly loaded into any genome browser tools. In the future, we will develop a graphic user interface for boarder users with less computer background.
Conceived and designed the experiments: JG MZ. Performed the experiments: DW. Analyzed the data: DW. Wrote the paper: DW JG.
- 1. Holliday R, Pugh JE (1975) DNA modification mechanisms and gene activity during development. Science 187: 226–232. doi: 10.1126/science.1111098
- 2. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 13: 484–492 doi:10.1038/nrg3230.
- 3. Numata S, Ye T, Hyde TM, Guitart-Navarro X, Tao R, et al. (2012) DNA methylation signatures in development and aging of the human prefrontal cortex. Am J Hum Genet 90: 260–272 doi:10.1016/j.ajhg.2011.12.020.
- 4. Shearstone JR, Pop R, Bock C, Boyle P, Meissner A, et al. (2011) Global DNA demethylation during mouse erythropoiesis in vivo. Science 334: 799–802 doi:10.1126/science.1207306.
- 5. Kriaucionis S, Heintz N (2009) The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324: 929–930 doi:10.1126/science.1169786.
- 6. Suzuki H, Toyota M, Carraway H, Caraway H, Gabrielson E, et al. (2008) Frequent epigenetic inactivation of Wnt antagonist genes in breast cancer. Br J Cancer 98: 1147–1156 doi:10.1038/sj.bjc.6604259.
- 7. Easwaran HP, Van Neste L, Cope L, Sen S, Mohammad HP, et al. (2010) Aberrant silencing of cancer-related genes by CpG hypermethylation occurs independently of their spatial organization in the nucleus. Cancer Res 70: 8015–8024 doi:––10.1158/0008–5472.CAN-10–0765.
- 8. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, et al. (2008) Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133: 523–536 doi:10.1016/j.cell.2008.03.029.
- 9. Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, et al. (2009) Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics 1: 177–200 doi:10.2217/epi.09.14.
- 10. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, et al. (2005) Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 33: 5868–5877 doi:10.1093/nar/gki901.
- 11. Serre D, Lee BH, Ting AH (2010) MBD-isolated Genome Sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res 38: 391–399 doi:10.1093/nar/gkp992.
- 12. Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, et al. (2011) Evaluation of the Infinium Methylation 450 K technology. Epigenomics 3: 771–784 doi:10.2217/epi.11.105.
- 13. Fackler MJ, Umbricht CB, Williams D, Argani P, Cruz L-A, et al. (2011) Genome-wide methylation analysis identifies genes specific to breast cancer hormone receptor status and risk of recurrence. Cancer Res 71: 6195–6207 doi:––10.1158/0008–5472.CAN-11–1630.
- 14. Zhuang J, Jones A, Lee S-H, Ng E, Fiegl H, et al. (2012) The dynamics and prognostic potential of DNA methylation changes at stem cell gene loci in women’s cancer. PLoS Genet 8: e1002517 doi:10.1371/journal.pgen.1002517.
- 15. Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, et al. (2012) IMA: an R package for high-throughput analysis of Illumina’s 450 K Infinium methylation data. Bioinformatics 28: 729–730 doi:10.1093/bioinformatics/bts013.
- 16. Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185–193. doi: 10.1093/bioinformatics/19.2.185
- 17. Shen J, Wang S, Zhang Y-J, Wu H-C, Kibriya MG, et al.. (2012) Exploring genome-wide DNA methylation profiles altered in hepatocellular carcinoma using Infinium HumanMethylation 450 BeadChips. Epigenetics 8.
- 18. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, et al. (2011) Integrative genomics viewer. Nat Biotechnol 29: 24–26 doi:10.1038/nbt.1754.
- 19. Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, et al. (2012) Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol 41: 200–209 doi:10.1093/ije/dyr238.
- 20. Maksimovic J, Gordon L, Oshlack A (2012) SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips. Genome Biol 13: R44 doi:10.1186/gb-2012-13-6-r44.
- 21. Blair JD, Price EM (2012) Illuminating potential technical artifacts of DNA-methylation array probes. Am J Hum Genet 91: 760–762 doi:10.1016/j.ajhg.2012.05.028.
- 22. Chen Y, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, et al. (2013) Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics 8: 203–209 doi:10.4161/epi.23470.