FastDMA: An Infinium HumanMethylation450 Beadchip Analyzer

DNA methylation is vital for many essential biological processes and human diseases. Illumina Infinium HumanMethylation450 Beadchip is a recently developed platform studying genome-wide DNA methylation state on more than 480,000 CpG sites and a few CHG sites with high data quality. To analyze the data of this promising platform, we developed FastDMA which can be used to identify significantly differentially methylated probes. Besides single probe analysis, FastDMA can also do region-based analysis for identifying the differentially methylated region (DMRs). A uniformed statistical model, analysis of covariance (ANCOVA), is used to achieve all the analyses in FastDMA. We apply FastDMA on three large-scale DNA methylation datasets from The Cancer Genome Atlas (TCGA) and find many differentially methylated genomic sites in different types of cancer. On the testing datasets, FastDMA shows much higher computational efficiency than current tools. FastDMA can benefit the data analyses of large-scale DNA methylation studies with an integrative pipeline and a high computational efficiency. The software is freely available via http://bioinfo.au.tsinghua.edu.cn/software/fastdma/.


Introduction
DNA methylation is an important type of epigenetic modification in the human genome. The function of DNA methylation is traditionally described as silencing mark [1]. Recent studies suggest DNA methylation plays more sophisticated roles in chromatin structure and transcriptional regulation [2]. DNA methylation is highly dynamic but under strict control during development [3,4]. While in human diseases, especially in cancer, the DNA methylation states are usually significantly disrupted and those changes are strongly associated with cancer hallmarks [5][6][7].
There are four popular techniques for detecting genome-wide DNA methylation state: whole genome bisulphite sequencing [8], methylation array [9], reduced representation bisulfite sequencing [10] and enrichment based method [11]. Infinium HumanMethy-lation450 Beadchip is a recently developed methylation array platform which detects more than 480,000 cytosine sites along the entire human genome. It covers the majority of reference genes and shows high data reproducibility between technical replicates. Because of the lower cost and easier experimental protocol, this platform is suggested to be suitable for large-scale studies [12]. The Cancer Genome Atlas (TCGA) program, which aims to providing a comprehensive molecular portraits of all types of cancer, has already used this platform to profile the DNA methylation states of hundreds of clinical samples [13,14].
Here we propose our recently developed software, FastDMA, to help researchers to analyze the data generated by this platform, especially for large or clinical datasets. FastDMA uses a unified statistical model, analysis of covariance (ANCOVA), to do both single probe analysis and differentially methylated region (DMR) scanning. Technically, FastDMA is implemented as a standalone software in C++ which can be easily distributed and further developed. Besides, by using parallel computing technique, it can deal with large-scale datasets with very high computational efficiency.
This article is organized as follows: in the Methods section, we described the workflow, the computational model and the software implementation of FastDMA. In the Results section, we first applied FastDMA on three large-scale datasets from TCGA of breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). And then, we compared FastDMA with a recently published software IMA [15] for both the correctness and the computational performance. Finally, we discussed the major advantages of the software and its limitations waiting for further developments.

Overview
The workflow of FastDMA is shown in Figure 1. FastDMA can do data normalization and support both single probe and region based data analyses. The input of FastDMA is usually a data matrix, processed from the original fluorescence signal, containing the beta value (a value indicating the DNA methylation level, between 0,1) and the detection pvalue (a value indicating whether the signal is believable or not) of each probe on the beadchip. For the outputs, FastDMA generated human-readable table-limited text files and the formatted BED files compatible for UCSC Genome Browser visualization.
Except for the case-control comparison, multiple-group comparison is usually required. For example, it is required to compare the DNA methylation levels in three groups if we want to identify the differentially methylated sites among normal, primary tumor, and metastasis samples. Besides, other clinical co-variables (such as sex, age, etc.) rather than the group labels should also be considered. To deal with these issues in a unified framework, FastDMA used ANCOVA as the main statistical model, a generalized linear model combining ANOVA and regression.
For clearly presenting the statistical model, we denoted some variables as follows. The dataset contains n groups, G i , (1#i#n), to be compared. For i-th group G i , there are m i samples, S ij (1#j#m i ). The k-th probe of sample S ij has a methylation level (as beta-value) b ijk . Clinical information (such as sex, age, etc.) is treated as the covariable: C ijl stands for the value of the l-th co-variable of sample S ij . Gaussian noise is assumed and denoted as e. FastDMA rejects the data points with detection pvalues greater than 0.05 (unreliable data points).

Normalization
The normalization task assumes that all the studied samples have the same overall beta value distribution. FastDMA uses quantile normalization [16]: first, calculate the quantiles of beta values (logit transformed) in all the samples; then, calculate the averages of the quantiles; and finally, piecewise linear transformation is applied on each beta value such that every sample has the same quantiles as the average. Users can do normalization with any other methods by themselves and skip this step.

Single Probe Analysis
The first task is to identify the significantly differentially methylated probes. The null hypothesis is that if a probe is not differentially methylated, the beta values of that probe have the same mean values among all the groups. Otherwise, different means among the groups are required. ANCOVA is used in this task to compare two linear regression models that one assumes an overall mean across all the studied groups while the other assumes different means. For the k-th probe of sample S ij belonging to group G i , we have: The null hypothesis is formulated as The alternative hypothesis is formulated as C ijl stands for the value of the l-th co-variable of sample S ij . The middle terms in both equations are the effect of the co-variables. It is assumed to be linear.
ANCOVA compares these two linear regression models and gives a pvalue indicating to what extent one can reject the null hypothesis. Then, false discovery rate (FDR) is calculated according to Benjamini-Hochberg (BH) procedure. The pvalue and FDR thresholds to call significantly differentially methylated probes can be manually adjusted by users. To facilitate the following functional analysis, the probes are annotated (such as nearby genes) using a well-curated database IlluminaHuman-Methylation450k.db (Tim Triche and Jr. R package version 1.4.6.).

DMR Analysis in Predefined Regions
Single differentially methylated probe may be hard to be interpreted biologically. FastDMA provides region-based analysis, which can provide additional information. For a predefined region (such as a promoter or a CpG island) containing several probes, if the region is uniformly methylated among all the groups, then it is assumed that the beta values of every probe in that region are distributed with the same means independent with their group labels. Otherwise, the region is called to be differentially methylated. For the n-th region, there are r probes located in it denoted as P 1 , P 2 ,…, P r . The beta value of the probe P k of the sample S ij is denoted as b (n) ijk to emphasize that the probe belongs to the n-th region.
Then the null hypothesis for this region is formulated as The alternative hypothesis is formulated as Similar to the comparison of the two models in the single probe analysis, ANCOVA can compare the above two series of models and calculate the pvalues to evaluate their differences fitting the same data. And also, FDR is calculated using BH procedure.

DMR Analysis in Arbitrary Regions
In the second task, users should predefine the regions. FastDMA can also scan all the regions covered by probes in the whole genome. The idea to find arbitrary DMRs is to use a sliding window across the genome and test whether each sliding window is a DMR using the above statistical model. Then the overlapped DMRs are merged to give the final results. The default window size is set as 200 bp and only the windows with more than 2 probes are used in the analysis.
It is time-consuming to fit the models in ANCOVA of all the sliding windows across the whole genome. FastDMA provides another intuitive way to deal with this task: it directly uses the FDRs of the probes in a sliding window from the single probe analysis and simply calculates their geometric mean as the score of that region. Windows with this score less than 0.05 are treated as

Implementation
The computerized procedures are coded in C++. An additional script for preprocessing the format of the data is coded in Python. The software follows the GNU general public license version 3 for academic use and can be downloaded via the site http://bioinfo. au.tsinghua.edu.cn/software/fastdma. Both a simplified procedure and the advanced user manual to install and run FastDMA are provided on the website.

Datasets and Experiments
We applied FastDMA on three large-scale Infinium Human-Methylation450 Beadchip datasets from TCGA of BRCA, LUAD, and PRAD. The numbers of samples for the datasets are 120, 69, and 225 respectively.

Many Probes are Found Differentially Methylated in Tumor
FastDMA was used to calculate the pvalue and the absolute difference of the differential DNA methylation level of each probe. The results of the overall statistics are shown as volcano plots in figure 2. The top 10 tumor-hypermethylated probes are summarized in table 1. As we can see in figure 2, many probes are significantly differentially methylated. We set the threshold as pvalue,0.01 and FDR,0.05. FastDMA identifies about half of the probes as differentially methylated ones in BRCA and PRAD datasets. Although less than those two, the LUAD dataset still has more than one third of the probes identified as differentially methylated ones (detailed numbers are shown in table 2). These numbers agree with a recently published study on liver hepatocellular carcinoma [17]. These results imply that the cancer genome is globally changed in the consideration of DNA methylation state.
To illustrate the support for multiple-group and clinical covariable, we constructed a small-scale testing dataset including the array data of four BRCA metastatic tumor samples and randomly chosen size-matched primary tumor samples and normal samples from TCGA. The age at initial pathologic diagnosis is treated as the co-variable. We ran FastDMA to find the differentially methylated sites across normal, tumor and metastatic groups. With the threshold pvalue,0.01 and FDR,0.05, 29 probes are identified as differentially methylated (this small number is mainly due to too few available samples).

DMR Analysis Helps to Find Genomic Regions for Further Study
The DMR scanning can be done either in predefined regions or in arbitrary regions across the whole genome using the same computational model. Here, we only reported the results by scanning arbitrary regions. The statistical significance thresholds were set the same as the single probe analysis (pvalue ,0.01 and FDR ,0.05). Under such settings, FastDMA founds 60877, 48930, and 56179 DMR candidates (mainly spanning several hundred base pairs) in the BRCA, LUAD, PRAD datasets respectively. Table 3 shows the state (tumor-hypermethylated or tumor-hypomethylated) and location (CpG island or non-CGI) of these DMR candidates. Figure 3 is a collection of IGV [18] screenshots showing several cases. All these cases are hypermethylated in cancer and lie in the 59 region of a nearby gene (promoter region). These regions contain dense distribution of many differentially methylated probes and therefore can be interesting candidates for further functional or biomarker analysis.
To validate our results, we randomly selected 20 candidates in the LUAD datasets and then perform validation according to another statistical framework suggested by Jaffe et al [19]. All the 20 candidates have strong statistical significance (with pvalue less than 0.001).

FastDMA Finds Essentially the Same Differentially Methylated Probes as IMA does but Performs with much Higher Computational Efficiency
We compared FastDMA and IMA (under Student's t-test mode) on the three cancer DNA methylation with single probe analysis. The thresholds for identifying differentially probes were set as pvalue ,0.01 and FDR ,0.05. Table 2 shows the numbers of identified probes by IMA and FastDMA under different contexts. It is shown that more than 98% identified probes are overlapped in FastDMA and IMA, which indicates the correctness.
To compare the computational efficiency, we ran them on a set of testing datasets (generated from real data of TCGA) with different data sizes varying from 40 to 220. FastDMA can work in parallel manner, we ran FastDMA twice using either single thread or four parallel threads. To measure the time consumption exactly, we repeatedly ran the software three times on every testing dataset and calculated the average time consumption with standard deviation. As indicated in figure 4, FastDMA works with much higher efficiency.

Discussion
As a promising platform to detect genome-wide DNA methylation state, Infinium HumanMethylation450 Beadchip begins to be widely used with relatively lower cost. FastDMA provides an integrative pipeline and a high computational efficiency solution for analyzing this kind of data. This software can benefit the community of DNA methylation, especially for the large-scale clinical study. FastDMA uses general input format. Except 450 K Beadchip, it can also be used to analyze the data generated from other platforms, such as the previous Infinium HumanMethyla-tion27 Beadchip and the possibly future Infinium beadchips with the same technique.
It should be noted that FastDMA is not elaborately designed for the data normalization. Users can use any other sophisticated normalization procedure (such as in ref. [20]) rather than the simply quantile normalization if this step is crucial for the following data analysis. Serious concerns about DNA methylation array are raised that the platform may contain potential technical artifacts [21]. To solve this problem, we design a command line parameter to filter out probes with genomic variants or cohybridizing according to a recent work [22]. Also, any biological interpretations for the differential methylated probes should be made carefully.
FastDMA can smoothly run under major Linux/Unix systems using command line. For easier visualization of the results, the software formats the outputs as widely-used BED files that can be directly loaded into any genome browser tools. In the future, we will develop a graphic user interface for boarder users with less computer background.