Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Analysis and Visualization Tool for Targeted Amplicon Bisulfite Sequencing on Ion Torrent Sequencers

  • Stephan Pabinger ,

    stephan.pabinger@ait.ac.at

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

  • Karina Ernst,

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

  • Walter Pulverer,

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

  • Rainer Kallmeyer,

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

  • Ana M. Valdes,

    Affiliation The Department of Twin Research & Genetic Epidemiology, King’s College London, St Thomas’ Campus, London, United Kingdom

  • Sarah Metrustry,

    Affiliation The Department of Twin Research & Genetic Epidemiology, King’s College London, St Thomas’ Campus, London, United Kingdom

  • Denis Katic,

    Affiliation Platomics GmbH, Vienna, Austria

  • Angelo Nuzzo,

    Affiliation Platomics GmbH, Vienna, Austria

  • Albert Kriegner,

    Affiliation Platomics GmbH, Vienna, Austria

  • Klemens Vierlinger,

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

  • Andreas Weinhaeusel

    Affiliation Health & Environment Department, Molecular Diagnostics, AIT–Austrian Institute of Technology, Vienna, Austria

Analysis and Visualization Tool for Targeted Amplicon Bisulfite Sequencing on Ion Torrent Sequencers

  • Stephan Pabinger, 
  • Karina Ernst, 
  • Walter Pulverer, 
  • Rainer Kallmeyer, 
  • Ana M. Valdes, 
  • Sarah Metrustry, 
  • Denis Katic, 
  • Angelo Nuzzo, 
  • Albert Kriegner, 
  • Klemens Vierlinger
PLOS
x

Abstract

Targeted sequencing of PCR amplicons generated from bisulfite deaminated DNA is a flexible, cost-effective way to study methylation of a sample at single CpG resolution and perform subsequent multi-target, multi-sample comparisons. Currently, no platform specific protocol, support, or analysis solution is provided to perform targeted bisulfite sequencing on a Personal Genome Machine (PGM). Here, we present a novel tool, called TABSAT, for analyzing targeted bisulfite sequencing data generated on Ion Torrent sequencers. The workflow starts with raw sequencing data, performs quality assessment, and uses a tailored version of Bismark to map the reads to a reference genome. The pipeline visualizes results as lollipop plots and is able to deduce specific methylation-patterns present in a sample. The obtained profiles are then summarized and compared between samples. In order to assess the performance of the targeted bisulfite sequencing workflow, 48 samples were used to generate 53 different Bisulfite-Sequencing PCR amplicons from each sample, resulting in 2,544 amplicon targets. We obtained a mean coverage of 282X using 1,196,822 aligned reads. Next, we compared the sequencing results of these targets to the methylation level of the corresponding sites on an Illumina 450k methylation chip. The calculated average Pearson correlation coefficient of 0.91 confirms the sequencing results with one of the industry-leading CpG methylation platforms and shows that targeted amplicon bisulfite sequencing provides an accurate and cost-efficient method for DNA methylation studies, e.g., to provide platform-independent confirmation of Illumina Infinium 450k methylation data. TABSAT offers a novel way to analyze data generated by Ion Torrent instruments and can also be used with data from the Illumina MiSeq platform. It can be easily accessed via the Platomics platform, which offers a web-based graphical user interface along with sample and parameter storage. TABSAT is freely available under a GNU General Public License version 3.0 (GPLv3) at https://github.com/tadkeys/tabsat/ and http://demo.platomics.com/.

Introduction

DNA methylation is one of the most important epigenetic modifications of the eukaryotic genome and plays essential roles in several biological processes, such as alternative splicing [1], regulation of temporal and spatial gene expression [2,3], as well as genome stabilization [4]. It is the most widely studied epigenetic modification in humans and describes the adding of a methyl group to DNA nucleotides, in humans typically occurring in a CpG dinucleotide context [5]. These CpG sites are often clustered together as CpG islands, which co-locate to distinct gene regions, modifying the expression of these genes. DNA methylation is heritable and known to cause genomic imprinting [6]. It is highly relevant for the study of human disease, as somatic alterations of the methylation status have been described to be associated with aging [7], atherosclerosis [8], and several human diseases [9], most notably cancer [10].

The gold-standard for studying single-base methylation is bisulfite sequencing, also referred to as BS-Seq [11]. Here, DNA is treated with bisulfite, which changes the DNA sequence depending on the methylation status of individual cytosine residues. Unmethylated cytosines are converted to uracil, which is amplified and read by the sequencer as thymine, whereas methylated cytosines are left unconverted. Next, the generated sequences are compared to a known reference yielding single nucleotide resolution information about the methylation status of the DNA. In cases where whole bisulfite genome sequencing is too expensive or not required, a practical alternative is to limit the sequencing per individual to selected, meaningful targets. This approach, commonly termed as targeted sequencing, allows generating consistent data with high coverage around regions of interest [12].

Due to the low entry cost and medium throughput, bench-top next-generation DNA sequencers (454 GS Junior from Roche, Illumina MiSeq, or Ion Torrent Personal Genome Machine (PGM)) are especially equipped for targeted sequencing [13]. The PGM [14] applies a sequencing-by-synthesis approach, uses native dNTP chemistry, and relies on a modified silicon chip to detect hydrogen ions released during base incorporation by DNA polymerase. The sequencer generates single-end (SE) reads in varying quality and length. A known caveat of the PGM is its susceptibility to over-call or under-call the number of homopolymer bases, a feature which needs to be specifically addressed by dedicated downstream analysis methods and tools [15].

Bisulfite sequencing data analysis involves several steps including quality assessment, alignment, and methylation calling [16]. Several tools are available, which use different approaches to analyze the data [17,18]. An important part of the bisulfite sequencing workflow is the translation of raw sequence information into bisulfite calls for each investigated base. The widely used tool Bismark [19] contains multiple routines to carry out alignment of bisulfite-treated reads to a reference genome as well as cytosine methylation calling.

According to omicsmaps.com there are over 300 Ion Torrent PGM machines in use, which could potentially be applied to targeted bisulfite sequencing. However, currently no protocol or analysis solution for bisulfite sequencing on the PGM is officially provided. We have therefore created a novel tool called TABSAT for the analysis and visualization of targeted bisulfite sequencing data generated by Ion Torrent instruments. The tool accepts raw sequencing files as input and outputs result tables containing information about the methylation status of covered CpG sites. Read mapping and methylation calling is handled by Bismark, which has been modified to use the TMAP [20] mapper instead of the default mapper Bowtie2. Results are aggregated in tabular format and automatically visualized as lollipop figures. TABSAT has been designed to run with a minimal set of input parameters but can be customized to support specific questions. In addition, it can be used with data from the Illumina MiSeq platform. The software is freely available at https://github.com/tadkeys/tabsat/.

Overview of TABSAT

We have developed TABSAT, a tool for the analysis of targeted bisulfite sequencing data generated on Ion Torrent systems. It was implemented using Python (version 2.7.6), R (version 3.1.2), and Perl (version 5.18.2) and makes use of several third party tools, including PRINSEQ [21], Bismark [19], and TMAP (version 4.4.8) to address different tasks of the analysis workflow (depicted in Fig 1). TABSAT is available as a standalone application (either as Docker image or as installable source code) and as an application in the web-based Platomics platform (see Fig 2 and section “Availability of TABSAT”). In addition to Ion Torrent data, TABSAT can be used with data from the Illumina MiSeq platform. The tool is capable of handling data generated by Bisulfite-Sequencing PCR (BSP) [22] as well as Methylation-specific PCR (MSP) [23].

thumbnail
Fig 1. The analysis workflow of TABSAT is shown.

The first step conducts quality assessment and generation of a report. Next, sequences are mapped to a reference genome and the methylation information is extracted. Based on several quality statistics and thresholds the generated results are filtered and aggregated into a final output table. In the next step, reads covering all CpGs in a target region are used to calculate methylation-pattern statistics, which are subsequently compared between samples. The last step creates the final output table, graphical representations of the results, and reports basic statistics generated during the analysis workflow.

https://doi.org/10.1371/journal.pone.0160227.g001

thumbnail
Fig 2. The Platomics input page of TABSAT is displayed.

After setting the correct parameters, a new analysis is started by clicking on the “Start Job” button. Running and finished analyses are displayed in the jobs panel and can be selected by clicking on the corresponding job.

https://doi.org/10.1371/journal.pone.0160227.g002

Upon execution of TABSAT, data is gathered from the user-specified sources and processed by the application. Results are organized into folders and stored under the given output path (see section “Using TABSAT”). The first step of the pipeline performs quality assessment using the PRINSEQ [21] software. For each input file, summary statistics, including read length, GC content, quality score distributions, and the number of read duplicates, are created. These metrics along with graphics are summarized in HTML reports, which can be easily displayed in a web browser. After inspection of the quality results, the user can decide which TABSAT parameters (see section “Using TABSAT”) should be used for further downstream processing. By default low quality bases (phred score < 20) are clipped at the end of reads. In addition, filtering of reads based on a user defined length can be performed, which removes all reads that do not meet the minimum length requirement. After finishing the quality control step a second HTML report is generated, which allows users to compare the quality of data before and after quality assessment.

The subsequent alignment module uses Bismark to map the reads to the defined reference genome (download links for human genome version hg19 and mm10 are provided at http://github.com/tadkeys/tabsat). In order to improve the performance of Bismark with Ion Torrent data, we have included the TMAP mapper into the program (see section “Modifications to Bismark”). The pipeline uses the non-directional Bismark setting as default, but can be changed to directional depending on the library used. Next, methylation levels are determined based on the previous mapping result. The final output table (see Fig 3) is generated using a BED [24] file, which is created during the methylation calling step. Alignment strand-specific as well as CpH information is available in additional output files.

thumbnail
Fig 3. The result of a TABSAT run using the Platomics platform is shown.

The output table of a run is presented in the upper part of the large panel containing the sequencing results for each CpG site of all samples. The lower panel shows the graphical representation of the methylation results. Additional information, such as mapping statistics and quality control information, about each sample can be accessed by clicking on the corresponding links.

https://doi.org/10.1371/journal.pone.0160227.g003

Due to the PGM’s ability to produce reads with a length of up to 400 base pairs, entire PCR amplicons can be covered by single reads. Consequently, it is possible to determine the methylation pattern across a single amplicon by taking the information from reads spanning the whole amplicon target (see Fig 4). For example, an amplicon containing 3 CpG sites has 8 (23) different methylation patterns, ranging from fully unmethylated to fully methylated. For each sample and for each target the first step is the selection of all reads, which completely cover the target. Based on a user adjustable value, the user can select the percentage of the target that needs to be covered by a single read. Next, the different methylation-patterns are extracted, aggregated, and ranked based on their occurrence. In a final step, the patterns of all samples are compared with each other and their respective frequency is reported in a text document.

thumbnail
Fig 4. This image illustrates the determination and comparison of methylation patterns.

The top line shows an example target region where the CpG coordinates are highlighted. Displayed below are several reads mapping to this target with different methylation patterns. From these reads methylation-patterns are extracted, ranked according to their frequency, and the patterns are compared between samples.

https://doi.org/10.1371/journal.pone.0160227.g004

The output creation module aggregates the generated methylation information of all individual targets for all samples into a final table (see Fig 3). Several filtering steps are applied to remove false methylation calls (see section Using TABSAT): first low coverage methylation calls are removed from the output using a cutoff filter that can be set as a parameter. In addition, TABSAT determines the average coverage for each CpG per target/sample using the methylated and unmethylated coverage information. Next, it calculates a mean from these averages for each sample/target combination, which is used to remove every methylation call where the total coverage of a CpG is lower than the calculated mean coverage minus one standard deviation. As we use an amplicon sequencing approach, we expect a uniform coverage for each sample/target. By taking the average coverage of all CpGs within one sample/target and comparing it to the sum of each CpG we ensure to remove only extreme outliers. In addition, all CpGs that are not on the user specified strand are filtered out.

Finally, for each sample and for each CpG in the target region the number of methylated reads, the number of unmethylated reads, and the corresponding methylation percentages are reported. The results are also visualized as lollipop diagrams (see Fig 5) providing an intuitive way to investigate the methylation of the CpGs. Depending on the size of the target regions, it might be the case that all CpGs of the target region cannot be adequately displayed if they are spaced according to their position in the target region. Therefore, two versions of the lollipop diagram are created: a) CpGs are equally spaced along the target; b) CpGs are spaced according to their actual chromosomal coordinate. In addition to the graphical representations, general statistics, such as mapping results, and target coverage are reported.

thumbnail
Fig 5.

Displayed is an example of the two automatically generated lollipop diagrams containing the CpGs of the target region: a) all CpGs equally spaced across the target; b) all CpGs spaced according to their true chromosomal coordinate.

https://doi.org/10.1371/journal.pone.0160227.g005

Availability of TABSAT

TABSAT is open-source and freely available. The pipeline consists of several modules including quality assessment, genomic alignment, methylation calling, CpG filtering, merging of samples, as well as methylation pattern calculation, visualization, and summarization of results. The pipeline is available in three versions: i) as Docker container with configured dependencies, required software, human genome assembly hg19, and mouse genome mm10 (available at https://github.com/tadkeys/tabsat) ii) as source code to be installed on a local infrastructure (accessible at https://github.com/tadkeys/tabsat) and iii) as a Platomics (http://demo.platomics.com) web application.

The Platomics platform supports creation, execution, and distribution of life-science applications. It manages storing, retrieving, organizing, and analysis of data and provides an interface to create life-science applications. A configurable output dashboard enables analysis results to be presented according to the needs of the application. The Platomics TABSAT version can be freely accessed at http://demo.platomics.com and after successful login (credentials available on github page) the user can perform analyses by selecting TABSAT in the workspace panel on the left side. After setting the correct parameters, target, and input files, the analysis is started by clicking on the “Start Job” button (see Fig 2). Currently running analyses are displayed in the lower left panel and results can be viewed by clicking on the corresponding job (see Fig 3). The complete result set can be downloaded as a compressed file allowing further downstream analyses.

The user manual including information for the setup of the pipeline using the source code as well as the Docker container can be found at https://github.com/tadkeys/tabsat. Our pipeline has no limitations concerning the number or length of FASTQ sequences and the used reference genome assembly. Test datasets can be downloaded at https://github.com/tadkeys/tabsat. They include FASTQ files and target files.

Modifications to Bismark

Bismark is a software package containing routines to map reads to a reference genome (using Bowtie [25], or alternatively Bowtie 2 [26]) and determine their methylation state. It provides an attractive combination of processing speed, genomic coverage, and quantitative accuracy and is one of the most widely used tools to map and analyze bisulfite treated short reads [17,18,27]. The software is able to handle single-end as well as paired-end reads of both directional and non-directional bisulfite libraries. Bismark is open-source, written in Perl and executed from the command line. Due to the initial poor alignment results (see Discussion) and the specific error profile of reads produced by the Ion Torrent systems, we decided to replace Bismark’s default mapper Bowtie2 (2.2.4) with TMAP (4.4.8) [20], the read mapping program provided by the Ion Torrent suite. TMAP has been specifically optimized to handle reads from Ion Torrent sequencing platforms in order to meet specific data mapping challenges. We have therefore extended the source code of Bismark to support the use of TMAP as an additional mapping program. Specifically, a new method to create the index of the reference genome and to perform single-end (SE) alignment using TMAP was integrated.

Performance Evaluation

We processed the sequencing data of 48 samples (as described in the methods section) using the standalone version of TABSAT on a Linux Ubuntu 14.04 server with 16 cores and 32 GB memory. Using the Ion Torrent “314” sequencing chip, 16 barcoded samples (each having 53 amplicon targets) were multiplexed per run yielding on average approximately 25,000 reads per sample (~380 reads per amplicon) with an average length of 125 base pairs. In total 1,543,767 reads were analyzed, where 1,196,822 reads could be aligned to the human reference genome (hg19) yielding a mapping average of 77.9% and a mean target coverage of 282X. Results of all 48 samples were aggregated and reported in a final table.

Mapping performance

To assess the performance of the improved mapping capabilities, we used one of the three targeted bisulfite datasets (314 chip) with 16 samples (each with 53 targets) to compare the output of the modified Bismark-TMAP version to the default version of Bismark (0.13.1). Hereby, we could observe that the average number of mapped reads per dataset increased from 31,469 to 54,395 (45.23% to 78.12%). Consequently, the average coverage of the targets improved from 362X to 483X. In addition, we could observe an increase of the average coverage of CpG sites corresponding with the 450k chip (see below) from 392X to 599X (see Table 1). It should be noted that bisulfite conversion introduces additional homopolymer stretches, which are traditionally difficult to analyze on Ion Torrent systems. However, TABSAT analyses only CpG sites, and therefore avoids these homopolymer rich regions.

thumbnail
Table 1. Output comparison using 16 datasets between the standard version and the Ion Torrent optimized version of Bismark.

Parameters standard version: -bowtie2—non_directional; parameters Ion Torrent version: -tmap—non_directional.

https://doi.org/10.1371/journal.pone.0160227.t001

Correlation with Illumina Infinium HumanMethylation450

In order to assess the performance of the targeted bisulfite sequencing setup, we compared the obtained sequencing results to an Illumina 450k methylation chip experiment using the same 48 DNA samples as used for targeted bisulfite sequencing (see Fig 6).

thumbnail
Fig 6. An example target region with marked CpG positions and forward/reverse primers is shown.

The 450k target is depicted below as well as the CpGs measured using targeted sequencing. Grey color intensity is correlated to the methylation signal (450k) and percent methylation (sequencing).

https://doi.org/10.1371/journal.pone.0160227.g006

We correlated the methylation results of single Illumina 450k probes with the sequencing result of the corresponding single CpG for 53 targets per sample (see Fig 7).

thumbnail
Fig 7.

Presented is the correlation result between the Illumina 450k methylation chip and targeted bisulfite sequencing data: a) Barplot of the correlation for each sample; b) Scatterplot of 450k (x-axis) versus targeted bisulfite data (y-axis).

https://doi.org/10.1371/journal.pone.0160227.g007

Details about the selected amplicons, such as position, length, number of CpGs, melting temperature, GC percentage and number of mapped reads, are presented in Table 2 and S1 Table. The calculated median Pearson correlation yields 0.91 with a median R2 of 0.83. We also calculated the correlation between the average methylation of each target, instead of the single corresponding CpG, to the 450k result yielding a median correlation of 0.89 with a median R2 of 0.79.

thumbnail
Table 2. Details of the used amplicons: position on the human reference genome hg19, 450k methylation site ID, length of the target, number of CpGs, melting temperature, GC percentage, and average number of mapped reads (per sample).

https://doi.org/10.1371/journal.pone.0160227.t002

Discussion

We have developed a novel tool for the analysis of targeted bisulfite sequencing, which is especially equipped to handle, in addition to Illumina data, sequences generated on Ion Torrent systems. To date, several tools exist for the analysis of bisulfite data (reviewed here [17]), amplicons from bisulfite flowgram sequencing (Amplikyzer [28]) and locus-specific analysis of 5-methylcytosine (BiQ Analyzer HiMod [29]). However, none of these tools is specifically tailored for the analysis of Ion Torrent sequencing data and provides a one-stop solution from raw sequencing data to final results. Furthermore, the Ion Torrent PGM software platform currently does not support the analysis of bisulfite sequencing data. TABSAT comprises an analysis pipeline containing quality control, alignment, methylation calling, and output generation. In order to select the best mapping software for our purpose, we have evaluated several bisulfite analysis programs, such as Bismark [19], BS-Seeker2 [30], and BSMAP [31]. All programs produced similar or poorer mapping results, and offer different downstream analysis capabilities. Preliminary analysis with one input file using default parameters resulted in around 45%, 42%, and 20% of mapped reads for Bismark, BsSeeker2, and BSMap, respectively (see https://github.com/tadkeys/tabsat/tree/master/comp_tools). Based on the availability of the source code, the possibility to integrate a different mapping program, and the positive reviews, we decided to include Bismark in our workflow to handle mapping of sequencing reads and methylation calling.

Due to the initial suboptimal alignment results of the default Bismark version, we decided to incorporate TMAP [20] into the Bismark software, a dedicated mapper for Ion Torrent reads. Reads from Ion Torrent sequencing devices are usually longer than their Illumina counterparts and show a distinct different error profile, especially in homopolymer regions. The boost in read length causes an increased number of sequencing errors per read, which requires changing the mapping settings as the standard parameters for controlling mapping error may be too strict. As the supported aligners in Bismark (bowtie or bowtie2) are configured to be used with Illumina sized reads, we decided to include the Ion Torrent TMAP program, which has been designed to overcome these limitations.

Consequentially, we could show that the number of mapped reads increased on average from 31,469 to 54,395 (45.2% to 78.1%) on an evaluation dataset including 16 samples. Therefore, users will be able to process more samples on a chip, reducing the cost per sample. In order to check for correct mapping, we compared the number of positions where reads mapped outside of the target regions between the default and the TMAP Bismark versions. We expect reads to map outside of the target region, for example, due to unspecific PCR amplification. On average, we observed 2.7 additional positions where reads mapped outside of the target region when running the TMAP version, which can be explained due to the increased mapping potential of the TMAP program.

TABSAT supports the analysis of multiple samples (e.g. all barcoded samples on one chip) at once and outputs a combined result table for all samples. This facilitates interpretation and comparison of multi-sample studies. An important issue when working with sequencing data is to remove false results introduced by sequencing errors. Therefore, we have included a filtering mechanism into the workflow to reduce the number of uncertain methylation calls based on read coverage and artifacts. This strategy allows the generation of reliable results from which to draw biological meanings. Another important part of data analysis is the intuitive visualization of results. We have incorporated an automatic graph generation procedure which outputs two different types of lollipop diagrams. In addition, results are reported in tabular format, which can be used in further downstream analysis methods.

Since the Ion Torrent PGM is capable of producing reads with a length of up to 400 base pairs, we can use them to extract specific methylation-patterns. As one read originates from one distinct biological source, it could be possible to deduce biological causes that lead to different methylation-patterns in one sample. Especially interesting is the comparison of methylation-patterns between different biological groups, which may contribute to new methods for group classification. Consequently, descriptive statistics based on these methylation-patterns are automatically calculated for each sample and subsequently compared between samples.

The whole analysis solution has been designed to work with minimal user input and outputs results in clearly arranged tables and lollipop diagrams. The large number of Ion Torrent PGM sequencers available world-wide shows that there is a large community, which would benefit from this tailored analysis of bisulfite sequencing data. In addition, more than 100 Ion Proton sequencers have been registered on omicsmaps.com, which generate reads with a similar error profile as the PGM. Consequently, they would also benefit from a dedicated analysis workflow for generating high-quality results. Therefore, the presented work will help to unlock the power of Ion Torrent (and potentially Ion Proton) for bisulfite sequencing and DNA methylation analysis.

TABSAT is designed for amplicon studies and does not support the analysis of whole genome bisulfite sequencing projects. Furthermore, it is not limited to, but works best with amplicons smaller than 500bp as these can be efficiently visualized using lollipop diagrams. The aim of TABSAT is to cover the primary analysis of raw targeted bisulfite sequencing data to obtain methylation information for each analyzed cite. Given the provided comprehensive output, researchers can use additional downstream tools, such as the R project for statistical computing, to compare the methylation level between different groups of samples.

To assess the accuracy of TABSAT, we compared the sequencing results with data from an Illumina 450k methylation chip. The calculated median correlation of 0.91 confirms that targeting bisulfite sequencing on an Ion Torrent PGM yields accurate and reproducible results. A recent publication [32] has demonstrated effective methylated DNA immunoprecipitation sequencing (MeDIP-Seq) [33] on an Ion Torrent PGM, which was successfully validated using a 450k chip. This study shows the practicability of performing DNA methylation sequencing studies on a PGM and emphasizes the need for dedicated analysis solutions.

TABSAT can be used as standalone software, conveniently available as a Docker container. Docker containers wrap up software in a complete filesystem that contains everything it needs to run, making it an ideal solution to execute bioinformatics software in a self-contained and precisely controlled environment [34]. In addition, TABSAT is available as an embedded application within the Platomics life-science data analysis system. This graphical, web-based user interface has been especially designed to be usable without informatics knowledge. The results are presented in an intuitive way enabling an exploration of the data. As the complete result set can be downloaded as a compressed file, further in-depth downstream analysis can be easily performed.

In summary, TABSAT offers a novel and unreported way to analyze and interpret targeted bisulfite sequencing data.

Materials and Methods

Sample Material

DNA samples from the TwinsUK cohort were provided by the Department of Twin Research at King’s College London (KCL), TwinsUk is the largest registry of adult twins in the United Kingdom. It started in 1992 and currently encompasses approximately 12,000 volunteer twins from all over the UK [35]. The study was approved by St. Thomas’ Hospital Research Ethics Committee, and all twins provided informed written consent to participate in the study. 24 monozygotic twin pairs (n = 48 samples) with a difference of 800 or more grams in birthweight between twins were selected for this study. DNA isolation was done from whole blood at the KCL’s laboratories and subsequently provided to AIT alongside sample annotation

Illumina Infinium HumanMethylation450 BeadChip

gDNA isolated from peripheral blood of 24 MZTs were subjected to DNA methylation analyses using the Illumina Infinium HumanMethylation450 BeadChip (450k Chip) following the manufacturer’s protocol. Briefly, 500 ng DNA per sample was deaminated using the Zymo EZ DNA Methylation Kit. The contained bisulfite solution converted unmethylated cytosines into uracil, by removing the amino group of the cytosine, while 5’-methylated cytosines remained unaffected. Incubation was done by applying 16 cycles of a two-step temperature program, starting at 95°C for 30 seconds followed by 50°C for 60 minutes.

The deaminated DNA was eluted in 12 μl elution buffer of which 4 μl were subjected to the 450k protocol. That protocol combines a genome wide amplification, an enzyme based targeted fragmentation, and a cleanup of the DNA prior to hybridization onto the bead chips. Since one 450k chip allows the investigation of 12 samples, the 48 samples were randomized before hybridization on four 450k chips. The methylation module of Illumina’s Genome Studio Software was used to subtract the background from the raw data and to perform an Illumina specific normalization which depends on internal control probes. Received beta-values were exported to a text-file for further analysis using R (V2.15) and the IMA package [36].

Targeted bisulfite sequencing

Fifty-three differentially methylated loci (DML) identified by the 450k chip were selected for further analyses by targeted bisulfite sequencing using the Ion Torrent PGM platform (see S1 Table). MSRE-HTPrimer [37] was used to design high quality assays with a length of 150–320 bp for these 53 DML. BSP assays to amplify the methylated and unmethylated alleles were designed to cover the region of each DML probe of the 450k chip. Consequently, each individual assay contains the CpG of interest and multiple additional CpGs (typically up to 30). The designed BSP assays were setup and subsequently used to enrich the respective regions in the same 48 samples used for the 450k analysis. Target enrichment was done by qPCR single reaction, followed by pooling together all 53 targets for each sample. Library preparation and targeted sequencing on the PGM was conducted according to the manufacturer protocol. In brief, an individual barcode and a sequencing adapter were attached to the pooled targets (53) of each sample. Next, a pool containing all targets and all barcoded samples (48 samples each with 53 targets) was prepared for the final sequencing protocol, which includes an emulsion PCR to enrich the targets on the microspheres. The 48 samples were split into 3 equal batches of 16 samples. Each batch was loaded onto an Ion 314 chip and analyzed on the PGM.

Using TABSAT

In order to run the command line version of TABSAT, a system with a Linux operating system and current hardware (50 GB HDD, 16GB RAM) is required. Installation descriptions are detailed at http://github.com/tadkeys/tabsat. The Platomics version of TABSAT can be freely used and does not require any installation. In order to try-out the software please consult the information at https://github.com/tadkeys/tabsat/blob/master/demo.md. TABSAT can be used with the following parameters:

Target file [-t]: The first input is the target file containing the region of interest. The pipeline accepts a tab-separated text file containing name, chromosome, start position, end position, and strand for all targets.

Library [-l]: This input can be either directional (DIR) or non-directional (NONDIR) depending on the used bisulfite sequencing adapters.

Genome [-g]: This parameter can be either hg19 (human) or mm10 (mouse).

Sequencing library [-e]: This input can be either SE (single end) or PE (paired-end).

Aligner [-a] (optional): This input selects the aligner (TMAP or Bowtie2) used for mapping the reads to the reference genome.

Minimum read length [-m] (optional): This parameter is used for filtering reads that are shorter than the given threshold.

Minimum 3’ read quality [-q] (optional): Bases that are below the given threshold are removed from the 3’ end of the reads.

Percent of target covered by a read for pattern creation [-p] (optional): This value specifies the percentage of the target that needs to be covered by a read to include it for pattern analysis.

Minimum number of mapped reads per CpG [-r] (optional): Number of reads that need to be present at each CpG site.

Lollipop sort order [-s] (optional): List of samples that is used to specify the order in the lollipop plots.

Output directory [-o]: This parameter determines where the output of the analysis run is stored.

Input files: With the last input the user specifies a list of FASTQ files (one or many targeted deep bisulfite sequencing runs) as produced by the sequencer. If a barcoded library was used, FASTQ files need to be generated for all barcodes before use (barcode splitting is not performed by TABSAT). Optionally the user can specify a directory (using–d) containing several FASTQ files.

Supporting Information

S1 Table. This table contains information about the 53 used amplicon targets for bisulfite sequencing.

https://doi.org/10.1371/journal.pone.0160227.s001

(XLSX)

Acknowledgments

This work was supported by the European Union, FP7 small medium focused project 277849 EurHEALTHAgeing (http://eurhealth.org/).

Author Contributions

Conceived and designed the experiments: SP KE WP RK AV SM DK AN AK KV AW. Performed the experiments: SP KE WP DK. Analyzed the data: SP KE WP. Contributed reagents/materials/analysis tools: WP AV SM DK AW. Wrote the paper: SP WP RK AW.

References

  1. 1. Maunakea AK, Chepelev I, Cui K, Zhao K. Intragenic DNA methylation modulates alternative splicing by recruiting MeCP2 to promote exon recognition. Cell Res. 2013; 23: 1256–1269. pmid:23938295
  2. 2. Muers M. Gene expression: Disentangling DNA methylation. Nat. Rev. Genet. 2013; 14: 519. pmid:23797851
  3. 3. Wagner JR, Busche S, Ge B, Kwan T, Pastinen T, Blanchette M. The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts. Genome Biol. 2014; 15: R37. pmid:24555846
  4. 4. Jones PA, Gonzalgo ML. Altered DNA methylation and genome instability: a new pathway to cancer. Proc. Natl. Acad. Sci. U.S.A. 1997; 94: 2103–2105. pmid:9122155
  5. 5. Portela A, Esteller M. Epigenetic modifications and human disease. Nat. Biotechnol. 2010; 28: 1057–1068. pmid:20944598
  6. 6. Paulsen M, Ferguson-Smith AC. DNA methylation in genomic imprinting, development, and disease. J Pathol. 2001; 195: 97–110. pmid:11568896
  7. 7. Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013; 14: R115. pmid:24138928
  8. 8. Dong C, Yoon W, Goldschmidt-Clermont PJ. DNA methylation and atherosclerosis. J. Nutr. 2002; 132: 2406S–2409S. pmid:12163701
  9. 9. Robertson KD. DNA methylation and human disease. Nat. Rev. Genet. 2005; 6: 597–610. pmid:16136652
  10. 10. Kulis M, Esteller M. DNA methylation and cancer. Adv. Genet. 2010; 70: 27–56. pmid:20920744
  11. 11. Noehammer C, Pulverer W, Hassler MR, Hofner M, Wielscher M, Vierlinger K, et al. Strategies for validation and testing of DNA methylation biomarkers. Epigenomics. 2014; 6: 603–622. pmid:25531255
  12. 12. Mertes F, Elsharawy A, Sauer S, van Helvoort J, van der Zaag P, Franke A, et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genomics. 2011; 10: 374–386. pmid:22121152
  13. 13. Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 2012; 30: 434–439. pmid:22522955
  14. 14. Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011; 475: 348–352. pmid:21776081
  15. 15. Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 2013; 9: e1003031. pmid:23592973
  16. 16. Krueger F, Kreck B, Franke A, Andrews SR. DNA methylome analysis using short bisulfite sequencing data. Nat. Methods. 2012; 9: 145–151. pmid:22290186
  17. 17. Tran H, Porter J, Sun M, Xie H, Zhang L. Objective and comprehensive evaluation of bisulfite short read mapping tools. Adv Bioinformatics. 2014; 2014: 472045. pmid:24839440
  18. 18. Kunde-Ramamoorthy G, Coarfa C, Laritsky E, Kessler NJ, Harris RA, Xu M, et al. Comparison and quantitative verification of mapping algorithms for whole-genome bisulfite sequencing. Nucleic Acids Res. 2014; 42: e43. pmid:24391148
  19. 19. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011; 27: 1571–1572. pmid:21493656
  20. 20. IonTorrent. Torrent Suite; 2015. Available: https://github.com/iontorrent/TS.
  21. 21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011; 27: 863–864. pmid:21278185
  22. 22. Clark SJ, Harrison J, Paul CL, Frommer M. High sensitivity mapping of methylated cytosines. Nucleic Acids Res. 1994; 22: 2990–2997. pmid:8065911
  23. 23. Herman JG, Graff JR, Myöhänen S, Nelkin BD, Baylin SB. Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc. Natl. Acad. Sci. U.S.A. 1996; 93: 9821–9826. pmid:8790415
  24. 24. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002; 12: 996–1006. pmid:12045153
  25. 25. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10: R25. pmid:19261174
  26. 26. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9: 357–359. pmid:22388286
  27. 27. Chatterjee A, Stockwell PA, Rodger EJ, Morison IM. Comparison of alignment software for genome-wide bisulphite sequence data. Nucleic Acids Res. 2012; 40: e79. pmid:22344695
  28. 28. Sven Rahmann, Jasmin Beygo, Deniz Kanber, Marcel Martin, Bernhard Horsthemke, Karin Buiting. Amplikyzer: Automated methylation analysis of amplicons from bisulfite flowgram sequencing. 2013.
  29. 29. Becker D, Lutsik P, Ebert P, Bock C, Lengauer T, Walter J. BiQ Analyzer HiMod: an interactive software tool for high-throughput locus-specific analysis of 5-methylcytosine and its oxidized derivatives. Nucleic Acids Res. 2014; 42: W501–7. pmid:24875479
  30. 30. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, et al. BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics. 2013; 14: 774. pmid:24206606
  31. 31. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009; 10: 232. pmid:19635165
  32. 32. Corley MJ, Zhang W, Zheng X, Lum-Jones A, Maunakea AK. Semiconductor-based sequencing of genome-wide DNA methylation states. Epigenetics. 2015; 10: 153–166. pmid:25602802
  33. 33. Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, et al. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat. Genet. 2005; 37: 853–862. pmid:16007088
  34. 34. Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. The impact of Docker containers on the performance of genomic pipelines. PeerJ. 2015; 3: e1273. pmid:26421241
  35. 35. Moayyeri A, Hammond CJ, Valdes AM, Spector TD. Cohort Profile: TwinsUK and healthy ageing twin study. Int J Epidemiol. 2013; 42: 76–85. pmid:22253318
  36. 36. Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, Ambrosone CB, et al. IMA: an R package for high-throughput analysis of Illumina's 450K Infinium methylation data. Bioinformatics. 2012; 28: 729–730. pmid:22253290
  37. 37. Pandey RV, Pulverer W, Kallmeyer R, Beikircher G, Pabinger S, Kriegner A, et al. MSRE-HTPrimer: a high-throughput and genome-wide primer design pipeline optimized for epigenetic research. Clin Epigenetics. 2016; 8: 26. pmid:26949424