Transcriptional Profiling of Endocrine Cerebro-Osteodysplasia Using Microarray and Next-Generation Sequencing

Background Transcriptome profiling of patterns of RNA expression is a powerful approach to identify networks of genes that play a role in disease. To date, most mRNA profiling of tissues has been accomplished using microarrays, but next-generation sequencing can offer a richer and more comprehensive picture. Methodology/Principal Findings ECO is a rare multi-system developmental disorder caused by a homozygous mutation in ICK encoding intestinal cell kinase. We performed gene expression profiling using both cDNA microarrays and next-generation mRNA sequencing (mRNA-seq) of skin fibroblasts from ECO-affected subjects. We then validated a subset of differentially expressed transcripts identified by each method using quantitative reverse transcription-polymerase chain reaction (qRT-PCR). Finally, we used gene ontology (GO) to identify critical pathways and processes that were abnormal according to each technical platform. Methodologically, mRNA-seq identifies a much larger number of differentially expressed genes with much better correlation to qRT-PCR results than the microarray (r2 = 0.794 and 0.137, respectively). Biologically, cDNA microarray identified functional pathways focused on anatomical structure and development, while the mRNA-seq platform identified a higher proportion of genes involved in cell division and DNA replication pathways. Conclusions/Significance Transcriptome profiling with mRNA-seq had greater sensitivity, range and accuracy than the microarray. The two platforms generated different but complementary hypotheses for further evaluation.


Introduction
New technologies permit the evaluation of global patterns of gene expression -mRNA levels -from healthy and diseased tissues. The simultaneous assessment of changes in expression of many genes -up to the whole genome level -can then be analysed simultaneously using bioinformatic tools that can reveal new patterns or networks of differentially regulated genes [1]. These technologies have transformed our conception of the molecular mechanisms underlying complex diseases such as cancer and degenerative illnesses [2,3]. Over the past five years, microarrayswhich are a hybridization-based technology -have been the main platform used for transcription profiling. However, within the last two years, high throughput next-generation mRNA sequencing methods now allow for quantitative measurement of expression levels on a genome-wide basis at the level of a single nucleotide.
We had the opportunity to compare technologies used to generate expression profiles of cultured fibroblasts from Amish children with a rare autosomal recessive condition called endocrine-cerebro-osteodysplasia (ECO; MIM 612651). ECO is a multi-system neonatal lethal disorder -a kinasopathy [4] affecting mainly the skeletal, cerebral and endocrine systems that results from a homozygous nonsynonymous mutation (R272Q) in the ICK gene encoding intestinal cell kinase [5]. ICK, also known as MAK-related kinase (MRK), is ubiquitously expressed, particularly in brain, spinal cord, testis, and ovary [6,7,8]. Catalytic domains of ICK share ,40% identity with those of consensus MAPKs, which are regulators of cell-cycle entry and transition by cyclin-dependent protein kinases (CDKs) [9]. Residue 272 lies within the nuclear localization signal sequence [5,9] and the R272Q mutation both impairs nuclear localization and reduces catalytic activity [5].
Since very little was known about the downstream biological pathways and gene networks that were affected in ECO patients, we profiled the transcriptomes of cultured skin fibroblasts from ECO patients. We used 2 independent technological platforms to accomplish this, namely cDNA microarrays and next-generation mRNA sequencing (mRNA-seq). This provided a unique opportunity to validate the findings of each platform using quantitative RT-PCR (qRT-PCR) and to compare the networks of genes that were identified by GOStat, a database that lists all the overrepresented GO terms according to statistical significance [10], and KEGG pathway, a collection of manually curated pathway maps [11].

Participants and Ethics Statement
Primary skin fibroblasts from two subjects affected with ECO (designated 030950 and 040786) and one unrelated unaffected from the community (070280) were obtained from forearm puncture biopsies from affected individuals. The skin fibroblast line AG03348 (or 3348) was obtained from the Coriell Cell Repository (Coriell Institute for Medical Research, Camden NJ), and served as another unaffected non-Amish control cell line. Cultured primary skin fibroblasts were maintained at 37uC and 5% CO 2 in Ham's F-10 medium (Gibco, Carlsbad CA) with Lglutamine supplemented with 10% fetal bovine serum and 16 antibiotic/antimycotic mixture (Gibco). For passaging, cells were released from the dish using 0.1% (w/v) trypsin and 0.02% (w/v) EDTA washes and re-distributed onto another dish. Samples from all passages were stored in 280uC.
Tissue samples were provided for research purposes, with approval by the Office of Research Ethics (University of Western Ontario). Participating parents provided informed consents and did not receive any financial compensation.

Skin Fibroblast Cell Line Doubling Measurements
Each cell line was passaged and maintained in 90 mm diameter dishes (Gibco) two to three times weekly. After release using trypsin and EDTA, washed cells were diluted in enriched Ham's F-10 medium. Ten microlitres of re-diluted cells were counted using a haemocytometer and seeded on fresh culture dishes. Cell number counted (n) was used to calculate the number of cells per mL (N), with the formula N = n610 4 . This procedure was carried out until the same number of cells or fewer was obtained from subculturing over three consecutive passages. Cell growth was measured by calculation of population doubling (PD) using the formula: where log H is the logarithm of the number of cells harvested after 3 or 4 days of growth and log S is the logarithm of the number of cells on the first day of each passage. Accumulated population doublings (APD) were calculated by the summation of PDs.

RNA Isolation
For each cell line, RNA from a ''young'' cultured age passage was extracted at an APD of 3-5 whereas RNA from an ''old'' cultured age passage was extracted at an APD of ,20-22. In total 8 samples were extracted, with young and old passages for affected cell lines 030950 and 040786 and for unaffected cell lines 070280 and 3348. RNA was extracted from cultured skin fibroblasts using the RNEasy kit (Qiagen, Mississauga, ON). Briefly, cells were lysed with a buffer containing guanidine-isothiocyanate and b-mercaptoehtanol. Genomic DNA was sheared using a Shredder column (Qiagen). Ethanol was added to the resulting solution allowing the RNA to bind to the silicia-gel-membrane spin column. Bound RNA was washed with ethanol and eluted with RNAse-free water. Once RNA was extracted, its concentration and purity were measured using the NanoDrop spectrophotometer (Thermo Scientific; Ottawa, ON) and the Agilent 2100 Bioanalyzer (Agilent Technologies; Palo Alto, CA). Samples were stored at 280uC.

cDNA Microarray Hybridization and Analysis
All microarrays were processed at the London Regional Genomics Centre (http://www.lrgc.ca). Biotinylated RNA was prepared from 2 mg of total RNA using the two-cycle amplification protocol. Double-stranded cDNA was synthesized using SuperScript II (Invitrogen, Carlsbad, CA) and oligonucleotide primers. Biotin-labelled complementary RNA (cRNA) with incorporated biotinylated UTP and CTP was prepared using in vitro transcription of cDNA with the Bizarre High-Yield RNA Transcript Labeling kit (Enzo Brioche, New York, NY). Fifteen micrograms of labelled cRNA was hybridized to Human 1.0 ST array GeneChips for 16 h at 45uC (Affymetrix, Santa Clara, CA). The chips were stained with streptavidin-phycoerythrin solution. Liquid handling was performed by the GeneChip Fluidics Station 450 (Affymetrix) and arrays were scanned using the GeneChip Scanner 3000 (Affymetrix). Signal intensities for genes were generated using GCOS1.4 (Affymetrix) using default values for the Statistical Expression algorithm parameters. Probe level data was imported into Genomics Suite software (Partek, St. Louis, MO); the student's paired t test was used to detect differences between them.

mRNA Deep Sequencing Platform Hybridization and Analysis
Five mg of total RNA was processed using proprietary kits from Illumina (Hayward, CA). Briefly, PolyA + RNA was isolated from total RNA fragmented using Ambion RNA fragmentation buffer. cDNA synthesis was performed with Invitrogen random hexamer primers and cDNA was purified using QIAquick PCR spin column (Qiagen). Ends were blunted and 39-A overhangs introduced using T4 DNA polymerase and E. coli DNA polymerase I Klenow fragment. cDNAs were ligated to adapters with a single 'T' base overhang. After selection of 150-200 bp fragments from 2% low-range agarose gel, samples were amplified by 18 PCR cycles to enrich cDNAs with correctly ligated adapters and to amplify the amount of DNA in the library. Samples were loaded on a Cluster Station to create flow cells of clonal single molecular array (CSMA) and sequenced on the Illumina platform [12]. The analysis pipeline encompassed primary data acquisition, base calling, and calculating confidence scores from the fluorescence signals on the Genome Analyzer. Each transcriptome was sequenced at a depth of 30-40 million single reads, with read lengths up to 75 bp. Raw reads were converted to FASTQ data format since this format compactly stores a quality score for each base, which could be used to filter individual sequences. The quality-filtered reads were then aligned by TopHat [13], which map them to both the UCSC reference human genome and exonexon splice junctions as annotated by Ensembl. Cufflinks [14] then provided the gene expression levels, based on the TopHat alignments and Ensembl annotation. Gene expression was quantified as 'reads per kilobase of exon model per million mapped reads' (RPKM) [15], and the expression cutoff was 0.5 RPKM -that is, the transcript of the gene was present if there were $10 reads that mapped uniquely to a single genomic locus. More than 18,815 Ensembl annotated protein-coding genes were compared to create a gene list of differentially expressed genes based on disease status of the cell lines. Transcript level data were then imported into Genomics Suite (Partek, St. Louis, MO) for additional analyses; comparisons were performed using student's paired t test.
Total RNA (100 ng) was reverse transcribed using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems) in a 20 mL reaction volume and amplified using TaqMan Assay probes (Applied Biosystems) in a 7900 HT Real Time PCR System (Applied Biosystems) with the 40 cycle amplification protocol. Amplified sequences were detected using the Prism sequence detector (Applied Biosystems) according to manufacturer's instructions. Experiments were done in triplicate, using GAPDH as an internal reference, on the young and old age passages from affected cell lines 030950 and 040786 and from unaffected cell lines 070280 and 3348. Expression values were standardized to values obtained with the standard RNA using the delta Ct method. Standard curves had r 2 values.0.98.

Biological Interpretations
The cDNA microarray data set was first run on the gene ontology (GO)-ANOVA analysis tool (Partek). The mRNA-seq data set was also biologically interpreted using GO through a web-

Hyperproliferation of Cultured Skin Fibroblasts from ECO-Affected Individuals
The complete lifespan growth curves of the cell lines are shown (Figure 1). To reach senescence, the 030950 cell line was passaged 27 times for 96 days, 040786 was passaged 25 times for 88 days, 070280 was passaged 25 times for 88 days, and 3348 was passaged 19 times for 66 days. The mean PD from the 2 normal fibroblast lines was 0.5960.03, while the mean PD for the 2 ECO-affected fibroblast cell lines was 1.0760.05 (P,0.05). These findings, obtained from experiments performed in duplicate, indicated an approximate doubling of the proliferation rate in affected skin fibroblasts compared to unaffected skin fibroblasts.

Transcriptome Profiling
Both microarray and mRNA-seq platforms showed consistently high numbers of genes that were differentially expressed in ECO-affected versus unaffected fibroblasts. Table 1 shows a comparison of the number of significantly differentially expressed genes based on fold change determined by each platform. Overall, mRNA-seq identified a greater number of differentially expressed genes with fold changes $2.0 than microarrays (708 versus 206, respectively), with virtually identical numbers of genes with fold changes between 1.0 and 2.0 (1636 versus 1592, respectively) indicating that mRNA-seq was more sensitive in identifying significantly differentially expressed genes than the cDNA microarray.
The cDNA microarray identified more downregulated genes, while the mRNA-seq platform identified more upregulated genes. Most differentially expressed genes from the cDNA microarray had a fold-change range from 2 to 10, while the differentially expressed genes from the mRNA-seq platform had a greater range in foldchange values, and many more genes with fold changes $20.0.
We next validated 20 significant differentially expressed genes from the mRNA-seq platform using qRT-PCR. Candidate genes were chosen based on conventional criteria [16] such as .2-fold change between conditions, with P#0.05, regardless of the 'age' or passage number (see Table 2). By inspection, the direction and degree of fold-changes were more similar to the qRT-PCR findings for mRNA-seq identified genes than for microarray identified genes. Also, by inspection, there appeared to be systematic underestimation in fold-change values from the cDNA microarray data set, for about half of the validation gene set, namely SOD3, CRIP1, MEST, DKK2, LXN, CCRL1, FAM20A, DYNC1I1, HTR1B, and RASGRP1. Also, 5 genes, namely STC2, RAP1B, DYNC1I1, KIF23 and GPR160 were not statistically different in terms of gene expression between the platforms. There was much better correlation between the mRNA-seq platform and qRT-PCR values (r 2 = 0.794, P = 7.10610 27 , Figure 2A) than between the cDNA microarray and qRT-PCR (r 2 = 0.137, P = 0.12, Figure 2B).

Biological Interpretations of Differentially Expressed Genes
Using GOStat [10] we determined the top 20 overrepresented GO terms based on the total number of genes that were significantly (P,0.05) differentially expressed from microarray and mRNA-seq platforms (Tables 3 and 4, respectively). GO categories identified as significant by microarray tended towards anatomical and organ development and morphogenesis. In contrast, GO categories identified as significant by mRNA-seq data tended towards genes involved in cell cycling and cell division. Together, the findings suggest that differentially expressed genes in the ECO syndrome are found in pathways involved in the proliferation and regulation of cell cycle.
We also evaluated the top GO categories in ECO-affected cells versus unaffected cells using KEGG pathway analysis for the microarray and mRNA-seq data sets (Tables 5 and 6, respectively). Interestingly, although several overrepresented pathways were the same using data from each platform, the most significant pathway -found from the mRNA-seq data -was the cell cycle. Overall, KEGG pathway analysis suggested downstream transcriptional consequences of the germline ICK mutation affect JAK-STAT and Wnt signalling pathways, cell adhesion and cytoskeletal structure, consistent with a role in regulation of cell proliferation.

Discussion
We used two different methods, namely cDNA microarrays and the mRNA deep sequencing platform, to profile transcriptomes of fibroblasts from patients with ECO syndrome due to the homozygous R272Q mutation in ICK. We identified a hyperproliferative phenotype for cultured ECO cells and showed differential expression of genes involved in cell growth and proliferation. We also had a unique opportunity to compare the findings of these two platforms. We found that the mRNA-seq platform was more sensitive in identifying significantly differentially expressed genes than the cDNA microarray platform. Also, correlation with qRT-PCR validation experiments of fold-changes with mRNA-seq was also superior compared to cDNA microarrays. It is interesting to note that results from cDNA microarray and qRT-PCR do not correlate well for the top 20 genes acquired from the mRNA-seq platform data, indicating that although qRT-PCR shows biological differences for these genes, their changes in expression were not appreciated using the cDNA microarray. This also implies that the cDNA microarray platform contains numerous false-negatives, which may lead to inaccurate conclusions about the transcripts expressed in cases versus controls.
Initially transcription profiling studies largely relied on hybridization-based technologies. However with the introduction of mRNA-seq technology, RNA analysis through deep sequencing is achievable on a massive scale. Although the discussion of the advances and challenges of both platforms used here is beyond the scope of this paper, we will briefly address them. The microarraybased approach to study gene expression is high throughput and relatively inexpensive; however it has a limited range of detection due to both background and saturation of signals [17] and seems largely limited in its ability to catalogue and quantify diverse RNA molecules due to the reliance on probes for pre-specified targets [18]. mRNA-seq technology, on the other hand, has highly reproducible results with relatively little technical variation and has the potential to detect and quantify RNAs with low and moderate abundance since this approach digitally counts sequence reads [19]. However, by using sequence reads for RNA quantification, other issues arise; for instance, a small number of very highly expressed genes (7%) accounts for most of the reads (75%) [20]. More specifically in this study RPKM (Reads Per Kilobase of exon model per Million mapped reads), was the unit of measurement used to quantify transcript abundance. However, this unit is biased towards larger genes and ignores the fact that that different isoforms of a gene may be of different lengths [14].
Recently, others have compared the results of deep sequencingand microarray-based transcriptional profiling in a mouse model of cardiomyopathy [21]. As we have now shown with our human transcriptome findings, those authors similarly concluded that mRNA-seq was sensitive and reliable in quantifying lowerabundance genes, which represented the majority of the regulated genes in their model.
We note that skin fibroblasts of the affected individuals were hyperproliferative in culture compared to those from normal individuals, which was consistent with the presumed role of ICK as a human cyclin-dependent kinase 2 (CDK2) member of the MAPK family. MAPKs are regulators of cell cycle and thus of cellular proliferation and apoptosis [22]. The expression experi-  ments, as well as such clinical manifestations of ECO as cleft lip and palate, polydactyly, and dysplastic organs, support a role for ICK as a regulator of cell growth. Functionally, GO analysis showed some overlap between microarray and mRNA-seq data with respect to overrepresented pathways in cells from ECO patients. However, emphasis on biological pathway involvement is based on platform selection, such that cDNA microarray concentrates on pathways with phenotypic relevance to the disorder, while the mRNA-seq platform identifies a higher proportion of upstream genes involved in cell division and DNA replication pathways. It would be of interest to examine transcription profiles in cells of other types and from other tissues in ECO patients. Thus, mRNA-seq discovered more differentially expressed genes and showed better correlation with qRT-PCR than did microarrays in cultured skin fibroblasts from ECO patients. Because of the growing use and accessibility of new genomic technologies for clinical applications, the findings show that results should be carefully interpreted, since different methods can generate very different hypotheses. Further, the findings emphasize the importance of validation of high-throughput genome-wide approaches using an independent method, such as qRT-PCR.