Transcriptome Profiling of Pediatric Core Binding Factor AML

The t(8;21) and Inv(16) translocations disrupt the normal function of core binding factors alpha (CBFA) and beta (CBFB), respectively. These translocations represent two of the most common genomic abnormalities in acute myeloid leukemia (AML) patients, occurring in approximately 25% pediatric and 15% of adult with this malignancy. Both translocations are associated with favorable clinical outcomes after intensive chemotherapy, and given the perceived mechanistic similarities, patients with these translocations are frequently referred to as having CBF-AML. It remains uncertain as to whether, collectively, these translocations are mechanistically the same or impact different pathways in subtle ways that have both biological and clinical significance. Therefore, we used transcriptome sequencing (RNA-seq) to investigate the similarities and differences in genes and pathways between these subtypes of pediatric AMLs. Diagnostic RNA from patients with t(8;21) (N = 17), Inv(16) (N = 14), and normal karyotype (NK, N = 33) were subjected to RNA-seq. Analyses compared the transcriptomes across these three cytogenetic subtypes, using the NK cohort as the control. A total of 1291 genes in t(8;21) and 474 genes in Inv(16) were differentially expressed relative to the NK controls, with 198 genes differentially expressed in both subtypes. The majority of these genes (175/198; binomial test p-value < 10−30) are consistent in expression changes among the two subtypes suggesting the expression profiles are more similar between the CBF cohorts than in the NK cohort. Our analysis also revealed alternative splicing events (ASEs) differentially expressed across subtypes, with 337 t(8;21)-specific and 407 Inv(16)-specific ASEs detected, the majority of which were acetylated proteins (p = 1.5x10-51 and p = 1.8x10-54 for the two subsets). In addition to known fusions, we identified and verified 16 de novo fusions in 43 patients, including three fusions involving NUP98 in six patients. Clustering of differentially expressed genes indicated that the homeobox (HOX) gene family, including two transcription factors (MEIS1 and NKX2-3) were down-regulated in CBF compared to NK samples. This finding supports existing data that the dysregulation of HOX genes play a central role in biology CBF-AML hematopoiesis. These data provide comprehensive transcriptome profiling of CBF-AML and delineate genes and pathways that are differentially expressed, providing insights into the shared biology as well as differences in the two CBF subsets.


Introduction
FLT3/ITD Mutation (N = 14 and 19, respectively). Baseline characteristics of the patients are shown in S1 Table. RNA sequencing in pediatric AML samples RNA sequencing was performed using the Illumina platform for all 64 samples, with an average of 47 million (27,576,175,150) reads per sample. Ninety-six percent of these reads were mapped to the human reference sequence (hg19/NCBI Build 37) using the next-generation sequencing (NGS) aligner Novoalign (www.novocraft.com);~26,000 RefSeq genes were covered by at least one read and~16,500 RefSeq genes had RPKM (Reads Per Kilobase per Million mapped reads) 1 (S2 Table). Ninety percent of these mapped reads were located within gene regions, including coding, UTR, and intronic regions, and the distribution was very similar among different cytogenetic abnormalities (Fig 1).

Identification of differentially expressed genes by RNA sequencing
In order to determine differential gene expression patterns specific to different cytogenetic categories, we performed principal component analysis (PCA) (Fig 2A). The PCA using all genes successfully separated out expression profiles for samples with Inv(16), t(8;21), or NK into three distinct clusters, suggesting that cytogenetic abnormalities profoundly affected geneexpression patterns. Two patients with NK had expression profiles that clustered with those with Inv (16). Closer examination of the two cases demonstrated the presence of CBFB-MYH11 through fluorescence in situ hybridization (FISH) in 22% of the studied metaphases in one case. However, the second case did not show CBFB-MYH11 fusions through FISH or real-time polymerase chain reaction (RT-PCR). The only fusion event shared by these two cases was the intergenic fusion NDRG1-ST3GAL1, which was also found in one t (8;21) sample and in one Inv (16) sample, but not in other NK samples.   [18] to represent the t(8;21)-specific, Inv(16)specific, and normal-specific differentially expressed genes. The track from outside to inside are the symbols of differentially expressed genes with high significance (p-value < 1.0E-08); genome positions by chromosomes (black lines are cytobands); average expression level for the samples with specific cytogenetic abnormalities (yellow); average expression level for the remaining samples (pink); fold change (red: up-regulated; blue: down-regulated); and the p-values associated with the expression patterns between one subtype and the remaining samples. To identify differentially expressed genes specific to each of the cytogenetic cohorts, we performed differential expression analysis using DESeq package [17], which uses a model based on the negative binomial distribution with variance and mean linked by local regression. Comparing t(8;21) samples with the remaining samples, a total of 827 t(8;21)-specific genes were found to be differentially expressed with an adjusted p-value (multiple testing using the Benjamini-Hochberg method) of less than 0.05 ( Fig 2B). Among these, 365 genes were up-regulated, with the RUNX1T1 gene most significantly up-regulated (p = 2.21x10 -31 ; Table 1). RNA-seq reads were uniquely mapped into the entire coding regions of RUNX1T1 for the 17 samples with t (8;21), with very few reads mapping to this gene in patients with Inv (16) or NK (S1 Fig). Additionally, 462 genes were down-regulated in samples with t(8;21), with the RFX8 gene being the most under-expressed (p = 7.18 x 10 −20 ). Eight of the top 20 under-expressed genes belonged to the HOX family (Table 1).
Similarly, AML samples with Inv(16) displayed 279 genes that were differentially expressed as compared to the remaining samples, with 181 of these genes up-regulated and 98 down-regulated at the adjusted p-value of 0.05 (Fig 2C). Matrix metallopeptidase 14 (membraneinserted) (MMP14) mRNA was most significantly up-regulated (p = 7x10 -22 ). Conversely, collagen type XXIII, alpha 1 (COL23A1) mRNA was the most significantly down-regulated. Three hundred and eighty normal-specific genes were also found (Fig 2D), indicating a widespread presence of differentially expressed genes among different cytogenetic abnormalities.
RUNX1 binding sites were enriched in differentially expressed genes RUNX1 has shown to play a crucial role in haematopoiesis during embryonic development [18] and the two subunits of the core binding factors (CBFs), i.e., CBFA and CBFB, have been suggested to modify the transcriptional regulator functions of AML by either altering the normal RUNX1 transcription program, interfering with the RUNX1 assembly, or recruiting histone deacetylases and inhibiting the RUNX1 activity [19][20][21]. To study the expression of the RUNX1 targeted genes in the three cytogenetic categories, RUNX1 ChIP-Seq data in the ME-1  Table).

Genes commonly expressed in CBF AML
Because those cases referred to collectively as CBF AML share a common biology, clinical presentation, and outcome, we inquired whether the two cohorts also shared an expression profile.
To this purpose, we detected differentially expressed genes in t(8;21) and Inv(16) using NK cohort as the control. Of the total of 1567 genes that are differentially expressed in all CBF AML cases [1291 in t(8;21) and 474 in Inv (16)], compared to samples with normal karyotype (NK), 198 differentially expressed genes are shared by the two subtypes in CBF AML ( Fig 3A): 87 of these genes are up-regulated, 88 are down-regulated in both subtypes (S4 Table), while another 23 genes have opposing expression profiles (down-regulated in one subtype but upregulated in the other). More genes share expression profiles between these two subtypes (175 / 198; binomial test p-value < 10 −30 ) than do not. Furthermore, the 88 shared down-regulated genes include many HOX genes and are enriched in genes involved in morphogenesis, specifically embryonic skeletal system development ( Fig 3B). In contrast, no gene sets are significantly enriched for the 87 shared up-regulated genes.

Expression profile in samples with normal karyotype (NK)
Evaluation of distinct expression profiles of those with NK from those with t(8;21) or Inv (16) identified 175 significantly up-regulated and 205 down-regulated genes. We further studied the expression profiles for those with and without FLT3/ITD mutation in the NK cohort to determine whether FLT3/ITD mutation is associated with a specific expression pattern. Although the expression signature for those with NK was distinct from those with CBF AML, expression PCA failed to define a distinct expression profile for those with and without FLT3/ITD mutation ( Fig 4A). HOXB7 was the only gene whose expression was significantly associated with FLT3/ITD mutation (adjusted p-value 0.034) ( Fig 4B) with an adjusted p-value < 0.05.

Co-expressed genes define gene networks
Clustering of differentially expressed genes for each cytogenetic abnormality indicates the existence of co-expressed gene networks ( To study the functionality of these networks, we calculated the correlation coefficient (R) for each pair of differentially expressed genes in each subtype and used Cytoscape [22] to identify co-expressed genes with R 2 > 0.6 and to determine subsets of the co-expressed gene networks with specific molecular functions ( Fig 5B, S2C and S2D Fig). Up-regulated genes and down-regulated genes were clustered into different groups for the t(8;21)-specific differentially expressed genes. A sub-group containing 39 genes, located in the group with down-regulated genes, is enriched in the homeobox (HOX) gene family ( Fig 5B). The group includes two HOX gene clusters on human chromosomes 7p15 (HOXA) and 17q21 (HOXB), the HOX cofactor myeloid ecotropic viral integration site 1 (MEIS1) and the NK2 homeobox 3 (NKX2-3). MEIS1 is a common leukemic collaborator [23] and NKX2-3 is a homeobox transcription factor [24]. All of these HOX genes were down-regulated in the samples with t(8;21) ( Fig 5C). Most HOX genes were also downregulated in the samples with Inv(16) except for HOXB2, HOXB3, HOXB4 and MEIS1. Furthermore, although samples with NK had higher expression levels in the HOX gene family, most genes in the HOXB gene cluster and NKX2-3 had even higher expression levels when they contained the FLT3-ITD mutation. Several immunoglobulin-related gene families, including the Leukocyte Immunoglobulinlike Receptor (LIR) gene family and the Immunoglobulin Heavy (IGH) gene family, were also enriched in down-regulated genes for the t(8;21) AML samples, while no molecular processes or functions were enriched in up-regulated genes for the t(8;21) AML samples.

Alternative splicing is common in AML and is affected by acetylation
In addition to evaluating differential expression patterns, we assessed alternative splicing among the three cytogenetic cohorts. We used the Multivariate Analysis of Transcript Splicing (MATS) application [25] to identify alternative splicing characteristic of samples from the cytogenetic subtypes. MATS is a computational tool that uses a statistical model with multivariate uniform prior to detecting differential alternative splicing events using RNA sequencing data.  . The co-expression gene networks were generated using Cytoscape 2.8.3 [19]. Node color is based on the fold change of the differentially expressed gene (red: up-regulated; green: down-regulated), and node size corresponds to the degree of the node (i.e., the number of edges incident to it). (C) Gene expression of the HOX gene family for three types of cytogenetic abnormalities, where NK is separated into two groups based on the mutation of FLT3/ITD. In our study, 337 t(8;21)-specific, 407 Inv(16)-specific, and 272 NK-specific alternative splicing events were detected. Skipped exon (SE), mutually exclusive exons (MXE) and retained intron (RI) seemed to be the predominant alternative splicing events in pediatric AML samples (Fig 6A and 6B). Furthermore, MATS separated all alternative splicing events into inclusion or skipping groups based on whether the alternative exon was included or skipped in the samples. Samples with t(8;21) and NK tended to include the alternative exons (Fig 6A), while samples with Inv(16) were likely to skip them (Fig 6B). 216 genes were affected by t(8;21)-specific alternative splicing events (Most significant events involving genes including RAB10, SERF2, HNRNPC, HNRNPD, HNRPDL, HINT1, NACA, PABPC1, RPL10, RPS12, RPS27, ARPC3, EIF1, STMN1, and ARPC4). 233 and 158 genes were also affected by Inv(16)-specific and normal-specific alternative splicing events, respectively. Gene Set Enrichment Analysis (GSEA) indicated that the majority of the affected genes are acetylated proteins: 127 genes (59%; p-value = 1.5E-51) for t(8;21)-specific; 135 genes (58%; p-value = 1.8E-54) for Inv(16)-specific; and 98 genes (62%; p-value = 1.2E-42) for normal-specific, and are enriched in the KEGG pathway ribosome.

Identification of fusion transcripts in pediatric AML samples
RNA sequencing has also been successfully used to identify gene-fusion events in cancers [26,27], and many computational tools have been developed to detect these events [11,[28][29][30]. Most detection methods, however, have proved problematic because of high false-positive rates [31,32]. We used four gene-fusion detection methods-Defuse [28], Tophat-Fusion [29], FusionMap [30] and Snowshoes-FTD [11] to identify gene-fusion events in our pediatric AML samples. The number of putative fusion events identified ranged from 300 to more than 2000 for each detection method, while only a few fusion events (2%-5%) overlapped between any two methods (Fig 7A). To reduce the high rate of false positives, only 69 putative fusion events (Fig 7B; S5 Table), identified by at least two detection methods or by one method with a Chi-merDB hit [33], were used in our study. ChimerDB is a knowledgebase of fusion genes identified using bioinformatics analysis of transcript sequences based on various public resources, including GenBank, the Sanger Cancer Genome Project (CGP), OMIM, PubMed, and the Mitelman database. Fifty-one of the 69 putative fusion events (74%), were intra-chromosomal (Fig 7B), and the remaining 18 (26%) were found in inter-chromosomal junctions. Fifty-nine of the 69 identified fusions involved the coding regions of the affected gene (S4 Table). Eight putative fusion events were found in ChimerDB and six of them were previously reported in AML [34][35][36][37], suggesting that the combination of multiple gene-fusion detection methods and ChimerDB can accurately identify fusion events. The CBFB-MYH11 fusion event was identified  coverage, these cases did not meet the statistical threshold for identification. In addition, RUNX1-RUNX1T1 transcript fusions were identified in all 17 samples with clinically annotated t(8;21) (Fisher's exact test; p-value = 2.23E-16), further suggesting that the identification of putative fusion events was accurate.
There were 119 genes involved in the 69 fusion events and gene set enrichment analysis of these 119 genes using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [38] (Huang et al. 2009) indicated that these fusion genes are enriched in genes that code for proteins which are post-translationally modified by the attachment of at least one methyl, phosphate, or acetyl group.

Discussion
Fusion transcripts resulting from genomic translocations between RUNX1-RUNX1T1 in t (8;21) and CBFB-MYH11 in Inv (16), collectively referred to as CBF AML, have similar clinical outcomes, but the similarities and differences between the two entities have not been studied in detail. We investigated the transcriptome profiles of specimens from children with AML characterized by t (8;21) or Inv (16), and also from a third subset of patients with normal karyotype (NK) in order to define transcript-expression patterns as well as isoforms and pathways that are differentially expressed in these genomic subsets and to delineate the similarities and differences in these groups of patients. Hundreds of differentially expressed genes were found in each cohort indicating a widespread presence of differential expression among different cytogenetic abnormalities. Using the NK cohort as our control, we established expression profiles for each of the two subtypes [i.e., t(8;21) and Inv (16)] in CBF AML to define genes whose expression patterns are shared or differ between the two subtypes. A large number of differentially expressed genes were identified for the subsets of t(8;21) and Inv(16) compared to the NK cohort. We then demonstrated that the majority of the shared differentially expressed genes (175 / 198; binomial test p-value < 10 −30 ) between t(8;21) and Inv (16) are consistent in the two subtypes (genes are up-or down-regulated in both subtypes) and these genes are enriched for a number of HOX genes involved in morphogenesis, specifically the development of the embryonic skeletal system. Additionally, each of the subgroups were enriched for differentially expressed genes that contained RUNX1 binding sites; 72.8% of genes in the t(8;21) samples; 73.8% of genes in the inv(16) samples and 69.0% of genes in the normal samples.
Besides karyotype-specific expression patterns, alternative splicing patterns among the cytogenetic cohorts were assessed. We identified a large number of specific alternative splicing events associated with each cytogenetic subset. Our data demonstrated that several types of splicing exist in our cohorts, with skipped exon (SE), mutually exclusive exons (MXE), and retained intron (RI) predominating. This suggests that splicing patterns in AML could be karyotype-specific. Different splicing events can modulate gene function by introducing or deleting functional gene domains or silencing genes by causing frame-shifts or introducing early-termination codon. Defining the functional isoforms that would be translated from those destined for nonsense-mediated decay is likely to prove critical in distinguishing biologically significant spliced-gene products from non-functional ones. Additionally, we demonstrated that the protein products of a significant majority of alternatively spliced genes are determined to be substrates of post-translational acetylation. Previous studies have demonstrated that acetylation strongly influences alternative splicing [39,40], implicating acetylation in the karyotypespecific differential expression of alternative splicing. Gene set enrichment analysis of the genes altered by splicing indicated varied members of the KEGG pathway ribosome are involved. This is consistent with previous studies that show the fusion proteins, RUNX1-RUNX1T1 and CBFB-MYH11, to regulate ribosomal RNA transcription [41,42]. The dysregulation of these processes is thought to aid in the fusion proteins' role in altering differentiation, proliferation and disrupting normal hematopoiesis.
We further assessed the presence of fusion transcripts found in our study population. Given the high false-positive rate for most current gene-fusion detection methods [10,31], our integration of four distinct gene-fusion detection methods with ChimerDB, a knowledgebase of fusion genes, allowed us to identify fusion transcripts more accurately than is possible with approaches that use a single method. (We accurately identified 29 out of 31 cases for those known fusion events (CBFB-MYH11 and RUNX1-RUNX1T1) in CBF AML.) In addition to identifying the 29 known fusion transcripts, we detected an additional 258 fusion transcripts (67 fusion events in 115 genes). Fusions of high interest include those involving NUP98, a nucleoporin gene that encodes a building block of the nuclear pore complex which mediates the transport of mediators of cellular function across the nuclear membrane. Three fusion variants of NUP98 were identified (NUP98-NSD1, NUP98-HOXD13, and NUP98-HMGB3). NUP98-NSD1 has been shown to co-occur with the mutation of FLT3/ITD and to be strongly associated with adverse outcomes [43], though NUP98-HOXD13 and NUP98-HMGB3 are lessknown variants whose prevalence and clinical implications are yet to be determined. Further, we identified a large number of intra-chromosomal fusions whose true functionality needs be evaluated. Some of these fusions may result from transcriptional read-throughs that may or may not be functionally significant. Studies of the protein products and functional significance of these lesions is ongoing. Our study also indicates that the identified fusion transcripts are enriched in genes that encode proteins undergoing post-translational modification by acetylation, methylation, or phosphorylation, suggesting potential functional implications. A previous study has shown that chromatin proteins and metabolic enzymes are highly represented in acetylated, methylated, or phosphorylated proteins [44], suggesting that gene fusions may profoundly affect gene expression and metabolism in pediatric AML subtypes.
This study also highlights the significance of homeobox (HOX) genes in CBF AML. Dysregulation of HOX genes contributes to the perturbation of normal hematopoiesis [45,46], and the overexpression of HOX genes in hematopoietic cells can contribute to leukemogenesis [47,48]. Our results demonstrate that the expression levels of all HOX genes were down-regulated in the t(8;21) subtype, whereas in contrast, four HOX genes (HOXB2, HOXB3, HOXB4 and MEIS1) had much higher expression levels in the Inv (16) subtype than that in the t(8;21) subtype. The potential implication of differential HOX expression in CBF AML subtypes may cooperate with the leukemogenic potential of the two fusion events (CBFB-MYH11 and RUNX1-RUNX1T1) [49]. Furthermore, given our observations of high HOX expression in patients with FLT3/ITD mutation and previous reports of association of elevated HOX expression with adverse outcomes [8,50], the hypothesis that HOX expression may mediate the evolution of resistance should be considered.
This study provides comprehensive transcriptome profiles for CBF AML subtypes alpha [t (8;21)] and beta [Inv (16)]. It delineates differential gene-expression profiles, transcript splice isoforms, and fusion transcript profiles for CBF AML subtypes, and it also identifies specific genes and pathways that may provide targets for therapeutic intervention.

Materials and Methods
Pediatric AML samples 64 diagnostic samples derived from either bone marrow (n = 59) or peripheral blood (n = 5) were used in this study. All samples were obtained by written consent from the parents/guardians of minors from three consecutive Children's Oncology

RNA preparation and sequencing
Genetic material from AML specimens was extracted using AllPrep DNA/RNA Mini Kits (Qiagen, Valencia, CA). At Hudson Alpha Institute (Huntsville, AL), 1 μg of high-quality total RNA was used for the conversion of mRNA into a cDNA library of template molecules based on mRNA capture with poly(T) magnetic beads, fragmentation, and reverse transcription to first-strand cDNA with reverse transcriptase and random primers using Illumina's TruSeq RNA Sample Prep kit (Illumina, San Diego, CA) according to the manufacturer's instructions. After adaptor ligation, each cDNA library was purified and enriched by PCR amplification; the final average fragment size, including adaptors, was 280 bases. Each library was then subjected to 50-cycle paired-end sequencing on the Illumina HiSeq, with four samples multiplexed into each flow cell lane.

Alignment of RNA-sequencing reads to the human genome
Paired-end RNA-sequencing reads were aligned to the human reference genome (hg19/NCBI Build 37). Both the human reference genome and the splicing junction sequences were combined to form the reference sequences using the USeq MakeTranscriptome program [51], and RNA-sequencing reads were aligned to the whole genome and splice junctions using Novoalign (Novocraft 2010). Novoalign used a structural-variation penalty to determine whether pairedend reads should be reported when they did not form proper fragments. Finally, the aligned reads were sorted and indexed using SAMTools [52] and were stored in the SAM/BAM format. All BAM files have been deposited at The database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/gap) under substudy, phs000465.v10.p3, TARGET: Acute Myeloid Leukemia (AML).

Identification of differentially expressed genes
Aligned reads were annotated using the HTSeq package (http://www-huber.embl.de/users/ anders/HTSeq/). We used the HTSeq-count program to calculate the number of reads mapped in each gene based on the Homo_sapiens.GRCh37.69.gtf annotation file downloaded from ensembl.org. A union overlap resolution mode was used to remove ambiguous reads. DESeq [17] was used to calculate the p-value among samples with different cytogenetic abnormalities for each gene. We applied the Benjamini-Hochberg procedure [53] to correct multiple testing and reported genes with an adjusted p value < 0.05 as differentially expressed genes (S6 Table).

Gene set enrichment analysis and visualization
We used DAVID [38] to perform gene set enrichment analysis (GSEA) in order to associate molecular functions with the set of differentially expressed genes as well as with sets of alternative splicing and fusion genes. Furthermore, we used OmicCircos [54] and Cytoscape [22] to visualize the results of the analysis.

Detection of alternative splicing among different cytogenetic abnormalities
We used the MATS program to detect five distinct alternative splicing events. Putative alternative splicing events were identified from the RNA-sequencing data using the annotation file that HTSeq had downloaded from ensembl.org. TopHat [55] was used to identify alternative splicing events, MATS was used to calculate the p-value for each alternative splicing event, and the false-discovery rate (FDR) control was applied to find differential alternative splicing events among samples with distinct cytogenetic abnormalities (S7 Table).

Identification of fusion events
Four gene-fusion detection methods-Defuse [28], Tophat-Fusion [29], FusionMap [30] and Snowshoes-FTD [11] were used to identify gene-fusion events in the pediatric AML samples. Fusion events identified by more than one methods were chosen as putative fusion events (S8 Table). Moreover, a knowledgebase of fusion genes, ChimerDB [33], was used to include those fusion events, which were only detected by a gene-fusion detection method. Visualization of fusion events was created using an in-house Perl program, which is based on the GD graphics library and uses UCSC hg19 known gene as gene and exon reference.

RT-PCR validation for putative fusion events
RNA was reverse-transcribed using Thermo Scientific's Maxima H Minus First Strand cDNA Synthesis Kit (Thermo Fisher Scientific, Pittsburgh, PA). The resulting cDNA was used in PCR amplification of fusion junctions with primers listed in S9 Table. Fusion transcripts were verified by Sanger sequencing.  (16)-specific and normal-specific differentially expressed genes. (C-D) Co-expression gene networks for Inv(16)-specific and normal-specific differentially expressed genes. Co-expressed genes were determined based on the coefficient of determination (R 2 > 0.6). The co-expression gene network was generated using Cytoscape 2.8.3 (Smoot et al. 2011). Node color is based on the fold change of the differentially expressed gene (red: up-regulated; green: down-regulated) and node size corresponds to the degree of the node (the number of edges incident to the node). (TIF) S1