
Bayesian identification of differentially expressed isoforms using a novel joint model of RNA-seq data

  • Xu Shi,

    Roles Formal analysis, Methodology, Software

    Affiliation Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America

  • Xiao Wang,

    Roles Formal analysis, Methodology, Software

    Affiliation Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America

  • Lu Jin,

    Roles Formal analysis

    Affiliation Hormel Institute, University of Minnesota, Austin, Minnesota, United States of America

  • Leena Halakivi-Clarke,

    Roles Funding acquisition, Investigation

    Affiliation Hormel Institute, University of Minnesota, Austin, Minnesota, United States of America

  • Robert Clarke,

    Roles Formal analysis, Funding acquisition, Investigation, Writing – review & editing

    Affiliation Hormel Institute, University of Minnesota, Austin, Minnesota, United States of America

  • Andrew F. Neuwald,

    Roles Writing – review & editing

    Affiliation Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland, United States of America

  • Jianhua Xuan

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing

    xuan@vt.edu

    Affiliation Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, Virginia, United States of America

Abstract

We develop a Bayesian approach, BayesIso, to identify differentially expressed isoforms from RNA-seq data. The approach features a novel joint model of the sample variability and the differential state of isoforms. Specifically, the within-sample variability and the between-sample variability of each isoform are modeled by a Poisson-Lognormal model and a Gamma-Gamma model, respectively. Using a Bayesian framework, the differential state of each isoform and the model parameters are jointly estimated by a Markov Chain Monte Carlo (MCMC) method. Extensive studies using simulation and real data demonstrate that BayesIso can effectively detect isoforms that are less differentially expressed, as well as differential transcripts of genes with multiple isoforms. We applied the approach to breast cancer RNA-seq data and uncovered a unique set of isoforms that form key pathways associated with breast cancer recurrence. First, PI3K/AKT/mTOR signaling and PTEN signaling pathways are identified as being involved in breast cancer development. Further integrated with protein-protein interaction data, pathways of Jak-STAT, mTOR, MAPK and Wnt signaling are revealed in association with breast cancer recurrence. Finally, several pathways are activated in the early recurrence of breast cancer. In tumors that recur early, members of pathways of cellular metabolism and cell cycle (such as CD36 and TOP2A) are upregulated, while immune response genes such as NFATC1 are downregulated.

Author summary

A Bayesian approach, BayesIso, is developed to identify differentially expressed isoforms from RNA-seq data. The approach features a novel joint model of the sample variability and the differential state of isoforms; the differential state can be estimated by a Markov Chain Monte Carlo (MCMC) method. Extensive studies using simulation and real data demonstrate that BayesIso can effectively detect isoforms that are less differentially expressed, as well as differential transcripts of genes with multiple isoforms. We applied the approach to breast cancer RNA-seq data and uncovered differential isoforms that form key pathways associated with breast cancer recurrence.

1. Introduction

RNA sequencing (RNA-seq) [1–3] is an important technique for transcriptome analysis of cancer cells. It allows exploration of the transcriptome at the resolution of individual bases, and, with millions of transcript reads, can quantify gene expression with high accuracy. As the cost continues to decrease, more tumor samples will likely be profiled by RNA-seq than by other current techniques.

A key aspect of cancer research is the detection of differentially expressed transcripts (or isoforms) among different types or sub-types of cancer cells [4–6]. While RNA-seq has the advantage of wide coverage and high resolution, many challenges remain for transcriptome analysis, such as the uncertainty of read assignments and the variability of RNA-seq data. For example, because genes often express multiple transcripts (isoforms), many of which share exons, some reads cannot be assigned unequivocally to a specific isoform. Variability in RNA-seq data can arise, for example, from transcript length bias, library size bias (the total number of sequenced reads in each sample), and sequencing GC-content and random hexamer priming biases [7–10]. Our investigation into variability within cancer RNA-seq data reveals that bias patterns exist but cannot yet be fully explained by known sources. Moreover, different gene transcripts may exhibit different bias patterns with varying levels of complexity [11].

Currently, differential analysis of RNA-seq data mainly focuses on between-sample variability, which models the variability among biological samples in the same group [12]. Several statistical methods, such as DESeq [13], edgeR [14], EBSeq [15] and DSS [16] perform differential analysis of RNA-seq data at the gene level using statistical models, such as negative binomial distribution, to account for the variability among samples in a phenotype group. These read-count-based methods aim to improve the overall fitting of count data and the robustness against outliers. Cuffdiff 2 [6] is one of the most popular tools for differential analysis of RNA-seq data at the isoform (or transcript) level. BAM files (the binary version of sequence aligned data) are used as input and a beta negative binomial distribution accounts for the between-sample variability and read-mapping ambiguity. Cuffdiff 2 first estimates isoform expression and then detects differentially expressed isoforms based on a statistical test. It is overly conservative for detecting differentially expressed isoforms, many of which it misses [17]. Ballgown, which works together with Cufflinks [10], improves detection by flexibly selecting several statistical models [17].

Although the above-mentioned approaches have proved useful, they fail to adequately model the within-sample variability (i.e., the variability along genomic loci) of differentially expressed isoforms, which should also be taken into consideration, as we previously demonstrated [11]. Indeed, the within-sample variability is more critical at the isoform level than the between-sample variability. A gene's expression is the sum of multiple isoforms, which increases bias complexity at different locations along the gene. Cuffdiff 2 addresses within-sample variability by incorporating a fragment bias model [18] that estimates positional and sequence-specific biases in isoform expression. However, the parameters of this model are estimated globally by assuming that transcripts of similar lengths have the same positional bias; this assumption is insufficient to account for the complex nature of observed within-sample variability. Moreover, after estimating isoform expression, Cuffdiff 2 applies a statistical test to detect differentially expressed isoforms, many of which may not reach statistical significance due to the large number of transcripts under consideration. Therefore, joint modeling of the variability of RNA-seq data and of the differential states of isoforms is needed for differential isoform identification.

Here, we describe and apply BayesIso, a Bayesian approach to differential analysis of RNA-seq data at the isoform level. BayesIso is based on a joint model of both the variability of RNA-seq data and differential states of isoforms. Specifically, BayesIso uses a Poisson-lognormal distribution [19] to model within-sample variability and distinct Gamma distributions [20] to model the differential expression of isoforms while accounting for the between-sample variability. Importantly, this models the within-sample variability as isoform specific and the dispersion of read counts for exons using isoform-specific parameters. Model parameters and differential states of isoforms are estimated jointly using a Markov Chain Monte Carlo (MCMC) procedure [21,22]. Simulation studies demonstrate that BayesIso significantly improves the identification of differentially expressed isoforms, especially on isoforms with moderately differential abundance.

When applied to breast cancer RNA-seq data, BayesIso identifies differentially expressed isoforms enriched in cell death, cell survival, and signaling pathways, and associated with breast cancer recurrence. Examination of differential isoforms uniquely identified by BayesIso reveals PI3K/AKT/mTOR signaling and PTEN signaling pathways responsible, at least in part, for the development of breast cancer recurrence, and a large protein-protein interaction network associated with Jak-STAT and Wnt signaling. These findings, along with further analyses of expression data, for proteins such as PIK3R2, AKT2, HSP90 and NFATC1, point to a role for isoforms in driving breast cancer recurrence.

2. Results

2.1 The BayesIso approach

An overview of the BayesIso approach is shown in Fig 1. As observed from real data, there are two types of variability in RNA-seq data. The first type is ‘within-sample variability’: for an RNA-seq sample, there is a high variance of read counts arising mostly from sequencing bias along the genome. The second type is ‘between-sample variability’: among biological replicates or samples, the variance of read counts at a genomic locus is much higher than the value expected under a Poisson model (i.e., the mean). A Poisson-Lognormal distribution models within-sample variability by representing isoform- and position-specific biases along the genome for each locus. A Gamma-Gamma model is used to model the differential isoform abundances of multiple samples between two classes. Specifically, differential states of the isoforms are introduced in this Gamma-Gamma model as hidden variables that control the differential isoform abundances of the samples between two classes. The joint model, in which the Poisson-Lognormal model and the Gamma-Gamma model work together, can account for the between-sample variability in addition to the within-sample variability.

Fig 1. Framework of the BayesIso approach.

BayesIso features a joint probabilistic model to take into account the variability in RNA-seq data and differential states of isoforms simultaneously. Specifically, a Poisson-Lognormal model is used to account for within-sample variability; a Gamma-Gamma model is used to model the isoform abundance of multiple samples (accounting for between-sample variability), with the differential states of isoforms embedded as hidden variables. Finally, a Markov Chain Monte Carlo (MCMC) sampling algorithm is developed to estimate all of the model parameters and the posterior probability of the differential states.

https://doi.org/10.1371/journal.pcbi.1012750.g001

Based on the joint model, a Bayesian approach is used to estimate the posterior probability of the differential state of isoforms (the hidden variable). Since the joint model is defined by a set of parameters, a Markov Chain Monte Carlo (MCMC) sampling algorithm is used to estimate the parameters and the posterior probability of the hidden variable. The MCMC sampling process combines Gibbs sampling [23] and Metropolis-Hastings sampling [24], generating samples from the conditional distributions. The (marginal) posterior distributions of the parameters and the hidden variable are then approximated by the samples drawn from the MCMC procedure. More details about the BayesIso approach can be found in the Methods section.
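To illustrate how Gibbs and Metropolis-Hastings steps can be interleaved, the following is a minimal sketch on a toy model, not the BayesIso sampler itself. It assumes counts drawn from a Poisson distribution with mean exp(u), a normal prior on u with mean 0 and precision tau, and a Gamma(a, b) prior on tau; the function names, hyperparameters, and step size are all hypothetical choices made for the example.

```python
import math
import random
from typing import List, Tuple


def log_target_u(u: float, counts: List[int], tau: float) -> float:
    """Log conditional density of u up to a constant (toy model).

    Poisson log-likelihood with mean exp(u) plus the N(0, 1/tau) log-prior.
    """
    log_lik = sum(y * u - math.exp(u) for y in counts)
    return log_lik - 0.5 * tau * u * u


def metropolis_within_gibbs(counts: List[int], n_iter: int = 5000,
                            step: float = 0.3, a: float = 1.0,
                            b: float = 1.0, seed: int = 0
                            ) -> List[Tuple[float, float]]:
    """Alternate a Gibbs update for tau with a random-walk MH update for u."""
    rng = random.Random(seed)
    u, tau = 0.0, 1.0
    samples = []
    for _ in range(n_iter):
        # Gibbs step: tau | u is Gamma(shape a + 1/2, rate b + u^2 / 2).
        # random.gammavariate expects (shape, scale), so pass 1 / rate.
        tau = rng.gammavariate(a + 0.5, 1.0 / (b + 0.5 * u * u))
        # Metropolis-Hastings step: symmetric Gaussian random-walk proposal.
        proposal = u + rng.gauss(0.0, step)
        log_ratio = (log_target_u(proposal, counts, tau)
                     - log_target_u(u, counts, tau))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            u = proposal
        samples.append((u, tau))
    return samples
```

After discarding an initial burn-in portion, the retained draws of u and tau approximate their marginal posterior distributions; a posterior mean, for example, is estimated by averaging the retained draws.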

2.2 Identifying isoforms associated with breast cancer recurrence

We applied BayesIso to breast cancer data acquired by The Cancer Genome Atlas (TCGA) project [25]. The study was designed to identify the differentially expressed isoforms associated with breast cancer recurrence. Ninety-three estrogen receptor positive (ER+) tumors were collected for this study. Sixty-one patients were still alive with follow-up longer than 5 years and were labeled as ‘Alive’; 32 patients died within 5 years and were labeled as ‘Dead’. The histogram of the survival time is shown in S1 Fig. The ‘Dead’ and ‘Alive’ groups represent the ‘early recurrence’ group and the ‘late/non recurrence’ group, respectively.

We downloaded the sequencing data (Level 1) profiled by Illumina HiSeq 2000 RNA Sequencing Version 2 from the TCGA data portal, and then performed alignment using TopHat 2 (TopHat v2.0.12) with UCSC hg19 as the reference sequence. With the isoform structure annotation file (RefSeq genes) downloaded from the UCSC genome browser database [26], we applied our method to identify differentially expressed isoforms by analyzing samples from the ‘Dead’ group vs. the ‘Alive’ group. As observed, our model captures various bias patterns along the genomic location (S2 Fig). While the overall bias pattern of all isoforms is high in the middle, different isoform subgroups show varying bias patterns. With a threshold set to Probability > 0.75, 2,299 isoforms of 1,905 genes are identified as being differentially expressed. The histogram of the estimated probability that the isoforms are differentially expressed is shown in S3 Fig. We also calculate the SNR of the identified differentially expressed isoforms. S4 Fig shows that the SNR has a mode value around −5 dB, indicating that most of the identified isoforms are moderately differentially expressed. The low SNR values are consistent with the high variability of expression level observed across the samples. Thus, the detection power on moderately differential isoforms is critical for differential analysis of breast cancer RNA-seq data.
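The thresholding step can be sketched as follows, assuming the posterior probability of each isoform is approximated by the fraction of retained MCMC draws in which its binary differential state equals 1. The function names, the isoform-to-gene mapping, and the data layout are hypothetical; only the 0.75 cutoff comes from the text.

```python
from typing import Dict, List, Sequence, Set, Tuple


def posterior_probability(state_draws: Sequence[int]) -> float:
    """Fraction of retained MCMC draws with differential state equal to 1."""
    if not state_draws:
        raise ValueError("at least one draw is required")
    return sum(state_draws) / len(state_draws)


def call_differential(probs: Dict[str, float],
                      isoform_to_gene: Dict[str, str],
                      threshold: float = 0.75
                      ) -> Tuple[List[str], Set[str]]:
    """Return isoforms whose probability exceeds the threshold, and their genes."""
    isoforms = sorted(iso for iso, p in probs.items() if p > threshold)
    genes = {isoform_to_gene[iso] for iso in isoforms}
    return isoforms, genes
```

Counting the returned isoforms and genes separately gives the two numbers reported for a given cutoff, since one gene may contribute several differential isoforms.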

2.3 Key pathways associated with breast cancer recurrence

We compared BayesIso with Cuffdiff 2 and Ballgown in terms of identified differential genes. Differential genes are defined as genes with at least one differentially expressed isoform. With the criterion of p < 0.05 for Cuffdiff 2 and Ballgown, 1,719 and 5,399 genes, respectively, are identified as differential genes. Fig 2(a) shows the overlap and difference of the gene sets identified by the three methods. Cuffdiff 2 detects fewer differential genes than Ballgown does. Among the differential genes identified by BayesIso, 30% are uniquely identified by our method when compared with those identified by Cuffdiff 2 and Ballgown. The unique set of differential genes reveals several signaling pathways, including the PI3K/AKT/mTOR signaling and PTEN signaling pathways (Fig 2(b)). Fig 2(b1) shows the PI3K/AKT/mTOR signaling pathway, the hyperactivation of which is known to be associated with tumorigenesis in ER positive breast cancer [27,28]. Moreover, PI3K and AKT are among the most commonly mutated genes in this breast cancer subtype; PIK3R2, a regulatory subunit member of the PI3K protein family, is detected by BayesIso as down-regulated in the ‘Dead’ group. The loss of expression of PIK3R2 is crucial to the hyperactivation of the PI3K/AKT/mTOR signaling pathway by regulating AKT2. AKT2 dysfunction inhibits the expression of TSC1 and TSC2 and activates mTOR signaling, as indicated by the overexpression of RPS6KB1, a downstream target of mTOR. Overexpression of TSC2 and RPS6KB1 is further validated by their protein/phosphoprotein expression measured by reverse phase protein array (RPPA) on a subset of the TCGA breast cancer samples comprising 45 samples in the ‘Alive’ group and 27 samples in the ‘Dead’ group. Specifically, expression of NM_001114382, a differentially expressed isoform of TSC2, is positively correlated with its phosphoprotein expression at pT1462 (p = 0.02). Expression of NM_001272044, a differentially expressed isoform of RPS6KB1, is positively correlated with its phosphoprotein expression at pT389 (p = 0.0081). Note that the FPKM (expression) of NM_001272044 estimated by Cuffdiff 2 is not correlated with its phosphoprotein expression, indicating that the isoform expression estimated by BayesIso is more consistent with protein expression than that estimated by Cuffdiff 2. The network shown in Fig 2(b2) reveals part of the PI3K/AKT signaling pathway leading to cell cycle progression. FN1 and ITGA2 are uniquely detected by BayesIso, which correlates with the overexpression of CCNE2 in the ‘Dead’ group. Total protein expression values for FN1, ITGA2, and CCNE2 are highly correlated with the estimated expression of their respective isoforms. Fig 2(b3) shows part of the PTEN signaling pathway, the underexpression of which results in hyperactivation of PI3K/AKT signaling in breast cancer [29,30]. While the mRNA expression of PTEN is not differential, total protein has a much lower expression level in the ‘Dead’ group, as shown in the boxplots. BayesIso also detects SHC1, GRB2, and BCAR1, three critical components in PTEN signaling.

Fig 2. Key pathways associated with breast cancer that are uniquely identified by BayesIso.

(a) Venn diagram of identified differential genes (genes with differentially expressed isoforms) by the three methods: BayesIso, Cuffdiff 2, and Ballgown. (b) Three networks of differential genes detected by BayesIso: b1 – a network related to the PI3K/AKT/mTOR signaling pathway; b2 – a network related to cell cycle progression in the PI3K/AKT signaling pathway; b3 – a part of the PTEN signaling pathway. The color of nodes represents the expression change between the two phenotypes: green means down-regulated in the ‘Dead’ group; red means up-regulated in the ‘Dead’ group. Genes marked by a bold circle or underlined are uniquely detected by BayesIso. Genes marked by a yellow star have consistent protein/phosphoprotein expression. (c) Enrichment analysis of the three networks using time-course data from the E2-induced MCF-7 breast cancer cell line (collected at 10 time points: 0, 5, 10, 20, 40, 80, 160, 320, 640, 1280 minutes, with one sample at each time point): left – enrichment analysis of the three networks; right – expression of transcripts with significant pattern change.

https://doi.org/10.1371/journal.pcbi.1012750.g002

We validated the identified transcripts in the three networks (Fig 2(b)) using data from a time-course of estrogen (E2) induced transcription in MCF-7 breast cancer cells (RNA-seq data; GSE62789). Specifically, for each time point, we obtained the log2 fold change of transcript expression relative to the sample at time 0, and used the mean fold change as the test statistic for each network. We calculated the p-value for each network from a significance test in which the null distribution was generated by calculating the test statistic from randomly sampled gene sets of the same size as the network (100,000 iterations). Enrichment scores, defined as the negative base-10 logarithm of the p-value, are shown in the left panel of Fig 2(c). Two networks (Fig 2(b2) and (b3)) are enriched at early time points (<160 minutes). Moreover, the differential isoforms of TSC1, FN1, LAMC2, AKT2, and GRB2 exhibit significant expression pattern changes over time (Fig 2(c), right panel).
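The random-gene-set test described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the two-sided comparison on the absolute mean, the add-one correction that keeps the p-value above zero, and all function and variable names are choices made for the example.

```python
import math
import random
from typing import Dict, Sequence, Tuple


def network_enrichment(log2fc: Dict[str, float], network: Sequence[str],
                       n_iter: int = 100_000, seed: int = 0
                       ) -> Tuple[float, float, float]:
    """Permutation test for the mean log2 fold change of a gene network.

    Returns (observed statistic, p-value, enrichment score = -log10 p).
    """
    rng = random.Random(seed)
    background = list(log2fc)
    size = len(network)
    observed = sum(log2fc[g] for g in network) / size
    hits = 0
    for _ in range(n_iter):
        # Null statistic: mean fold change of a random gene set of equal size.
        gene_set = rng.sample(background, size)
        null_stat = sum(log2fc[g] for g in gene_set) / size
        if abs(null_stat) >= abs(observed):
            hits += 1
    # Add-one correction (an assumption here) avoids reporting p = 0.
    p_value = (hits + 1) / (n_iter + 1)
    return observed, p_value, -math.log10(p_value)
```

The function would be called once per network and per time point, with `log2fc` holding the fold change of every transcript relative to time 0; a smaller `n_iter` is convenient for quick checks.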

Note that the other genes identified by the three methods may also be associated with breast cancer. However, our initial analysis, as reported in the supplementary material (S1 Text (Section S1); S1 and S2 Tables), leads us to believe that the pathways revealed by the large number of genes are diverse, making it hard to pin down their mechanistic involvement in the development of breast cancer recurrence.

2.4 PPI networks associated with breast cancer recurrence

We further mapped the differentially expressed genes to the Protein-Protein interaction (PPI) network from the Human Protein Reference Database (HPRD) [31], and then filtered out extremely low abundant isoforms according to the abundance relative to all of the isoforms of the same gene. With the criterion of median relative abundance >10%, 359 isoforms from 308 genes are identified as differentially expressed, among which 195 genes consist of multiple isoforms according to the annotation file with isoform structure. Furthermore, when compared with a gene-level analysis with the same criterion (‘Prob(d) > 0.75’) used to identify differentially expressed genes, 133 multiple-isoform genes are identified as differential at the isoform level but non-differential at the gene level.
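A minimal sketch of the relative-abundance filter follows. It assumes that relative abundance is computed per sample as the isoform's share of the summed abundance of all isoforms of the same gene, and that the median is taken across samples; the nested-dictionary layout and the function name are hypothetical.

```python
from statistics import median
from typing import Dict, List


def filter_low_abundance(abundance: Dict[str, Dict[str, List[float]]],
                         min_median_fraction: float = 0.10
                         ) -> Dict[str, List[str]]:
    """Keep isoforms whose median within-gene abundance share exceeds the cutoff.

    `abundance[gene][isoform]` is a list of per-sample abundance estimates.
    """
    kept: Dict[str, List[str]] = {}
    for gene, isoforms in abundance.items():
        n_samples = len(next(iter(isoforms.values())))
        # Total abundance of the gene in each sample, summed over its isoforms.
        totals = [sum(values[j] for values in isoforms.values())
                  for j in range(n_samples)]
        for isoform, values in isoforms.items():
            fractions = [v / t if t > 0 else 0.0
                         for v, t in zip(values, totals)]
            if median(fractions) > min_median_fraction:
                kept.setdefault(gene, []).append(isoform)
    return kept
```

Using the median rather than the mean keeps a single sample with unusually high abundance from rescuing an isoform that is negligible in most samples.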

Functional enrichment analysis of the 308 differentially expressed genes using Ingenuity Pathway Analysis (IPA; http://www.qiagen.com/ingenuity) reveals that many of the genes are associated with the cellular functions of proliferation, cell death, and migration. Functional enrichment analysis of the differentially expressed genes using DAVID (the Database for Annotation, Visualization and Integrated Discovery, http://david.abcc.ncifcrf.gov/home.jsp) shows that the identified gene set is enriched in several KEGG signaling pathways and functional clusters listed in Table 1.

Fig 3. Enrichment analysis of the identified differentially expressed isoforms overlapped with PPI network.

(a) The identified genes are categorized as single-isoform genes (genes with only one isoform) and multiple-isoform genes (genes with multiple isoforms). The multiple-isoform genes are further divided into two groups: differential at both gene-level and isoform-level, differential at the isoform level only. (b) Heatmaps of genes associated with proliferation of cells, migration of cells, and cell death, showing expression pattern change in a time-course E2 induced MCF-7 cell line data. The gene symbols of the heatmaps are color-coded according to the grouping in (a).

https://doi.org/10.1371/journal.pcbi.1012750.g003

We also validated the associated sets of isoforms on the estrogen (E2) induced time-course dataset (RNA-seq data; GSE62789). Both sets of isoforms associated with cell proliferation and migration are significantly enriched (p = 0.043 and p = 0.021, respectively); the p-value of the isoforms associated with cell death is borderline (p = 0.07). As shown in Fig 3, the expression of several isoforms, such as AKT2 and TSC1, changes significantly across time, implicating these genes and their isoforms in breast cancer development and recurrence. Details of the validation study can be found in S1 Text, Section S2.

The protein-protein interaction (PPI) networks of the differentially expressed genes are shown in Fig 4, where Fig 4(a) is the major connected network and Fig 4(b) represents small, isolated networks. In the PPI network of 308 genes, several hub genes (ESR1, BRCA1, CREBBP, ERBB2, and LCK) are known to play critical roles in breast cancer development. Also important are TNFRSF17, TNFRSF18, TNFRSF4, members of the Tumor Necrosis Factor Receptor superfamily that bind to various TRAF family members and can regulate tumor cell proliferation and death [31]. Moreover, from the functional enrichment analysis using DAVID, the genes participate in several signaling pathways including Jak-STAT, mTOR, MAPK, and Wnt signaling. Studies on the Jak-STAT and mTOR signaling pathways have established their roles in key processes that contribute to malignancy such as proliferation, apoptosis, and migration [32,33].

Fig 4. PPI networks of the identified differentially expressed genes.

Node color denotes the fold change: genes overexpressed in the ‘Dead’ group are shown in red; genes overexpressed in the ‘Alive’ group are shown in green. Node shape denotes the isoform information of the gene: round nodes are genes with a single isoform; rectangle nodes are genes with multiple isoforms of which only one is differentially expressed; diamond nodes are genes with multiple differentially expressed isoforms that are all up- or all down-regulated; triangle nodes are genes with multiple differentially expressed isoforms that are regulated in opposite directions. Node size denotes the node degree.

https://doi.org/10.1371/journal.pcbi.1012750.g004

Many genes associated with the signaling pathways are differential at the isoform level but not at the gene-level. Genes that reflect isoform-only level differential expression include PDPK1, TSC1, TSC2, PIK3R2, and AKT2 (mTOR signaling), and HSP90AA1 and HSP90AB1 (PI3K/AKT signaling). Thus, accurate and robust isoform-level differential analysis is essential and provides critical information when studying biological mechanisms associated with cancer recurrence. While HSP90AA1 has two isoforms from alternative splicing, only NM_005348 (RefSeq_id) is overexpressed in the ‘Dead’ group. HSP90AB1 has five isoforms, among which NM_007355 is detected as overexpressed in the ‘Dead’ group, whereas NM_001271971 is overexpressed in the ‘Alive’ group (Fig 5). HSP90AA1 and HSP90AB1 are Heat Shock Proteins (HSPs) that play an important role in tumorigenesis [34,35]. Overexpression of HSP90AA1 and HSP90AB1 can affect cancer cell viability and provide an escape mechanism from treatment-induced apoptosis. Functional analysis using IPA implicates the down-regulation of HSP90AB1 in activation of cell death within immune cells [36]. Collectively, these findings strongly implicate the changes in different isoform expression patterns in several key functions that directly affect cancer development.

Fig 5. Estimated abundance of isoforms of HSP90AB1.

The box plot shows the estimated expression level of the isoforms in the samples of the two phenotypes: ‘Alive’ and ‘Dead’. Isoform 2 is detected as overexpressed in the ‘Dead’ group (‘early recurrence’); isoform 4 is detected as overexpressed in the ‘Alive’ group (‘late recurrence’).

https://doi.org/10.1371/journal.pcbi.1012750.g005

2.5 Upregulated/downregulated pathways associated with early/late recurrence

We further divided the identified genes into two groups according to their expression pattern. 125 genes are overexpressed in the ‘Dead’ group, 172 genes are overexpressed in the ‘Alive’ group, and the remaining 11 genes have multiple isoforms with inconsistent expression patterns. Genes overexpressed in the ‘Dead’ group are labeled as ‘up-regulated’ while those overexpressed in the ‘Alive’ group are labeled as ‘down-regulated’. We performed a functional enrichment analysis on each group using IPA for identifying the enriched pathways of the up-regulated and down-regulated gene sets, respectively (Table 2). Pathways associated with cellular metabolism, survival and cell cycle are up-regulated, while immune response genes are down-regulated in tumors that recur early.
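The grouping rule can be sketched as below, assuming each differential isoform carries a signed fold change for ‘Dead’ versus ‘Alive’ (positive meaning higher in ‘Dead’). The input layout, the treatment of a zero fold change as not up-regulated, and the function name are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def classify_genes(isoform_changes: Iterable[Tuple[str, float]]
                   ) -> Dict[str, str]:
    """Label each gene by the direction of its differential isoforms."""
    directions: Dict[str, Set[bool]] = {}
    for gene, log_fold_change in isoform_changes:
        # True means the isoform is higher in the 'Dead' group.
        directions.setdefault(gene, set()).add(log_fold_change > 0)
    labels: Dict[str, str] = {}
    for gene, signs in directions.items():
        if signs == {True}:
            labels[gene] = "up-regulated"
        elif signs == {False}:
            labels[gene] = "down-regulated"
        else:
            labels[gene] = "inconsistent"
    return labels
```

Genes labeled "inconsistent" correspond to those with multiple differential isoforms that change in opposite directions, and would be set aside before running enrichment on the two directional gene sets.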

Table 2. Enriched pathways on up-regulated and down-regulated genes/isoforms.

https://doi.org/10.1371/journal.pcbi.1012750.t002

CD36 is a multi-ligand cell surface transmembrane receptor that regulates apoptosis, adipocyte differentiation, cellular metabolism, immunity and angiogenesis [37]. CD36 expression, which can be regulated by estrogen and anti-estrogens [38], has been associated with mammary density and clinical outcome [39]. HSP90AA1 (HSP90) is a chaperone protein and its inhibition impaired the emergence of resistance to hormone antagonists both in cell culture and in mice [40]. Antiestrogen resistance is sustained by up-regulation of autophagy, a cellular cannibalistic process that is closely regulated by mTOR [41,42]. Upstream of mTOR, AMPK signaling controls growth factors and energy signaling cascades including PI3K/AKT and insulin signaling [43–45], and also regulates autophagy [41,42].

TOP2A (topoisomerase II alpha) is an enzyme that catalyzes the topological DNA changes needed during the multistep process of cell division [46]. In breast cancer, TOP2A expression correlates significantly with ER, Ki-67, and HER2 expression [47]. Aberrations of chromosome 17q12-q22 have been reported in breast cancer and this locus incorporates the TOP2A gene along with HER2 [48]. Several of the cytotoxic drugs routinely used in the management of advanced breast cancer, including doxorubicin and epirubicin, target topoisomerases [49]. Overexpression of topoisomerase may reduce the efficacy of these anthracyclines, leading to drug resistance and early recurrence in some patients [50]. Expression of NFATC1, a member of a family of transcription factors that regulate the immune system, is down-regulated in early-recurrent breast tumors [51]. This down-regulation may partly explain the decrease in helper T cell-mediated antitumor activity seen in some breast cancers [52].

3. Discussion

RNA-seq data make large-scale isoform-level differential analysis possible, yet also pose remarkable challenges due to high variability and uncertainty of read assignment at the transcript level. We have developed a Bayesian approach, BayesIso, for the identification of differentially expressed isoforms. A hierarchical model, with differential states as hidden variables, is devised to account for both between-sample variability and within-sample variability. Specifically, a Poisson-Lognormal model is used to model the within-sample variability specific to each transcript. The expression level of transcripts is modeled to follow a Gamma distribution so as to capture the between-sample variability, including both over-dispersion and under-dispersion, by the model parameters. The rate parameter of the Gamma distribution is further assumed to follow a second Gamma distribution. Differential states of the transcripts are embedded into the Gamma-Gamma model as hidden variables, affecting the distribution of transcript expressions in each group or condition.

We have applied BayesIso to breast cancer RNA-seq data to identify differentially expressed isoforms associated with breast cancer recurrence. Diverse bias patterns along transcripts and a generally low differential level have been observed in the real breast cancer data, indicating their importance in differential analysis of RNA-seq data. The differentially expressed isoforms detected by BayesIso are enriched in cell proliferation, apoptosis, and migration, pointing to mechanisms related to breast cancer recurrence. Moreover, the unique set of differential genes identified by BayesIso has helped reveal several signaling pathways such as the PI3K/AKT/mTOR signaling and PTEN signaling pathways. The identified down-regulated genes in the early recurrence group, e.g., NFATC1, participate in the immune system, which may indicate a role of the immune system in breast cancer recurrence.

As a final note, modeling the sequencing bias for RNA-seq data analysis is a non-trivial task. The bias patterns are complicated and cannot be well explained by known sources. In the BayesIso method, we have used a flexible model to account for the bias independent of any particular pattern. However, we have also observed that certain bias patterns (such as bias to the 3′ end, or high in the middle) occur more frequently than others. Moreover, we have observed that the bias patterns may be affected by the expression level. In future work, we will incorporate certain bias patterns as prior knowledge into the model, which can help estimate the bias pattern of some isoforms more accurately and hence improve the performance in differential analysis of isoforms.

4. Methods

4.1 Model description of the BayesIso approach

Let represent the observed counts that fall into the ith (1 ≤ iIg) exon region of isoform t (1 ≤ tT) of gene g (1 ≤ gG) in sample j (1 ≤ jJ). T is the number of isoforms of gene g given by the annotation information. Ig is the number of exons in gene g. G is the total number of genes. J = J1 + J2 is the total number of samples, where J1 and J2 denote the number of samples in phenotype 1 and 2, respectively. Since one gene may have multiple isoforms, yg,i,j, the observed counts in the exon region, is the combination of all potential isoforms, as defined in Eq. (1):

(1)

where a binary indicator specifies whether exon $i$ is included in isoform $t$ of gene $g$. At the isoform level, we use a Poisson-Lognormal regression model to account for the within-sample variability of RNA-seq data. The isoform-level count $y_{g,t,i,j}$ follows a Poisson distribution whose mean is specified as follows:

(2)

According to the Poisson-Lognormal model [19],

(3)(4)(5)

where $\beta_{g,t,j}$ is the true expression level of isoform $t$ of gene $g$ in sample $j$, and the length of the $i$th exon enters the model weighted by the library size of sample $j$. $U_{g,t,i}$ is a model parameter representing the within-sample variability (or dispersion) for exon $i$ of isoform $t$ of gene $g$. Thus, the dispersion at different loci (the exons of the isoforms) is modeled by different parameters. The precision parameter $\tau$ controls the overall degree of within-sample variability.
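The within-sample layer described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Poisson mean is taken to be the weighted exon length times the expression level times exp(U), with U drawn from a zero-mean normal distribution of precision tau. The exact forms of Eqs. (2)-(5) are not reproduced in this text, so that parameterization, the function names, and the numerical values are all hypothetical.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    if mean <= 0.0:
        return 0
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_exon_counts(beta, exon_len, include, tau, rng):
    """Simulate counts for one gene in one sample.

    beta     : expression level of each isoform
    exon_len : library-size-weighted length of each exon
    include  : include[t][i] is 1 if exon i belongs to isoform t, else 0
    tau      : precision of the log-scale bias term (assumed form)
    """
    sd = 1.0 / math.sqrt(tau)
    iso_counts = []
    for t, beta_t in enumerate(beta):
        row = []
        for i, length in enumerate(exon_len):
            u = rng.gauss(0.0, sd)  # assumed: U ~ Normal(0, 1/tau)
            mean = include[t][i] * length * beta_t * math.exp(u)
            row.append(sample_poisson(mean, rng))
        iso_counts.append(row)
    # Observed exon-level counts combine the contributions of all isoforms.
    exon_counts = [sum(col) for col in zip(*iso_counts)]
    return iso_counts, exon_counts

rng = random.Random(0)
include = [[1, 1, 1],   # isoform 1 contains all three exons
           [1, 0, 1]]   # isoform 2 skips the second exon
iso_counts, exon_counts = simulate_exon_counts(
    beta=[0.2, 0.5], exon_len=[10.0, 20.0, 15.0], include=include, tau=4.0, rng=rng)
print(iso_counts, exon_counts)
```

In this toy setting only the exon-level totals would be observed; the isoform-level counts are latent, which is why the skipped exon of isoform 2 always contributes zero.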

We use a Gamma-Gamma model [53] to model the expression level $\beta_{g,t,j}$ across samples collected from two phenotypes. The differential state, as a hidden variable in this Bayesian model, affects the distribution of $\beta_{g,t,j}$ among samples in each of the two phenotypes. $d_{g,t}$, a binary value, indicates the differential state of isoform $t$ of gene $g$, where $d_{g,t} = 1$ means that isoform $t$ of gene $g$ is differentially expressed and $d_{g,t} = 0$ means that it is not. Note that the between-sample variability is captured by the Gamma distribution. From the Gamma-Gamma model, the isoform expression level $\beta_{g,t,j}$ is given by:

(6)(7)(8)(9)(10)

where $\alpha$ is the shape parameter and $\lambda$ is the rate parameter, which depends on the differential state $d_{g,t}$. If $d_{g,t} = 0$, the samples of the two phenotypes share a common rate parameter; if $d_{g,t} = 1$, each phenotype has its own rate parameter. The rate parameter $\lambda$ is further assumed to follow a Gamma distribution with shape parameter $\alpha_0$ and rate parameter $\nu$. In marked contrast to existing methods such as Cuffdiff 2, which use statistical tests to identify differentially expressed isoforms, BayesIso introduces the differential states of isoforms as variables of the joint model. The differential states are estimated jointly with the other model parameters by a Markov Chain Monte Carlo (MCMC) sampling method, as described in detail in the next section.
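To make the role of the differential state concrete, the following sketch draws expression levels for one isoform from a Gamma-Gamma hierarchy. It assumes the standard form of that model: rates are drawn from Gamma(alpha0, rate nu), and expression levels from Gamma(alpha, rate lambda), with one shared rate when d = 0 and two independent rates when d = 1. The function name and parameter values are hypothetical.

```python
import random

def simulate_expression(d, n1, n2, alpha, alpha0, nu, rng):
    """Draw expression levels for one isoform in two phenotypes.

    random.gammavariate takes (shape, scale), so a rate r is passed as 1/r.
    """
    lam1 = rng.gammavariate(alpha0, 1.0 / nu)
    # d = 1: phenotype 2 gets its own rate; d = 0: both phenotypes share lam1.
    lam2 = rng.gammavariate(alpha0, 1.0 / nu) if d == 1 else lam1
    group1 = [rng.gammavariate(alpha, 1.0 / lam1) for _ in range(n1)]
    group2 = [rng.gammavariate(alpha, 1.0 / lam2) for _ in range(n2)]
    return (lam1, lam2), group1, group2

rng = random.Random(7)
rates_same, s1, s2 = simulate_expression(0, 5, 4, alpha=2.0, alpha0=1.0, nu=1.0, rng=rng)
rates_diff, g1, g2 = simulate_expression(1, 5, 4, alpha=2.0, alpha0=1.0, nu=1.0, rng=rng)
print(rates_same, rates_diff)
```

Because the mean of a Gamma(alpha, rate lambda) variable is alpha / lambda, two distinct rates translate directly into two distinct mean expression levels between the phenotypes.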

4.2 The MCMC algorithm used in BayesIso

Due to the complexity of the joint model, it is challenging to estimate the model parameters and the hidden variables (i.e., the differential states, $d = [d_{g,t}]$) directly. We have therefore designed a Markov Chain Monte Carlo (MCMC) method to estimate the parameters and the hidden variables ($d$). The MCMC sampling process combines Gibbs sampling and Metropolis-Hastings (M-H) sampling, so that samples can be drawn from the conditional distribution of every parameter. The marginal posterior distributions of the parameters and the hidden variables are then approximated by the samples drawn from the MCMC sampling procedure. Next, we describe the MCMC algorithm and the associated conditional distributions.

Based on the assumption that the expression levels of the transcripts are independent, the likelihood of the observations given all the parameters factorizes over the transcripts. Thus, the conditional (posterior) distributions of the parameters $U$ and $\tau$ (of the Poisson-Lognormal model) can be derived as follows:

(11)(12)

Similarly, the conditional posterior distributions of the parameters β, λ, α, α0, ν and d for the Gamma-Gamma model can also be derived. The details can be found in S1 Text, Section S3.
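As an illustration of how a conditional distribution for the differential state can arise, the sketch below computes the probability that d = 1 for one isoform given its expression levels, with the rate parameter integrated out analytically. This closed form follows from Gamma conjugacy in the generic Gamma-Gamma model; it is not taken from S1 Text, and BayesIso itself samples d by M-H rather than evaluating this expression. The function names, the prior probability `prior_d1`, and the example values are hypothetical.

```python
import math

def log_marginal(x, alpha, alpha0, nu):
    """Log marginal density of x_1..x_n ~ Gamma(alpha, rate lam),
    with lam ~ Gamma(alpha0, rate nu) integrated out."""
    n, total = len(x), sum(x)
    return ((alpha - 1.0) * sum(math.log(v) for v in x)
            - n * math.lgamma(alpha)
            + alpha0 * math.log(nu) - math.lgamma(alpha0)
            + math.lgamma(n * alpha + alpha0)
            - (n * alpha + alpha0) * math.log(nu + total))

def prob_differential(group1, group2, alpha, alpha0, nu, prior_d1=0.5):
    """Posterior probability that the two groups have different rates."""
    # d = 0: one shared rate for all samples.
    log_h0 = math.log(1.0 - prior_d1) + log_marginal(group1 + group2, alpha, alpha0, nu)
    # d = 1: an independent rate for each phenotype.
    log_h1 = (math.log(prior_d1)
              + log_marginal(group1, alpha, alpha0, nu)
              + log_marginal(group2, alpha, alpha0, nu))
    m = max(log_h0, log_h1)  # subtract the maximum for numerical stability
    w0, w1 = math.exp(log_h0 - m), math.exp(log_h1 - m)
    return w1 / (w0 + w1)

low = [1.0, 1.1, 0.9, 1.0]
p_same = prob_differential(low, [1.0, 0.9, 1.1, 1.0], alpha=10.0, alpha0=1.0, nu=1.0)
p_diff = prob_differential(low, [10.0, 11.0, 9.0, 10.0], alpha=10.0, alpha0=1.0, nu=1.0)
print(p_same, p_diff)
```

With similar groups the shared-rate explanation dominates and the probability is small; with a tenfold difference in expression the probability is close to one.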

With the conditional posterior distributions derived, the MCMC algorithm is designed with steps for Gibbs sampling and Metropolis-Hastings (M-H) sampling. Note that Gibbs sampling is used for the parameters with conjugate priors, whose conditional distributions are standard distributions, while M-H sampling is used for the parameters without conjugate priors. The MCMC algorithm can be summarized as follows:

INPUT: Observed read counts y, library size weighted isoform structure x, number of iterations N

OUTPUT: Estimates of all of the parameters and the differential state d in the joint Bayesian model

Algorithm.

Step 1. Initialization: each parameter is set to an arbitrary value, and non-informative priors are used for the parameters.

Step 2. Draw samples iteratively from the conditional distributions of parameters β, U, τ (in the Poisson-Lognormal model) and parameters λ, α, α0, ν and d (in the Gamma-Gamma model). Perform the following sampling steps for N iterations:

  • Use Gibbs sampling to draw samples of β, τ, λ, ν from their conditional distributions, which follow standard probability distributions;
  • Use Metropolis-Hastings (M-H) sampling to draw samples of U, d, α, α0 from their conditional distributions in sequence. Since these parameters do not have conjugate priors, M-H sampling is used to approximate their posterior distributions.

Step 3. Estimate differential state d as well as other parameters β, U, τ, λ, α, α0, ν from the samples, after the burn-in period, generated from the MCMC procedure.
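The three steps above can be sketched on a deliberately reduced problem. The example below is a toy Metropolis-within-Gibbs sampler for a single Gamma(alpha, rate lambda) sample with a Gamma(alpha0, rate nu) prior on lambda: lambda has a conjugate conditional and is updated by Gibbs sampling, while alpha is updated by a random-walk M-H step on the log scale. It is not the BayesIso sampler; the Exponential(1) prior on alpha, the step size, and all names are assumptions made for illustration.

```python
import math
import random

def log_cond_alpha(alpha, lam, n, sum_log):
    """Log conditional density of alpha given lam, up to a constant."""
    return (n * alpha * math.log(lam) - n * math.lgamma(alpha)
            + (alpha - 1.0) * sum_log
            - alpha)  # assumed Exponential(1) prior on alpha

def run_sampler(data, alpha0, nu, n_iter, burn_in, rng, step=0.2):
    n, total = len(data), sum(data)
    sum_log = sum(math.log(b) for b in data)
    alpha, lam = 1.0, 1.0  # Step 1: arbitrary starting values
    kept, accepted = [], 0
    for it in range(n_iter):  # Step 2: iterate the conditional updates
        # Gibbs: lam | alpha, data ~ Gamma(alpha0 + n*alpha, rate = nu + sum(data)).
        lam = rng.gammavariate(alpha0 + n * alpha, 1.0 / (nu + total))
        # M-H: random walk on log(alpha); log(proposal/alpha) is the Jacobian term.
        proposal = alpha * math.exp(rng.gauss(0.0, step))
        log_ratio = (log_cond_alpha(proposal, lam, n, sum_log)
                     - log_cond_alpha(alpha, lam, n, sum_log)
                     + math.log(proposal) - math.log(alpha))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            alpha, accepted = proposal, accepted + 1
        if it >= burn_in:  # Step 3: keep only post-burn-in draws
            kept.append((alpha, lam))
    return kept, accepted / n_iter

rng = random.Random(1)
data = [rng.gammavariate(3.0, 1.0 / 2.0) for _ in range(50)]  # synthetic sample
kept, acc_rate = run_sampler(data, alpha0=1.0, nu=1.0, n_iter=5000, burn_in=1000, rng=rng)
post_alpha = sum(a for a, _ in kept) / len(kept)
post_lam = sum(l for _, l in kept) / len(kept)
post_mean_expr = sum(a / l for a, l in kept) / len(kept)
print(post_alpha, post_lam, post_mean_expr, acc_rate)
```

Posterior summaries are plain averages over the retained draws; for a binary variable such as d, the same average would give the estimated probability of differential expression.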

4.3 Performance evaluation of the BayesIso Approach

We conducted a comprehensive study to evaluate the performance of BayesIso, focusing on differential analysis of RNA-seq data at the isoform level. We ran our experiments using genes with an increasing number of isoforms, starting from genes with two isoforms. For each experiment, gene sets were randomly selected from the annotation file from the UCSC genome browser database (version: GRCh37/hg19; http://genome.ucsc.edu/). Multiple synthetic data sets with varying model parameters were generated using our simulator that produced aligned reads in the ‘BAM’ format. We compared BayesIso with two existing methods: Cuffdiff 2 (version 2.2.1) [6] and Ballgown (version 1.0.4) [54], for isoform-level differential analysis of RNA-seq data.

Specifically, the performance of BayesIso was evaluated based on its accuracy in abundance quantification and in differential isoform identification. Its performance was compared with that of Cufflinks and Cuffdiff 2; BayesIso exhibited consistent and improved performance over the competing methods across different levels of within-sample variability and different bias patterns. S4 and S5 Tables summarize the results of the performance comparison study, showing the advantage of BayesIso over existing methods for differential analysis of isoforms. More details about the performance on abundance quantification and differential analysis of isoforms can be found in S1 Text, Sections S4 and S5.

We also generated synthetic data using an RNA-seq simulator (RNAseqReadSimulator [25]) to test the performance of the competing methods. For this experiment, 1,000 isoforms from 500 genes were randomly selected, of which 498 isoforms were differentially expressed. Consistent with the previous comparison, our method achieved the highest overall performance as measured by F-score (S8 Table). Cuffdiff 2 achieved very high precision, but its recall was very low. Note that BayesIso has higher detection power (with a much higher recall) on the moderately differential isoforms (−3 dB < SNR < −1 dB; see S1 Text, Section S6 for more details).
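For reference, precision, recall, and F-score can be computed from the true differential states and the calls of a method as sketched below. The toy labels, the posterior probabilities, and the 0.75 calling threshold are hypothetical values chosen for illustration, not results from this study.

```python
def precision_recall_f(truth, called):
    """Precision, recall, and F-score (harmonic mean) for binary calls."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_score

truth = [1, 1, 1, 1, 0, 0, 0, 0]

# A conservative caller: the single call is correct, but most true isoforms are missed.
conservative = [1, 0, 0, 0, 0, 0, 0, 0]
cons = precision_recall_f(truth, conservative)

# Calls obtained by thresholding hypothetical posterior probabilities of d = 1.
probs = [0.95, 0.90, 0.80, 0.60, 0.78, 0.10, 0.20, 0.05]
calls = [1 if p >= 0.75 else 0 for p in probs]
bayes = precision_recall_f(truth, calls)
print(cons, bayes)
```

The conservative caller illustrates the pattern described above: perfect precision does not compensate for low recall, so its F-score is well below that of a caller with balanced precision and recall.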

We further used real RNA-seq benchmark datasets to evaluate the performance of the competing methods on differential analysis of RNA-seq data. The datasets are part of the MicroArray Quality Control (MAQC) project for benchmarking and characterizing RNA-seq technology [55,56]. Using RNA spike-ins and the expression of 1,000 genes validated by qRT-PCR, the performance of BayesIso was benchmarked and compared with that of existing methods. The results further support that the joint model employed in BayesIso yields improved overall performance for differential analysis of isoforms, as shown in S13 and S14 Figs; more details of the ROC study and the precision-recall study can be found in S1 Text, Section S7.

Supporting information

S1 Text. Supplementary material: methods, performance evaluation and breast cancer study.

https://doi.org/10.1371/journal.pcbi.1012750.s001

(DOCX)

S1 Fig. Histogram of patients’ survival time: the ‘Dead’ group is shown in red; the ‘Alive’ group is shown in blue.

https://doi.org/10.1371/journal.pcbi.1012750.s002

(TIF)

S2 Fig. Estimated bias patterns of the sequencing reads.

The mean bias pattern of all the isoforms is shown by the red curve in the upper-left panel. However, different sets of isoforms exhibit different bias patterns. The three blue curves show the mean bias patterns of different groups of isoforms, where the isoforms are grouped according to their bias patterns.

https://doi.org/10.1371/journal.pcbi.1012750.s003

(TIF)

S3 Fig. Histogram of estimated probability that the isoforms are differentially expressed.

Red line denotes Prob(d=1) = 0.75.

https://doi.org/10.1371/journal.pcbi.1012750.s004

(TIF)

S4 Fig. Histogram of SNR of the differentially expressed isoforms.

https://doi.org/10.1371/journal.pcbi.1012750.s005

(TIF)

S5 Fig. Histogram of estimated abundance of TNFSF10 in samples in the two groups.

Blue bars denote the abundance in the ‘Alive’ group, and the blue curve denotes the fitting of the blue bars with a gamma distribution. Red bars denote the abundance in the ‘Dead’ group, and the red curve denotes the fitting of the red bars with a gamma distribution.

https://doi.org/10.1371/journal.pcbi.1012750.s006

(TIF)

S6 Fig. Performance comparison on abundance quantification.

(a) Different overall within-sample variability; (b) different bias patterns along the genomic location. Average correlation coefficient between the estimated abundance and the true abundance of the isoforms is used to evaluate the performance.

https://doi.org/10.1371/journal.pcbi.1012750.s007

(TIF)

S7 Fig. Histogram of the SNR of all truly differentially expressed isoforms.

Red line denotes SNR = −1 dB; green line denotes SNR = −3 dB.

https://doi.org/10.1371/journal.pcbi.1012750.s008

(TIF)

S8 Fig. Results of differential analysis using the three competing methods: (a) histogram of the probability for the isoforms to be differentially expressed estimated by BayesIso; (b) histogram of the p-value calculated by Cuffdiff 2; (c) histogram of the p-value calculated by Ballgown.

https://doi.org/10.1371/journal.pcbi.1012750.s009

(TIF)

S9 Fig. Performance comparison on abundance quantification on different groups of isoforms.

The average correlation coefficients over all of the isoforms for the three competing methods are shown by the three leftmost bars, while the performance on the three groups of differentially expressed isoforms and on the non-differential isoforms is shown by the other four groups of bars.

https://doi.org/10.1371/journal.pcbi.1012750.s010

(TIF)

S10 Fig. Sequencing bias along the genomic location.

(a) Four different bias patterns, shown by the curves, are simulated by varying the sequencing probabilities according to genomic location. (b) Estimated bias patterns of the four groups of isoforms.

https://doi.org/10.1371/journal.pcbi.1012750.s011

(TIF)

S11 Fig. Histogram of the SNR of the truly differential isoforms: red line denotes SNR = −1 dB; green line denotes SNR = −3 dB.

https://doi.org/10.1371/journal.pcbi.1012750.s012

(TIF)

S12 Fig. Results of differential analysis using the three competing methods: (a) histogram of the probability for the isoforms to be differentially expressed estimated by BayesIso; (b) histogram of the p-value calculated by Cuffdiff 2; (c) histogram of the p-value calculated by Ballgown.

https://doi.org/10.1371/journal.pcbi.1012750.s013

(TIF)

S13 Fig. Performance comparison on differential analysis using the SEQC dataset benchmarked by ERCC RNAs.

https://doi.org/10.1371/journal.pcbi.1012750.s014

(TIF)

S14 Fig. Performance comparison on differential analysis on MAQC data with TaqMan qRT-PCR measurements as benchmark: (a) overall performance evaluated by F-score; (b) performances of recall and precision, respectively.

https://doi.org/10.1371/journal.pcbi.1012750.s015

(TIF)

S1 Table. Enriched Ingenuity canonical pathways obtained from genes identified by BayesIso, Cuffdiff 2 and Ballgown.

https://doi.org/10.1371/journal.pcbi.1012750.s016

(XLSX)

S2 Table. Enriched KEGG pathways obtained from genes identified by BayesIso, Cuffdiff 2 and Ballgown.

https://doi.org/10.1371/journal.pcbi.1012750.s017

(XLSX)

S3 Table. Number of isoforms grouped according to SNR.

https://doi.org/10.1371/journal.pcbi.1012750.s018

(XLSX)

S4 Table. Performance comparison on differential analysis at varying parameters α or α0 (while other parameters are fixed).

In general, the smaller α is, the lower the abundance of the isoforms; the smaller α0 is, the more differentially expressed the isoforms are. The precision, recall, and F-score are reported as mean values over 5 experiments. (Note that the number of isoforms is set to 2 for the genes studied in this experiment.)

https://doi.org/10.1371/journal.pcbi.1012750.s019

(XLSX)

S5 Table. Performance comparison on differential analysis at different SNR levels.

https://doi.org/10.1371/journal.pcbi.1012750.s020

(XLSX)

S6 Table. Performance comparison on differentially expressed isoform detection on genes with 3, 4, and 5 isoforms, as well as genes with single isoform.

https://doi.org/10.1371/journal.pcbi.1012750.s021

(XLSX)

S7 Table. Number of isoforms grouped according to SNR.

https://doi.org/10.1371/journal.pcbi.1012750.s022

(XLSX)

S8 Table. Performance comparison on differential analysis at different SNR levels.

https://doi.org/10.1371/journal.pcbi.1012750.s023

(XLSX)

References

  1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. pmid:19015660
  2. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet. 2011;12(2):87–98. pmid:21191423
  3. Wilhelm BT, Landry J-R. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009;48(3):249–57. pmid:19336255
  4. Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11(12):220. pmid:21176179
  5. Eswaran J, Horvath A, Godbole S, Reddy SD, Mudvari P, Ohshiro K, et al. RNA sequencing of cancer reveals novel splicing alterations. Sci Rep. 2013;3:1689. pmid:23604310
  6. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013;31(1):46–53. pmid:23222703
  7. Wu Z, Wang X, Zhang X. Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics. 2011;27(4):502–8. pmid:21169371
  8. Hansen KD, Irizarry RA, Wu Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13(2):204–16. pmid:22285995
  9. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38(12):e131. pmid:20395217
  10. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5. pmid:20436464
  11. Gu J, Wang X, Halakivi-Clarke L, Clarke R, Xuan J. BADGE: a novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data. BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S6. pmid:25252852
  12. Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One. 2014;9(8):e103207. pmid:25119138
  13. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. pmid:20979621
  14. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. pmid:19910308
  15. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29(8):1035–43. pmid:23428641
  16. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013;14(2):232–43. pmid:23001152
  17. Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT, et al. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol. 2015;33(3):243–6. pmid:25748911
  18. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12(3):R22. pmid:21410973
  19. Hu M, Zhu Y, Taylor JMG, Liu JS, Qin ZS. Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq. Bioinformatics. 2012;28(1):63–8. pmid:22072384
  20. Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007;23(12):1537–44. pmid:17483504
  21. Carlin BP, Chib S. Bayesian model choice via Markov chain Monte Carlo methods. J R Stat Soc Ser B Stat Methodol. 1995;57(3):473–84.
  22. Gilks WR. Markov chain Monte Carlo. Encyclopedia Biostat. 2005.
  23. Casella G, George EI. Explaining the Gibbs sampler. Am Stat. 1992;46(3):167–74.
  24. Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. Am Stat. 1995;49(4):327–35.
  25. Li W, Jiang T. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics. 2012;28(22):2914–21. pmid:23060617
  26. Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, et al. The UCSC Genome browser database: 2014 update. Nucleic Acids Res. 2014;42(Database issue):D764-70. pmid:24270787
  27. Ciruelos Gil EM. Targeting the PI3K/AKT/mTOR pathway in estrogen receptor-positive breast cancer. Cancer Treat Rev. 2014;40(7):862–71. pmid:24774538
  28. Paplomata E, O’Regan R. The PI3K/AKT/mTOR pathway in breast cancer: targets, trials and biomarkers. Ther Adv Med Oncol. 2014;6(4):154–66. pmid:25057302
  29. DeGraffenried LA, Fulcher L, Friedrichs WE, Grünwald V, Ray RB, Hidalgo M, et al. Reduced PTEN expression in breast cancer cells confers susceptibility to inhibitors of the PI3 kinase/Akt pathway. Ann Oncol. 2004;15(10):1510–6. pmid:15367412
  30. Panigrahi AR, Pinder SE, Chan SY, Paish EC, Robertson JFR, Ellis IO, et al. The role of PTEN and its signalling pathways, including AKT, in breast cancer; an assessment of relationships with other prognostic factors and with outcome. J Pathol. 2004;204(1):93–100. pmid:15307142
  31. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human protein reference database-2009 update. Nucleic Acids Res. 2009;37(Database issue):D767-72. pmid:18988627
  32. Azab SS. Targeting the mTOR signaling pathways in breast cancer: more than the rapalogs. J Biochem Pharmacol Res. 2013;1(2):75–83.
  33. Thomas SJ, Snowden JA, Zeidler MP, Danson SJ. The role of JAK/STAT signalling in the pathogenesis, prognosis and treatment of solid tumours. Br J Cancer. 2015;113(3):365–71. pmid:26151455
  34. Ozgur A, Tutar L, Tutar Y. Regulation of heat shock proteins by miRNAs in human breast cancer. Microrna. 2014;3(2):118–35. pmid:25541910
  35. Cooper LC, Prinsloo E, Edkins AL, Blatch GL. Hsp90α/β associates with the GSK3β/axin1/phospho-β-catenin complex in the human MCF-7 epithelial breast cancer model. Biochem Biophys Res Commun. 2011;413(4):550–4. pmid:21925151
  36. Kuo CC, Liang CM, Lai CY, Liang SM. Involvement of heat shock protein (Hsp)90 beta but not Hsp90 alpha in antiapoptotic effect of CpG-B oligodeoxynucleotide. J Immunol. 2007;178(10):6100–8. pmid:17475835
  37. Silverstein RL, Febbraio M. CD36, a scavenger receptor involved in immunity, metabolism, angiogenesis, and behavior. Sci Signal. 2009;2(72):re3. pmid:19471024
  38. Silva ID, Salicioni AM, Russo IH, Higgy NA, Gebrim LH, Russo J, et al. Tamoxifen down-regulates CD36 messenger RNA levels in normal and neoplastic human breast tissues. Cancer Res. 1997;57(3):378–81. pmid:9012459
  39. DeFilippis RA, Chang H, Dumont N, Rabban JT, Chen Y-Y, Fontenay GV, et al. CD36 repression activates a multicellular stromal program shared by high mammographic density and tumor tissues. Cancer Discov. 2012;2(9):826–39. pmid:22777768
  40. Whitesell L, Santagata S, Mendillo ML, Lin NU, Proia DA, Lindquist S, et al. HSP90 empowers evolution of resistance to hormonal therapy in human breast cancer models. Proc Natl Acad Sci U S A. 2014;111(51):18297–302. pmid:25489079
  41. Cook KL, Shajahan AN, Clarke R. Autophagy and endocrine resistance in breast cancer. Expert Rev Anticancer Ther. 2011;11(8):1283–94. pmid:21916582
  42. Clarke R, Cook KL, Hu R, Facey COB, Tavassoly I, Schwartz JL, et al. Endoplasmic reticulum stress, the unfolded protein response, autophagy, and the integrated regulation of breast cancer cell fate. Cancer Res. 2012;72(6):1321–31. pmid:22422988
  43. Villar VH, Merhi F, Djavaheri-Mergny M, Durán RV. Glutaminolysis and autophagy in cancer. Autophagy. 2015;11(8):1198–208. pmid:26054373
  44. Ávalos Y, Canales J, Bravo-Sagua R, Criollo A, Lavandero S, Quest AFG, et al. Tumor suppression and promotion by autophagy. Biomed Res Int. 2014;2014:603980. pmid:25328887
  45. Feng Z. p53 regulation of the IGF-1/AKT/mTOR pathways and the endosomal compartment. Cold Spring Harb Perspect Biol. 2010;2(2):a001057. pmid:20182617
  46. Watt PM, Hickson ID. Structure and function of type II DNA topoisomerases. Biochem J. 1994;303(Pt 3):681–95. pmid:7980433
  47. Qiao JH, Jiao DC, Lu ZD, Yang S, Liu ZZ. Clinical significance of topoisomerase 2A expression and gene change in operable invasive breast cancer. Tumour Biol. 2015;36(9):6833–8. pmid:25846735
  48. Huijsmans CJJ, van den Brule AJC, Rigter H, Poodt J, van der Linden JC, Savelkoul PHM, et al. Allelic imbalance at the HER2/TOP2A locus in breast cancer. Diagn Pathol. 2015;10:56. pmid:26022247
  49. Nitiss JL. Targeting DNA topoisomerase II in cancer chemotherapy. Nat Rev Cancer. 2009;9(5):338–50. pmid:19377506
  50. Sparano JA, Goldstein LJ, Childs BH, Shak S, Brassard D, Badve S, et al. Relationship between Topoisomerase 2A RNA expression and recurrence after adjuvant chemotherapy for breast cancer. Clin Cancer Res. 2009;15(24):7693–700. pmid:19996222
  51. Kaunisto A, Henry WS, Montaser-Kouhsari L, Jaminet SC, Oh EY, Zhao L, et al. NFAT1 promotes intratumoral neutrophil infiltration by regulating IL8 expression in breast cancer. Mol Oncol. 2015;9(6):1140–54. pmid:25735562
  52. Datta J, Berk E, Xu S, Fitzpatrick E, Rosemblit C, Lowenfeld L, et al. Anti-HER2 CD4(+) T-helper type 1 response is a novel immune correlate to pathologic response following neoadjuvant therapy in HER2-positive breast cancer. Breast Cancer Res. 2015;17(1):71. pmid:25997452
  53. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001;8(1):37–52. pmid:11339905
  54. Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT, et al. Flexible isoform-level differential expression analysis with Ballgown. Bioinformatics. 2014.
  55. Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, et al. The MicroArray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol. 2010;28(8):827–38. pmid:20676074
  56. MAQC Consortium, Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24(9):1151–61. pmid:16964229