Robust Modeling of Differential Gene Expression Data Using Normal/Independent Distributions: A Bayesian Approach

Mojtaba Ganjali; Taban Baghfalaki; Damon Berridge

doi:10.1371/journal.pone.0123791

Abstract

In this paper, the problem of identifying differentially expressed genes under different conditions using gene expression microarray data, in the presence of outliers, is discussed. For this purpose, the robust modeling of gene expression data using some powerful distributions known as normal/independent distributions is considered. These distributions include the Student’s t and normal distributions which have been used previously, but also include extensions such as the slash, the contaminated normal and the Laplace distributions. The purpose of this paper is to identify differentially expressed genes by considering these distributional assumptions instead of the normal distribution. A Bayesian approach using the Markov Chain Monte Carlo method is adopted for parameter estimation. Two publicly available gene expression data sets are analyzed using the proposed approach. The use of the robust models for detecting differentially expressed genes is investigated. This investigation shows that the choice of model for differentiating gene expression data is very important. This is due to the small number of replicates for each gene and the existence of outlying data. Comparison of the performance of these models is made using different statistical criteria and the ROC curve. The method is illustrated using some simulation studies. We demonstrate the flexibility of these robust models in identifying differentially expressed genes.

Citation: Ganjali M, Baghfalaki T, Berridge D (2015) Robust Modeling of Differential Gene Expression Data Using Normal/Independent Distributions: A Bayesian Approach. PLoS ONE 10(4): e0123791. https://doi.org/10.1371/journal.pone.0123791

Academic Editor: James P. Brody, Irvine, UNITED STATES

Received: November 18, 2014; Accepted: March 7, 2015; Published: April 24, 2015

Copyright: © 2015 Ganjali et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper.

Funding: The research presented in this study was carried out on the High Performance Computing Cluster supported by the Computer Science department of Institute for Research in Fundamental Sciences (IPM). The authors would like to thank the Iranian National Science Foundation (INSF BS1392201) for their support.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Microarrays allow the simultaneous measurement of the expression levels of thousands of genes. This excellent data structure has inspired a completely new area of research in statistics and bioinformatics [1]. [2] considered the problem of identifying differentially expressed genes under different conditions using gene expression microarray data. They used a robust Bayesian hierarchical model for testing hypotheses relating to differential gene expression. Before this, an initial statistical treatment was given by [3] to detect differentially expressed genes. Variants of t or F-statistics were used by [4]. A modification of the t-statistic was used by [5, 6]. Using a permutation technique, [7] estimated and controlled the false discovery rate (FDR). An empirical Bayes approach was used by [8–10]. Fully Bayesian approaches, using Markov Chain Monte Carlo (MCMC), were applied by [11, 12]. [2, 13] introduced a hierarchical t distribution formulation which is more robust to outliers than the normal model. [2] call their model BRIDGE (Bayesian Robust Inference for Differential Gene Expression). BRIDGE (2013, http://www.rglab.org) has recently been released as an R package consisting of several functions for testing differential expression in multiple samples.

[14] introduced a Laplace mixture model as a long-tailed alternative to the normal distribution when identifying differentially expressed genes in microarray experiments. This model permits greater flexibility than models in current use as it has the potential, at least with sufficient data, to accommodate both whole genome and restricted coverage arrays. The Laplace model appears to give some improvement in fit to data. [15] also emphasized the potential insufficiency of the Gaussian noise model in microarray data analysis and proposed different noise models. In their work the goodness of fit of noise models is quantified by a hierarchical Bayesian analysis of variance model, which predicts normalized expression values as a mixture of a Gaussian density and t-distributions with adjustable degrees of freedom. They find that, irrespective of the chosen preprocessing and normalization method, a heavy-tailed noise model is a better fit than a simple Gaussian. [16] discussed robust nonlinear differential models of gene expression. Also, variance-modeling considerations for robust data analysis were emphasized by [17].

In the current paper, an extension of the Bayesian hierarchical model of [2, 13, 18] is proposed using the family of normal/independent (N/I) distributions for errors to achieve some more robust models for analyzing gene expression microarray data. Our approach will let the data themselves determine the best robust model. This family includes normal and t distributions as well as slash, Laplace and contaminated normal distributions.

As in [2], the model includes an exchangeable prior for the variances, allowing each gene to have a different variance, and a prior for the model that allows us to detect differentially expressed genes in multiple-sample experiments. In practice, the prior is a mixture of singular Gaussian distributions. Inference is based on the posterior probabilities of differential expression calculated from the chosen model. We call our method BRIN/IDGE (Bayesian Robust Inference using N/I family for Differential Gene Expression). Parameter estimation is carried out using Markov Chain Monte Carlo. The method is illustrated using two publicly available gene expression data sets, which are described in the next section. Also, some simulation studies are conducted in order to illustrate the proposed approach.

[2] compared their BRIDGE method for testing differentially expressed genes with other methods: (i) the t-test, (ii) the Bonferroni-adjusted t-test, (iii) significance analysis of microarrays (SAM, [7]), (iv) empirical Bayes lognormal-normal and (v) gamma-gamma models [8] and (vi) Efron’s empirical model [10]. In this paper, we will not only compare the performance of members of BRIN/IDGE with these six methods, but also compare the performance of different members of BRIN/IDGE with one another, and determine which one provides the best fit to two-sample and multiple-sample data sets.

This article is organized as follows. Section 2 introduces the data sets and some notation. In Section 3 we give an overall view of normal/independent (N/I) distributions. In Section 4, we present the Bayesian hierarchical model using the N/I structure. In Section 5, we apply the proposed models to the two datasets introduced in Section 2 and test the differential expressions using different members of the family of N/I distributions and compare the performance of these models based on Bayesian false discovery rate (bFDR), Bayesian true negative rate (bTNR), Bayesian false negative rate (bFNR) and area under the curve (AUC). Section 6 contains the results of some simulation studies. In the final section we present some conclusions. Also, more details of members of the normal/independent (N/I) distributions and an analysis of Bayesian false discovery rate are given in appendices A and B, respectively.

2 Data

2.1 Golub data

Gene expression data (3051 genes and 38 tumor mRNA samples) are extracted from the leukemia microarray study of [19]. Pre-processing was done as described in [4]. The challenge of cancer treatment has been to target specific therapies to pathogenetically distinct tumor types in order to maximize efficacy and minimize toxicity. [19] chose acute leukemias as a test case. They classified acute leukemias as those arising from lymphoid precursors (acute lymphoblastic leukemia, ALL) or from myeloid precursors (acute myeloid leukemia, AML). The leukemia data set consisted of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. RNA prepared from bone marrow mononuclear cells was hybridized to high-density oligonucleotide microarrays, which were produced by Affymetrix and contained probes for 3051 human genes. For each gene, a quantitative expression level is available. The data take the form Y_isr, i = 1, 2, …, N; s = 1, 2 and r = 1, 2, …, n_s, where Y_isr is the log transformed estimated intensity for gene i in group s from replicate r.

Fig 1 displays profiles of the log transformed estimated intensities against replicate number for the two subsets of Golub data. The profiles of four randomly selected genes are also drawn (genes 10, 91, 1059 and 2280). These profiles show that, for example, gene #2280 may be identified as being expressed differentially between the two groups (ALL and AML). The profiles of genes #191 and #1059 in the two groups have similar behavior; thus, these genes may not be identified as differentially expressed genes. Also, this figure shows that the log transformed estimated intensities for some genes (for example, gene #10) include outliers (for example, replicate 21 in the ALL group).

Fig 1. Profiles of log transformed estimated intensities from each group.

Left panel for ALL group and right panel for AML group.

https://doi.org/10.1371/journal.pone.0123791.g001

2.2 The hereditary breast cancer data

Many cases of hereditary breast cancer are due to mutations in either the BRCA1 gene or the BRCA2 gene. The histopathological changes in these cancers are often characteristic of the mutant gene. We hypothesize that the genes expressed by these two types of tumor are also distinctive, perhaps allowing us to identify cases of hereditary breast cancer on the basis of gene-expression profiles. [20] conducted a study to examine breast cancer tissues from patients carrying mutations in the predisposing genes, BRCA1 or BRCA2, or from patients not expected to carry a hereditary mutation.

[20] examined RNA from samples of primary tumors from seven carriers of the BRCA1 mutation, eight carriers of the BRCA2 mutation, and seven patients with sporadic cases of breast cancer. In these data, samples or groups refer to tissue sample types and there is no color swap. A set of 3226 genes was pre-selected by [20] by filtering the raw images. The data take the form Y_isr ≡ log₂(x_isr/ref_ir), i = 1, …, N; r = 1, …, n_s; s = 1, 2, 3, where x_isr is the intensity from gene i of the r^th (biological) replicate in group s, and ref_ir is the intensity from a common reference sample. Note that here, [20] used a reference sample because there are three groups of interest: BRCA1 mutation, BRCA2 mutation, and sporadic cases of breast cancer.

Fig 2 displays the gene-expression profiles (log ratios) against the replicate number of tumors with BRCA1 mutations, tumors with BRCA2 mutations, and sporadic tumors. This figure shows that there are some differences, particularly in terms of the variation in log ratios between breast tumors with BRCA2 mutations and those with other mutations. There is greater variation in log ratios for the BRCA2 than for the other two mutations. Four randomly selected genes are highlighted in different linestyles. They allow us to follow the behaviour of the randomly selected genes in the three groups. Also, this figure shows that some genes (for example, gene #1066) include outliers in their replications (for example, replicate 2 in group 2).

Fig 2. Profiles of log ratios from each group.

Left panel for BRCA1, middle panel for BRCA2 and right panel for sporadic case.

https://doi.org/10.1371/journal.pone.0123791.g002

3 Normal/independent (N/I) distribution

A normal/independent (N/I) distribution [21] is defined through the stochastic representation $Y = μ + e / \sqrt{u}$ , where μ is a location parameter, u is a positive random variable with density g(u; ν), ν is a scalar or vector of parameters, and the error e is a normally distributed random variable with mean 0 and variance σ.

Given u, Y follows a normal distribution with location parameter μ and scale parameter u⁻¹ σ. Then, the marginal distribution of Y is $f (y ∣ μ, σ, ν) = \int_{0}^{\infty} ϕ (y; μ, u^{- 1} σ) d G (u; ν),$ where ϕ(.; μ, σ) is the density function of N(μ, σ) and the G(u; ν) is the distribution function of u.

The class of N/I distributions includes the t, the slash, the contaminated normal, and the Laplace distributions. All these distributions have heavier tails than those of the normal distribution, and can be used for robust inference. These distributions are described in S1 Appendix.
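The stochastic representation above translates directly into a sampler. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, the mixing distributions for the t, slash and Laplace cases are taken from the prior specification in Section 4.1, and the contaminated normal case assumes the usual two-point mixing variable (u equals a factor with some probability and 1 otherwise). Which element of ν plays which role is defined in S1 Appendix and is an assumption here.

```python
import math
import random


def draw_mixing_weight(rng, family, nu=None):
    """Draw the positive mixing variable u for one observation."""
    if family == "normal":
        return 1.0
    if family == "t":
        # u ~ Gamma(shape=nu/2, rate=nu/2); gammavariate expects a scale (1/rate).
        return rng.gammavariate(nu / 2.0, 2.0 / nu)
    if family == "slash":
        return rng.betavariate(nu, 1.0)
    if family == "contaminated":
        # Assumption: nu = (probability of contamination, factor in (0, 1)).
        prob, factor = nu
        return factor if rng.random() < prob else 1.0
    if family == "laplace":
        # 1/u ~ Exponential with rate nu (rate parameterisation assumed).
        w = rng.expovariate(nu)
        return 1.0 / max(w, 1e-300)
    raise ValueError("unknown family: " + family)


def draw_ni(rng, mu, sigma, family, nu=None):
    """Draw Y = mu + e / sqrt(u), with e ~ N(0, sigma) and sigma a variance."""
    u = max(draw_mixing_weight(rng, family, nu), 1e-300)
    e = rng.gauss(0.0, math.sqrt(sigma))
    return mu + e / math.sqrt(u)


rng = random.Random(2015)
sample = [draw_ni(rng, 14.0, 1.0, "t", 3.0) for _ in range(5)]
```

Small values of u inflate the conditional variance σ/u, which is how each family produces occasional outlying observations.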

4 Bayesian Robust Inference using N/I family for Differential Gene Expression (BRIN/IDGE)

We consider two scenarios for modeling differential gene expression under N/I distributional assumptions. One is for the two-sample case and the other is for the multiple-sample case. Let Y_isr, i = 1, 2, …, N; r = 1, 2, …, n_s and s = 1, 2, …, k be the gene expression data for gene i from replicate r in sample s.

4.1 Two-group case

The simplest method for model comparison of two samples is indirect comparison and oligonucleotide arrays [2]. In this scenario, the model can be modified as follows: $Y_{i s r} = μ_{i s} + ɛ_{i s r} / \sqrt{u_{i s r}},$ where $ɛ_{i s r} ∣ τ_{ɛ i} \sim N (0, τ_{ɛ i}^{- 1})$ and U_isr ∼ g(u_isr; v_i).

In this scenario, μ_i = (μ_i1, μ_i2) is modeled with a mixture of two normal distributions as follows: (1) where τ_μ = (τ_μ1, τ_μ2, τ_μ12). Also, $N (μ_{i 1}; 0, τ_{μ 12}^{- 1})$ means that μ_i1 follows a zero mean normal distribution with variance $τ_{μ 12}^{- 1}$ . The first component corresponds to the genes that are not differentially expressed, i.e., μ_i1 = μ_i2, so for a particular gene i the two groups share the same mean and a single prior variance. Likewise, the second component corresponds to those genes that are differentially expressed, i.e., μ_i1 ≠ μ_i2, so we assume independent normal priors with different variances for the two means. The Bayesian framework offers the flexibility required to specify the range of N/I distributions introduced in the previous section.
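For orientation, a two-component prior matching this description can be written as below. This is a sketch under stated assumptions rather than a copy of Eq (1): it mirrors the BRIDGE prior of [2] that this model extends, and the scalar mixing weight p (the prior probability of differential expression) and the indicator notation are assumed.

```latex
\mu_i \mid \tau_\mu, p \;\sim\;
  (1 - p)\, N\!\left(\mu_{i1};\, 0,\, \tau_{\mu 12}^{-1}\right) \mathbf{1}[\mu_{i1} = \mu_{i2}]
  \;+\;
  p\, N\!\left(\mu_{i1};\, 0,\, \tau_{\mu 1}^{-1}\right) N\!\left(\mu_{i2};\, 0,\, \tau_{\mu 2}^{-1}\right) \mathbf{1}[\mu_{i1} \neq \mu_{i2}]
```

Under this form, a Dirichlet(1, 1) prior on the weight pair (1 − p, p) is the same as a uniform prior on p over (0, 1).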

Therefore, in this context, the null and alternative hypotheses for the i^th gene are defined as $H_{0}^{i} : μ_{i 1} = μ_{i 2}$ versus $H_{1}^{i} : μ_{i 1} \neq μ_{i 2}$ .

We have used a Bayesian structure for the above mentioned model. To carry out Bayesian inference, the specification of prior distributions for the unknown parameters is necessary. The prior distributions are given as τ_ɛi ∼ Γ(1, 0.005), p ∼ Dirichlet(ϖ); ϖ = (1, 1)′, τ_μ1, τ_μ2, τ_μ12 ∼ Γ(1, 0.005), i = 1, 2, …, N. To obtain the t distribution, $U_{i s r} \sim Γ (\frac{v_{i}}{2}, \frac{v_{i}}{2})$ and the prior distribution for v_i is U(0, 100). For the contaminated normal distribution, λ_i, γ_i ∼ U(0, 1). For the slash distribution, U_isr ∼ Beta(v_i, 1) and the prior distribution for v_i is Γ(1, 0.005). Finally, for the Laplace distribution, $U_{i s r}^{- 1} \sim \mathrm{Exp} (v_{i})$ and v_i ∼ Γ(1, 0.005). All the priors are chosen to be weakly informative.

4.2 Multiple-group case

Sometimes there are more than two samples and identifying differences in the expression of the same gene between more than two samples may be of interest. Let there be k samples in the study; for example, in the BRCA data, there are three groups: BRCA1, BRCA2 and sporadic cases.

In some situations, tests for complicated null hypotheses can be developed from tests for simpler null hypotheses. The union-intersection method [22] of test construction might be useful when the null hypothesis is conveniently expressed as an intersection, say H₀: θ ∈ ⋂_{γ ∈ Γ}Θ_γ, where Γ is an arbitrary index set that may be finite or infinite.

In the analysis of the gene expression data, the main null hypothesis for three groups is given by $H_{0}^{i} : μ_{i 1} = μ_{i 2} = μ_{i 3} .$ This null hypothesis can be considered as the following union-intersection test: $H_{0}^{i} : (μ_{i 1} = μ_{i 2}) \cap (μ_{i 1} = μ_{i 3}) \cap (μ_{i 2} = μ_{i 3}) . \quad (2)$ Therefore, having defined this hypothesis, one can implement all of the pairwise hypothesis tests. A gene is differentially expressed if at least one of the following hypothesis tests is rejected: ${H_{0}^{i}}^{(1)} : μ_{i 1} = μ_{i 2},$ ${H_{0}^{i}}^{(2)} : μ_{i 1} = μ_{i 3},$ ${H_{0}^{i}}^{(3)} : μ_{i 2} = μ_{i 3} .$

Now let there be k samples in the study. The main null hypothesis for k samples is given by $H_{0}^{i} : μ_{i 1} = μ_{i 2} = \cdots = μ_{i k} .$

This null hypothesis can be considered as the following union-intersection test: $H_{0}^{i} : (μ_{i 1} = μ_{i 2}) \cap (μ_{i 1} = μ_{i 3}) \cap \cdots \cap (μ_{i, k - 1} = μ_{i, k}) .$ Therefore, having defined this hypothesis, one can implement all of the $\frac{k (k - 1)}{2}$ pairwise hypothesis tests. A gene is differentially expressed if at least one of the following hypothesis tests is rejected: ${H_{0}^{i}}^{(1)} : μ_{i 1} = μ_{i 2},$ ${H_{0}^{i}}^{(2)} : μ_{i 1} = μ_{i 3},$ …, ${H_{0}^{i}}^{(\frac{k (k - 1)}{2})} : μ_{i, k - 1} = μ_{i k} .$ In order to address the problem of multiple comparisons when performing $\frac{k (k - 1)}{2}$ pairwise hypothesis tests, one could apply the Bonferroni correction. This means reducing the significance level at which each test is performed: an overall level α is divided by the number of tests, so that with many groups each test is carried out at a level closer to 1% or even 0.1% than to 5%.
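The pairwise decomposition can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the decision rule on pairwise posterior probabilities with threshold κ is assumed to carry over unchanged from the two-group case.

```python
from itertools import combinations


def pairwise_hypotheses(k):
    """List the k(k-1)/2 group pairs (s, t), s < t, tested under H0: mu_is = mu_it."""
    return list(combinations(range(1, k + 1), 2))


def bonferroni_level(alpha, k):
    """Per-test significance level when the overall level alpha is split over all pairs."""
    return alpha / len(pairwise_hypotheses(k))


def is_differentially_expressed(pairwise_posterior, kappa=0.5):
    """Union-intersection rule: reject the overall null if any pairwise null is rejected.

    pairwise_posterior maps a pair (s, t) to P(mu_is != mu_it | data) for one gene.
    """
    return any(prob > kappa for prob in pairwise_posterior.values())


pairs = pairwise_hypotheses(3)          # three groups, as in the BRCA data
level = bonferroni_level(0.05, 3)       # 0.05 split over three tests
gene = {(1, 2): 0.10, (1, 3): 0.72, (2, 3): 0.31}
flag = is_differentially_expressed(gene, kappa=0.5)
```

For three groups the per-test level is 0.05/3, and the example gene is flagged because one pair exceeds the threshold.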

Thus, in this context the structure of each hypothesis test is considered to be the same as in the two-group case. The prior distributions are the same as those considered in Section 4.1, and all the priors are chosen to be weakly informative. The definition of Bayesian false discovery rate is given in S2 Appendix.

5 Applications

5.1 The Golub data

For detecting the differentially expressed genes, model (1) under the N/I distributional assumption is applied. In the Bayesian approach, two parallel MCMC chains with different initial values are run for 20,000 iterations each. We discarded the first 15,000 iterations as pre-convergence burn-in and retained 5,000 for the posterior inference. Convergence of the MCMC chains is checked using the Gelman-Rubin diagnostic [23].
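As an illustration of the convergence check, the basic Gelman-Rubin potential scale reduction factor for one scalar parameter can be computed from the retained draws as follows. This is a minimal sketch of the textbook statistic, not the routine used by the authors; the function name is hypothetical and the chains here are synthetic.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B/n: variance of the chain means.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(0)
chain_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
chain_b = [rng.gauss(0.0, 1.0) for _ in range(5000)]
r_hat = gelman_rubin([chain_a, chain_b])
```

Values close to 1 indicate that the chains are sampling the same distribution; values well above 1 suggest that a longer burn-in is needed.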

Table 1 shows the results for the Golub data. In this table, and other tables in this paper, N, T, SL, CN and Lap are used as abbreviations for the normal, the Student’s t, the slash, the contaminated normal and the Laplace distributions, respectively. A diagnostic tool to identify differentially expressed genes is to compute the posterior probabilities of μ_i1 − μ_i2 ≠ 0, i = 1, 2, …, N. Table 1 shows that the model which assumes a Laplace distribution detects more genes, 983, than models with other distributional assumptions at the κ = 0.5 posterior threshold [P(μ₁ ≠ μ₂∣Data) > κ]. At this threshold, bFDR, bFNR and bTNR of the model under the Laplace distributional assumption are smaller than those for the models assuming other distributions. At posterior thresholds 0.7, 0.9 and 0.95, although bFDR for the model under the Laplace distributional assumption is the smallest one, the best fitting model based on bTNR and bFNR is the model which assumes the t distribution. Therefore, a more conservative conclusion would be to choose the t distribution.

Table 1. Number of differentially expressed genes, bFDR, bTNR and bFNR in the Golub data.

The values of bFDR, bTNR and bFNR for the best model are highlighted in bold.

https://doi.org/10.1371/journal.pone.0123791.t001

An ROC curve can be plotted using bFPR versus bTPR for the possible posterior threshold κ. The values of bFPR and bTPR, using Eqs (1)–(3) in S1 Appendix, can be estimated using the following formulae: (3) (4) This curve usually has a concave shape connecting the points (0, 0) and (1, 1).
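To make the idea concrete, the sketch below computes plug-in Bayesian error rates from a vector of posterior probabilities of differential expression. It uses a common estimator in which each gene contributes its posterior probability as an expected true positive; this is an assumption and is not guaranteed to coincide with the formulae in Eqs (3) and (4) or in the appendix. The function name is hypothetical.

```python
def bayesian_rates(post_prob, kappa):
    """Plug-in Bayesian FDR, TPR and FPR at posterior threshold kappa.

    post_prob[i] is P(gene i is differentially expressed | data).
    """
    called = [p > kappa for p in post_prob]
    exp_tp = sum(p for p, c in zip(post_prob, called) if c)
    exp_fp = sum(1.0 - p for p, c in zip(post_prob, called) if c)
    exp_fn = sum(p for p, c in zip(post_prob, called) if not c)
    exp_tn = sum(1.0 - p for p, c in zip(post_prob, called) if not c)
    n_called = sum(called)
    return {
        "called": n_called,
        "bFDR": exp_fp / n_called if n_called else 0.0,
        "bTPR": exp_tp / (exp_tp + exp_fn) if exp_tp + exp_fn > 0 else 0.0,
        "bFPR": exp_fp / (exp_fp + exp_tn) if exp_fp + exp_tn > 0 else 0.0,
    }


rates = bayesian_rates([0.9, 0.8, 0.2, 0.1], kappa=0.5)
```

Evaluating the function over a grid of κ values gives the (bFPR, bTPR) pairs needed to trace an ROC curve.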

Fig 3 shows the ROC curves under different distributional assumptions. This figure shows that the ROC curve for the model under the Laplace distribution is higher than the ROC curve for the models under the other distributional assumptions. Also, in this figure the area under the curve (AUC) for each distribution is reported. This criterion shows that the model under the Laplace distributional assumption (with the highest AUC = 0.9239) is the best fitting model.

Fig 3. ROC curve and the area under the curve (AUC) under different distributional assumptions for the Golub data.

https://doi.org/10.1371/journal.pone.0123791.g003

Also, Fig 4 summarizes the posterior probabilities from the BRIN/IDGE method using different distributional assumptions for the errors. This figure plots the posterior probabilities of μ₁ − μ₂ ≠ 0 versus the posterior difference between the means of the two groups. This figure shows that the differences μ₁ − μ₂ are shrunk towards zero, and hence the genes with small values of μ₁ − μ₂ have very low posterior probabilities of differential expression.

Fig 4. Posterior probabilities against the posterior differences between μ₁ and μ₂ from the model with different distributional assumptions for the Golub data.

https://doi.org/10.1371/journal.pone.0123791.g004

Fig 5 shows the heatmap of the 983 genes that were best differentiated between the two types of tumor as determined by the Laplace distributional assumption for the errors at the 0.5 posterior threshold. This figure indicates that the detected genes may be divided into two clusters: some of the detected genes have higher levels of gene expression in the ALL sample and some have higher levels of gene expression in the AML sample. For comparison with existing methods, we use t-tests and Bonferroni-adjusted t-tests to detect differentially expressed genes in the Golub data. The results show that, for the t-test, 1045 p-values are less than 0.05. The number of genes detected by the Bonferroni-adjusted t-tests is 98.

Fig 5. Heatmap of intensities of genes that were best differentiated between the two types of tumor for the Golub data.

https://doi.org/10.1371/journal.pone.0123791.g005

Significance Analysis of Microarrays (SAM) is a statistical method that has been developed by [7] for detecting differentially expressed genes. This method performs a two-class analysis using either a modified t-statistic or a (standardized) Wilcoxon rank statistic, and a multiclass analysis using a modified F-statistic. SAM uses regularized t-tests where the estimate of the standard deviation is regularized with a common estimate of the standard deviation and controls an estimate of the FDR value.

Let X_ij, j = 1,…, J and Y_ik, k = 1,…, K, i = 1, 2, …, n be the expression level of gene i under experimental conditions 1 and 2, respectively. In Table 2, the total number of genes declared significant is $\# \{ i : ∣ d_{(i)} - {\overline{d}}_{(i)} ∣ > Δ \}$ , where $d_{(i)} = \frac{{\overline{X}}_{i} - {\overline{Y}}_{i}}{s (i) + s_{0}}$ , and ${\overline{X}}_{i}$ and ${\overline{Y}}_{i}$ are the averages of expression level for gene i under experimental conditions 1 and 2. Also, $s (i) = \sqrt{a \left\{ \sum_{j = 1}^{J} {(X_{i j} - {\overline{X}}_{i})}^{2} + \sum_{k = 1}^{K} {(Y_{i k} - {\overline{Y}}_{i})}^{2} \right\}}$ , a = (1/J + 1/K)/(J + K − 2). The constant s₀ is chosen to minimize the coefficient of variation of d_(i), i = 1, 2, …, n (see [24], for more details). The results for this method are given in Table 2. In this table, “False” is the number of falsely called genes [7], “Called” is the number of genes called differentially expressed and FDR is the estimated FDR. In the SAM method, one has to choose the Δ value that gives the best compromise in terms of called genes, false genes and False Discovery Rate (FDR). In microarray analysis, it is very important to have statistically robust results, but we have to keep in mind that a set of called genes that is too small cannot describe the biological meaning of the experiment. In general, the choice of cut-off is subjective and there is no definitive way of choosing it.
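The relative-difference statistic for a single gene follows directly from these definitions. The sketch below is illustrative only: the function name is hypothetical, s₀ is passed in as a given constant rather than tuned, and the permutation step that produces the expected values and the FDR estimate is not shown.

```python
import math


def sam_statistic(x, y, s0):
    """Relative difference d = (mean(x) - mean(y)) / (s + s0) for one gene."""
    j, k = len(x), len(y)
    x_bar = sum(x) / j
    y_bar = sum(y) / k
    a = (1.0 / j + 1.0 / k) / (j + k - 2)
    sum_sq = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s = math.sqrt(a * sum_sq)
    return (x_bar - y_bar) / (s + s0)


d_plain = sam_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], s0=0.0)
d_regularised = sam_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], s0=1.0)
```

With s₀ = 0 the statistic reduces to the pooled-variance two-sample t-statistic; a positive s₀ shrinks the statistic towards zero, which damps genes whose small estimated variance would otherwise inflate it.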

Table 2. The results of applying SAM to the Golub data.

“False” is the number of falsely called genes, “Called” is the number of genes called differentially expressed and FDR is the estimated FDR.

https://doi.org/10.1371/journal.pone.0123791.t002

The results show that, at an estimated FDR of 0.44276, 2739 genes are detected as being differentially expressed (these results are obtained by using the SAM package in R). In our proposed method, the largest value for Bayesian FDR is 0.1330. As shown in Table 1, the number of differentially expressed genes in this case is 983. Thus, in Table 2, a more realistic FDR is 0.10508 with a Δ of 0.7, which results in 1248 genes being detected as differentially expressed. That is, as the FDR is reduced from 0.44276 to 0.10508, Δ increases and the number of genes called differentially expressed falls from 2739 to 1248.

In Efron’s empirical model, a gene will be called differentially expressed if its posterior probability of being differentially expressed is larger than or equal to 1 − α. The results are shown in Table 3 and are obtained by using the SAM package in R. When 1 − α = 0.5, an FDR of 0.1593 results in 1466 genes being defined as differentially expressed. A more realistic FDR of 0.08855 (when 1 − α = 0.7) results in 1131 genes being identified as differentially expressed. Also, the empirical Bayes lognormal-normal and gamma-gamma models, controlling the FDR at 10%, detect 650 and 861 genes, respectively. The results are obtained by using the EBarrays package in R.

Table 3. The results of applying Efron’s empirical model to the Golub data.

https://doi.org/10.1371/journal.pone.0123791.t003

5.2 The BRCA data

In this section, we analyze the BRCA data using the model described in Section 4.2. For detecting differentially expressed genes, we apply the union-intersection test using the BRIN/IDGE method.

The model comparison for this data set can be found in Table 4. This shows that, for each κ value, the models under the different distributional assumptions detect nearly the same number of differentially expressed genes and give similar values of the criteria bFDR, bTNR and bFNR. We conclude that, for the BRCA data, there is little to choose between the range of models making different distributional assumptions. Fig 6 shows the ROC curves and the AUCs for the models fitted under different distributional assumptions. This figure shows that all the models perform similarly well.

Table 4. Number of differentially expressed genes, bFDR, bTNR and bFNR in the BRCA data set.

The values of bFDR, bTNR and bFNR for the best model are highlighted in bold.

https://doi.org/10.1371/journal.pone.0123791.t004

Fig 6. ROC curve and the area under the curve (AUC) under different distributional assumptions for the BRCA data.

https://doi.org/10.1371/journal.pone.0123791.g006

Table 5 shows the estimates of the mixing probabilities for the five patterns of gene expression. This table indicates that all the models produce nearly the same probabilities p_i, i = 1, 2, …, 5.

Table 5. Estimates of the mixing probabilities for the five patterns of gene expression for the BRCA data set.

p₁ : μ₁ = μ₂ = μ₃, p₂ : μ₁ = μ₂ ≠ μ₃, p₃ : μ₂ ≠ μ₁ = μ₃, p₄ : μ₁ ≠ μ₂ = μ₃, p₅ : μ₁ ≠ μ₂ ≠ μ₃.

https://doi.org/10.1371/journal.pone.0123791.t005

Fig 7 presents the posterior probabilities from the BRIN/IDGE method for each test using different distributional assumptions for the errors. This figure shows that small differences in the means have been shrunk towards zero and hence have very low posterior probability of differential gene expression. This figure also shows that the larger the difference in means μ₁ − μ₂, the larger the posterior probability.

Fig 7. Posterior probabilities against the posterior differences between μ₁ and μ₂ from the model with different distributional assumptions for the BRCA data.

https://doi.org/10.1371/journal.pone.0123791.g007

6 Simulation Studies

In this section, some simulation studies are conducted in order to illustrate the performance of our proposed methodology. In each simulation study, N simulated genes are generated and M iterations are performed.

To record a case identified by posterior probabilities as being differentially expressed, we define the following indicator variables for i = 1, 2, …, N and k = 1, 2, …, M: $I_{i k}^{\mathrm{Method}} = \begin{cases} 0 & μ_{i 1} = μ_{i 2} \\ 1 & \text{otherwise} \end{cases}$ such that $I_{i k}^{\mathrm{Method}} = 1$ if P(μ_i1 ≠ μ_i2∣y) > 0.5. Also, for the real situation, (5) In the generated data set, we let 100p% of data have different means and 100(1 − p)% of the generated data have the same means (see subsections 6.1 and 6.2 for more details). So, let p be the proportion of differentially expressed genes. The true positive rate (TPR), the false positive rate (FPR) and true discovery rate (TDR) for the k^th iteration can be calculated as follows: (6) (7) (8) Averaging across all iterations, we have: (9) (10) (11) A receiver operating characteristic (ROC) curve is a plot of FPR versus TPR for the possible cutoffs κ [P(μ_i1 ≠ μ_i2∣y) > κ]. An ROC curve is a two-dimensional depiction of classifier performance [25]. A common method of comparing classifiers is the area under the ROC curve (often referred to as the AUC). In our simulation study, we calculate AUC_k, k = 1, 2, …, M for each iteration and we report $\overline{\mathrm{AUC}} = \frac{1}{M} \sum_{k = 1}^{M} \mathrm{AUC}_{k}$ . AUC_k (and consequently $\overline{\mathrm{AUC}}$ ) is a portion of the area of the unit square; its value will always lie between 0 and 1. The larger the value of $\overline{\mathrm{AUC}}$ , the better is the performance of the classifier.
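For a single simulated data set, these quantities can be computed as sketched below. The code uses the standard definitions (TPR as true positives over truly differential genes, FPR as false positives over truly non-differential genes, TDR as true positives over called genes), which are assumed to match the omitted formulae; the function names are hypothetical. The AUC is obtained with the trapezoidal rule over all observed cut-offs and assumes that both classes of genes are present.

```python
def classification_rates(called, truth):
    """TPR, FPR and TDR for one iteration, given method calls and true states."""
    tp = sum(1 for c, t in zip(called, truth) if c and t)
    fp = sum(1 for c, t in zip(called, truth) if c and not t)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    n_called = tp + fp
    return {
        "TPR": tp / n_pos if n_pos else 0.0,
        "FPR": fp / n_neg if n_neg else 0.0,
        "TDR": tp / n_called if n_called else 0.0,
    }


def roc_auc(post_prob, truth):
    """Area under the ROC curve, sweeping the cut-off over the observed probabilities."""
    points = [(0.0, 0.0)]
    for kappa in sorted(set(post_prob), reverse=True):
        rates = classification_rates([p >= kappa for p in post_prob], truth)
        points.append((rates["FPR"], rates["TPR"]))
    points.append((1.0, 1.0))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


truth = [True, True, False, False]
post_prob = [0.9, 0.4, 0.6, 0.1]
rates_at_half = classification_rates([p > 0.5 for p in post_prob], truth)
auc = roc_auc(post_prob, truth)
```

Averaging the per-iteration values over the M simulated data sets gives the reported summary measures.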

In order to perform a Bayesian analysis, we need a number of iterations for each gene including a number for pre-convergence burn-in. In this simulation study, the MCMC chains are run for 15,000 iterations each. Then, we discarded the first 10,000 iterations as pre-convergence burn-in and retained 5,000 for the posterior inference. More details of the approaches can be found in the following sub-sections.

6.1 Simulation study 1

In this section, a simulation study is conducted to check the performance of the proposed BRIN/IDGE method, when the real distribution of the gene expression is the contaminated normal distribution. This distribution has a bimodal form which is commonly found in gene expression data. For this purpose, a sample with N = 1000 genes is evaluated and M = 100 iterations are performed. We consider the model Y_isr = μ_is + ɛ_isr such that ɛ_isr ∼ CN(0, σ_i, ν_i), r = 1, 2, …, n_s and s = 1, 2. To generate the simulated data sets, we fix μ_is = 14, σ_i = 1 and, in ν_i = (λ_i, γ_i)′, λ_i = 0.1, and we consider two values for γ_i: 0.10 and 0.25. Also, n₁ = 27 and n₂ = 11 are considered.

To verify how the method behaves when the control group moves away from the treatment group, we randomly choose 5% of the genes and shift their location in the first group. The observations for these genes in the first group are generated with location parameter μ_i1 + δ, where δ ∈ {3, 5}.
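The data-generating design can be sketched as follows. This is an illustration, not the authors' simulation code: the function and argument names are hypothetical, and the contaminated normal errors assume that one element of ν is the probability of contamination and the other is a factor in (0, 1) that divides the variance. Which of λ and γ plays which role is defined in S1 Appendix and is not assumed here beyond the argument names.

```python
import math
import random


def simulate_cn_data(n_genes=1000, n1=27, n2=11, mu=14.0, sigma=1.0,
                     contam_prob=0.1, contam_factor=0.1,
                     delta=3.0, frac_de=0.05, seed=0):
    """Simulate two-group data with contaminated normal errors.

    Returns (data, truth): data[i] = (group1, group2) for gene i and
    truth[i] is True when gene i was shifted by delta in group 1.
    """
    rng = random.Random(seed)

    def cn_error():
        # Assumed mixing variable: u = contam_factor with probability contam_prob, else 1.
        u = contam_factor if rng.random() < contam_prob else 1.0
        return rng.gauss(0.0, math.sqrt(sigma / u))

    shifted = set(rng.sample(range(n_genes), round(frac_de * n_genes)))
    data, truth = [], []
    for i in range(n_genes):
        shift = delta if i in shifted else 0.0
        group1 = [mu + shift + cn_error() for _ in range(n1)]
        group2 = [mu + cn_error() for _ in range(n2)]
        data.append((group1, group2))
        truth.append(i in shifted)
    return data, truth


data, truth = simulate_cn_data(delta=3.0, seed=1)
```

Repeating the call with M different seeds gives the replicated data sets over which the performance measures are averaged.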

The results of this simulation study are reported in Table 6. This table presents the results of $\overline{\mathrm{TPR}}$ , $\overline{\mathrm{FPR}}$ , $\overline{\mathrm{TDR}}$ and $\overline{\mathrm{AUC}}$ under different distributional assumptions. These results show the good performance of the robust models, in particular the model which assumes the contaminated normal distribution for the errors. This table shows that the normal distribution is not able to detect the differentiated genes. Also, as δ is increased from 3 to 5, the ability of all distributions to detect differentiated genes is improved, although the reliability of the robust distributions is greater than that of the normal one. The results show that, as γ is increased from γ = 0.1 to γ = 0.25, most of the comparison criteria ( $\overline{\mathrm{TPR}}$ , $\overline{\mathrm{TDR}}$ and $\overline{\mathrm{AUC}}$ ) for the normal distribution are severely reduced in value, but the robust distributions have better ability to detect differentially expressed genes.


Table 6. Results of simulation study for n₁ = 27 and n₂ = 11.

Data are generated under the contaminated normal assumption for the errors. The values of $\overline{\mathrm{AUC}}$ for the best model are highlighted in bold.

https://doi.org/10.1371/journal.pone.0123791.t006

6.2 Simulation study 2

In this subsection, as in subsection 6.1, a simulation study is conducted to check the performance of the BRIN/IDGE method, as well as the usual normal model, when data are generated from the symmetric t distribution. For this purpose, the model defined in Section 4.1, with ɛ_isr ∼ t(0, σ_i, ν_i), σ_i = 1 and ν_i = 2, is used. As in Section 6.1, μ_is = 14 and δ ∈ {3, 5} are considered.
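A minimal sketch of this second design is given below, assuming a location-scale form Y = μ + σ·T with T drawn from a standard Student's t distribution with ν degrees of freedom, and the same 5% shifted genes as in the first study. The function name and the seed are hypothetical.

```python
import numpy as np

def simulate_study2(rng, n_genes=1000, n1=27, n2=11, mu=14.0,
                    delta=3.0, prop_de=0.05, sigma=1.0, nu=2.0):
    """Return (y1, y2, is_de) with Student's t errors (nu degrees of freedom)."""
    is_de = np.zeros(n_genes, dtype=bool)
    n_de = int(round(prop_de * n_genes))
    is_de[rng.choice(n_genes, size=n_de, replace=False)] = True
    mu1 = mu + delta * is_de   # shifted location for the selected genes, group 1
    y1 = mu1[:, None] + sigma * rng.standard_t(nu, size=(n_genes, n1))
    y2 = mu + sigma * rng.standard_t(nu, size=(n_genes, n2))
    return y1, y2, is_de

rng = np.random.default_rng(7)   # arbitrary seed
y1, y2, is_de = simulate_study2(rng, delta=3.0)

# With nu = 2 the error variance is infinite, so medians are a more stable
# summary of location than means when inspecting the simulated data.
shift = np.median(y1[is_de]) - np.median(y2[is_de])
```

The infinite variance at ν = 2 is what makes this a demanding scenario for a model that assumes normal errors.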

The results of this simulation study are summarized in Table 7. These results show that the robust models detect differentially expressed genes better than the normal model. Also, as δ is increased from 3 to 5, the ability of all distributions to detect differentially expressed genes improves. When δ = 3, the slash distributional assumption for the errors provides the best performance among the models, whereas for δ = 5 the Student’s t distribution performs best.


Table 7. Results of simulation study for n₁ = 27 and n₂ = 11.

Data are generated under the t distributional assumption for the errors. The values of $\overline{\mathrm{AUC}}$ for the best model are highlighted in bold.

https://doi.org/10.1371/journal.pone.0123791.t007

7 Conclusion

In this paper, we have proposed the use of robust models for detecting differentially expressed genes. For this purpose, some powerful distributions known as normal/independent (N/I) distributions are used. These distributions include the Student’s t, the slash, the contaminated normal and the Laplace distributions. We have applied our proposed approach in two-group and multiple-group scenarios. A union-intersection test is used for detecting differential gene expression in the multiple-group case. The source code, written in R using the R2OpenBUGS package, is available at “bs.ipm.ac.ir/softwares/BRIN/index.jsp”.

To investigate the performance of our proposed approach, some simulation studies have been performed. Also, two real data sets have been analyzed, and the models have been compared using bFDR, bTNR, bFNR and the area under the ROC curve. We have demonstrated the flexibility of robust models in identifying differentially expressed genes. In other words, a well-performing model in the class of N/I models should be identified in the light of the data. As an extension, one may consider the use of the skew-normal/independent family of models [26] to analyze gene expression data.

Supporting Information

S1 Appendix.

https://doi.org/10.1371/journal.pone.0123791.s001

(PDF)

S2 Appendix.

https://doi.org/10.1371/journal.pone.0123791.s002

(PDF)

Acknowledgments

The research presented in this study was carried out on the High Performance Computing Cluster supported by the Computer Science department of Institute for Research in Fundamental Sciences (IPM). We would like to thank the Iranian National Science Foundation (INSF BS1392201) for their support.

Author Contributions

Conceived and designed the experiments: MG TB DB. Performed the experiments: MG TB. Analyzed the data: MG TB. Contributed reagents/materials/analysis tools: MG TB DB. Wrote the paper: MG TB DB.

References

1. Mallick BK, Gold DL, Baladandayuthapani V. Bayesian analysis of gene expression data. Wiley, Chichester, U.K. 2009.
2. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 2006; 62(1): 10–18. pmid:16542223
3. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 1997; 2: 364–374. pmid:23014960
4. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002; 12: 111–139.
5. Chu G, Narasimham B, Tibshirani R, Tusher V. SAM “significant analysis of microarrays” users guide and technical document. Stanford University. 2002.
6. Baldi P, Long A. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17: 509–519. pmid:11395427
7. Tusher V, Tibshirani R, Gilbert C. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences USA 2001; 98: 5116–5121.
8. Newton MC, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 2001; 8: 37–52. pmid:11339905
9. Kendziorski C, Newton M, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 2003; 22: 3899–3914. pmid:14673946
10. Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 2004; 99: 96–104.
11. Ibrahim J, Chen MH, Gray R. Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association 2002; 97: 88–99.
12. Tadesse M, Ibrahim J, Mutter G. Identification of differentially expressed genes in high-density oligonucleotide arrays accounting for the quantification limits of the technology. Biometrics 2003; 59: 542–554. pmid:14601755
13. Gottardo R, Raftery AE, Yeung KY, Bumgarner R. Robust estimation of cDNA microarray intensities. Technical Report 438, Statistics Department, University of Washington, Seattle; 2003.
14. Bhowmick D, Davison AC, Goldstein DR, Ruffieux Y. A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics 2006; 7: 630–641. pmid:16565148
15. Posekany A, Felsenstein K, Sykacek P. Biological assessment of robust noise models in microarray data analysis. Bioinformatics 2011; 27(6): 807–81. pmid:21252077
16. Haye A, Albert J, Rooman M. Robust nonlinear differential equation models of gene expression evolution across Drosophila development. BMC Research Notes 2012; 5: 46. pmid:22260205
17. Subramaniam S, Hsiao G. Gene-expression measurement: variance-modeling considerations for robust data analysis. Nat Immunol 2012; 13: 199–203. pmid:22344273
18. Gottardo R, Li W, Evan Johnson W, Shirley Liu X. A flexible and powerful Bayesian hierarchical model for ChIP-Chip experiments. Biometrics 2008; 64: 468–478. pmid:17888037
19. Golub TR, Slonim DK, Tamayo P, Huard C. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286: 531–537. pmid:10521349
20. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, et al. Gene expression profiles in hereditary breast cancer. New England Journal of Medicine 2001; 344: 539–548. pmid:11207349
21. Lange KL, Sinsheimer JS. Normal/independent distributions and their applications in robust regression. Journal of the American Statistical Association 1993; 2: 175–198.
22. Casella G, Berger RL. Statistical Inference (Second Edition). Duxbury Press/Thomson Learning, Pacific Grove, CA; 2002.
23. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science 1992; 7: 457–511.
24. Zhang S. A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC Bioinformatics 2007; 8: 230. pmid:17603887
25. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters 2006; 27: 861–874.
26. Lachos VH, Ghosh P, Arellano-Valle RB. Likelihood Based Inference for Skew-Normal/Independent Linear Mixed Model. Statistica Sinica 2010; 20: 303–322.
