Abstract
In toxicological concentration-response studies, a frequent goal is the determination of an ‘alert concentration’, i.e. the lowest concentration where a notable change in the response in comparison to the control is observed. In high-throughput gene expression experiments, e.g. based on microarray or RNA-seq technology, concentration-response profiles can be measured for thousands of genes simultaneously. One approach for determining the alert concentration is given by fitting a parametric model to the data which allows interpolation between the tested concentrations. It is well known that the quality of a model fit improves with the number of measured data points. However, adding new replicates for existing concentrations or even several replicates for new concentrations is time-consuming and expensive. Here, we propose an empirical Bayes approach to information sharing across genes, where in essence a weighted mean of the individual estimate for one specific parameter of a fitted model and the mean of all estimates of the entire set of genes is calculated as a result. Results of a controlled plasmode simulation study show that for many genes a notable improvement in terms of the mean squared error (MSE) between estimate and true underlying value of the parameter can be observed. However, for some genes, the MSE increases, and this cannot be prevented by using a more sophisticated prior distribution in the Bayesian approach.
Citation: Kappenberg F, Rahnenführer J (2023) Information sharing in high-dimensional gene expression data for improved parameter estimation in concentration-response modelling. PLoS ONE 18(10): e0293180. https://doi.org/10.1371/journal.pone.0293180
Editor: Shamik Polley, West Bengal University of Animal and Fishery Sciences, INDIA
Received: January 18, 2023; Accepted: October 7, 2023; Published: October 20, 2023
Copyright: © 2023 Kappenberg, Rahnenführer. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original data was published by Krug et al (2013, https://doi.org/10.1007/s00204-012-0967-3). All relevant data for reproducing the results are uploaded to the GitHub repository at https://github.com/FKappenberg/Paper-InformationSharingAcrossGenes.
Funding: FK, JR were supported (in part) by the Research Training Group “Biostatistical Methods for High-Dimensional Data in Toxicology” (RTG 2624, P1) funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation - Project Number 427806116). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
In toxicological research, concentration-response analyses are an integral part of understanding the properties of compounds. Typical endpoints include the viability of cells treated with increasing concentrations of the specific compounds, protein measurements, or gene expression measurements. Often, the goal is the determination of an ‘alert concentration’, i.e. the lowest concentration where a notable change in response in comparison to the control is observed. This alert concentration can be determined based on multiple testing against the control, or based on the fitting of a parametric curve. Examples for observation-based alert concentrations, i.e. where multiple comparisons of the actually considered concentrations against the negative control are performed, are given, among others, by the LOEC (lowest observed effect concentration) and the NOEC (no observed effect concentration) [1].
When moving to parametric modelling, calculating alert concentrations also outside of the measured concentrations is possible. One commonly used approach is given by the effective concentration (EC value), which corresponds to the concentration where a pre-specified percentage of the maximal observed effect is attained [2]. For cell viability experiments, where the measured response is given by some percentage, EC values are typically calculated in an absolute way, i.e. as the concentration, where the fitted curve attains the specific pre-defined percentage. In applications such as gene expression data, where the responses themselves do not correspond to percentages, EC values are calculated in a relative way. In these cases, a certain percentage of the overall effect, i.e. the difference in the response values between the highest and lowest measured concentration (or between the two asymptotes of a fitted model), needs to be attained by the fitted curve. In some applications, analogously to the concept of the LOEC, the lowest concentration is of interest where some effect in comparison to the control is observed. Such concentrations can be found via parametric modelling and the subsequent calculation of an alert concentration such as the LEC (lowest effective concentration, [3, 4]) or via the BMD approach (benchmark doses, [5]).
Especially for gene expression as considered endpoint, due to high-throughput technology such as microarray technology, RNA-seq or TempO-seq data [6], often many genes are considered simultaneously for only a very small number of samples. Adding additional replicates for the already chosen concentrations or even several replicates for new concentrations is very time-consuming and expensive. However, the quality of fitted curves improves with more available data.
In this work, we propose a new method that uses information sharing across genes in order to improve the estimation of an alert concentration. This approach is a relaxation of the approach of common parameters. For example, the authors of [7] have shown some improvement in statistical inference for dose-finding studies when parameters are shared across treatments. However, they restrict the information sharing to location and scale parameters and even reason that the assumption of equality for the parameters considered here might be too strong.
The alert concentration of interest considered here is the EC50, i.e. the concentration where half of the maximal observed effect is attained. In the situation of gene expression as response data, as considered in this work, the EC50 is to be understood in a relative way. In the chosen four-parametric log-logistic model (see, e.g., [2]) used for the fitting of a parametric curve, this value is directly included as a parameter. Especially for smaller data sets, a parameterization of the EC50 value on log-scale is proposed to better meet the underlying assumption of normality for the parameters. One of the here proposed approaches works via approximating the underlying distribution of the EC50 values for one data set with a mixture of normal distributions. Such an approach has been used before in [8], in a simpler way, where a mixture of two normal distributions was fitted for ED90 (here referring to doses instead of concentrations) values on log-scale, where the response value was given by the biomass of plants after treatment with different doses of a herbicide.
In order to share the information about the parameter estimate of the log EC50 across genes, an empirical Bayes approach, based on the normal-normal model, is used. In brief, the log EC50 is assumed to follow a normal distribution with mean μ, and this parameter μ is also assumed to follow a normal distribution. The posterior, given an observed value of the log EC50 for one gene, is then again given by a normal distribution, where the mean is a weighted mean of the observed value and the mean of the prior distribution. The parameters of the prior distribution for μ are estimated from the data itself, which explains the name ‘empirical Bayes’ [9].
The empirical Bayes framework is well-established in the context of high-dimensional gene data analysis, e.g. using the common limma (linear models for microarray data, [10]) approach. In this approach, a moderated t-test is performed, where the individual variance estimate of a gene is adjusted using combined information of all genes considered. In contrast to the approach proposed here, which aims at improving the estimation of an alert concentration based on a parametric model, the limma approach is used for the calculation of differentially expressed genes between two test conditions. As another example, [11] propose an empirical Bayes approach for differentially expressed genes tailored to time-course data based on microarrays.
Our approach is based on the frequentist approach to fitting the underlying dose-response model. However, Bayesian approaches to fitting dose-response models (even in a hierarchical way) exist, as presented e.g. in [12–14].
This paper is structured as follows: First, the underlying parametric model for the nonlinear curve-fitting and the numerical procedure to perform this fitting are introduced. Then, the empirical Bayes approach for sharing information across genes is proposed, in three versions, based on three different assumptions for estimating the prior distribution. In a controlled simulation study, these three methods were compared to a baseline method without shared information between genes, i.e. where only individual curve-fitting per gene is performed. As target variable, the parameter from the assumed concentration-response relationship denoting the EC50 was used. The quality of the methods was assessed in terms of the mean squared error between the estimated parameter value and the known real, underlying parameter value. Finally, the Bayesian approaches were applied to the real data case study [15], which was also used as basis for the simulation study.
Materials and methods
Statistical methods
A variety of models exists for describing the relationship between the concentration x and the response value y, e.g. the family of log-logistic models, the family of log-normal models, and Weibull models [16, 17]. When assuming a sigmoidal form of the relationship, a popular model is the four parameter log-logistic (4pLL) model. For the concentration x with x ≥ 0 and a parameter vector ϕ ≔ (b, c, d, e) with e ≥ 0, this model is defined as
$$f(x, \phi) = c + \frac{d - c}{1 + \exp\left(b\,(\log(x) - \log(e))\right)}. \tag{1}$$
Often, especially for small data sets, the re-parameterization
$$f(x, \phi) = c + \frac{d - c}{1 + \exp\left(b\,(\log(x) - \tilde{e})\right)}, \quad \tilde{e} = \log(e),$$
is used [2]. The parameters c and d correspond to the lower and upper asymptote of the curve, b is a parameter proportional to the slope of the curve, and e is the concentration at which the half-maximal effect is attained. This concentration is also called the EC50, the effective concentration at which 50% of the maximal observed effect is attained. Since this is a meaningful concentration where a relevant change in gene expression can be observed, this parameter in its logarithmic parameterization, $\tilde{e} = \log(e)$, will be the target estimate in the following.
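As an illustration of the two parameterizations, the following minimal Python sketch evaluates the 4pLL curve. It assumes the common form c + (d − c)/(1 + exp(b(log x − log e))); the function names `four_pll` and `four_pll_log` and the handling of x = 0 via the limit of the curve are choices made for this sketch, not taken from the paper.

```python
import math

def four_pll(x, b, c, d, e):
    """4pLL response at concentration x >= 0, with EC50 parameter e > 0."""
    if x == 0:
        # Limit of the curve for x -> 0 (log(0) is undefined):
        # the exponential term tends to 0 for b > 0 and to infinity for b < 0.
        if b > 0:
            return d
        if b < 0:
            return c
        return c + (d - c) / 2.0
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))

def four_pll_log(x, b, c, d, log_e):
    """Same curve, parameterized by log_e = log(e) (the log EC50)."""
    return four_pll(x, b, c, d, math.exp(log_e))

# At x = e, the response lies halfway between the two asymptotes.
halfway = four_pll(500.0, 2.0, 8.0, 10.0, 500.0)
```

At the EC50 the exponential term equals 1, so the response is the midpoint of c and d, which is the sense in which e describes the half-maximal effect.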
For x1, …, xp the concentration values (equal concentrations are allowed) and y1, …, yp the corresponding observed response values, it is assumed that yi is the observation of a normally distributed random variable Yi with mean f(xi, ϕ) and fixed variance σ2 for i = 1, …, p. The parameters ϕ are estimated via minimizing the following sum of squared errors:
$$\sum_{i=1}^{p} \left(y_i - f(x_i, \phi)\right)^2.$$
This is achieved using a numerical Quasi-Newton method. (1 − α)-confidence intervals for the parameters are obtained in the typical way by calculating
$$\hat{\phi}_i \pm K \cdot \widehat{\mathrm{se}}(\hat{\phi}_i),$$
where ϕi is one of the parameters b, c, d, and $\tilde{e} = \log(e)$, $\widehat{\mathrm{se}}$ denotes the estimated standard error, and K is the (1 − α/2)-quantile of a t-distribution with p − 4 degrees of freedom [2].
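The objective function and the interval formula can be sketched in a few lines of Python. This is only an illustration under assumptions: the 4pLL form is the common log-logistic one, the helper names are hypothetical, the t-quantile K is passed in by the caller (the standard library has no t-distribution), and no Quasi-Newton optimizer is included.

```python
import math

def four_pll(x, b, c, d, e):
    """4pLL curve; x = 0 is handled via the limit of the curve."""
    if x == 0:
        return d if b > 0 else (c if b < 0 else c + (d - c) / 2.0)
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))

def sum_of_squared_errors(params, xs, ys):
    """Objective minimized in the curve fit, for params = (b, c, d, e)."""
    b, c, d, e = params
    return sum((y - four_pll(x, b, c, d, e)) ** 2 for x, y in zip(xs, ys))

def confidence_interval(estimate, std_error, t_quantile):
    """Interval estimate +/- K * se, with K supplied by the caller."""
    return estimate - t_quantile * std_error, estimate + t_quantile * std_error

# Noise-free data generated from the model give a sum of squared errors of 0.
true_params = (2.0, 8.0, 10.0, 500.0)
xs = [0, 25, 150, 350, 450, 550, 800, 1000]
ys = [four_pll(x, *true_params) for x in xs]
sse_at_truth = sum_of_squared_errors(true_params, xs, ys)
```

For a design with p = 27 observations, K would be the 0.975-quantile of a t-distribution with 23 degrees of freedom, which is roughly 2.07.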
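The alert concentration of interest is the log-transformed EC50, i.e. the parameter $\tilde{e} = \log(e)$ in the re-parameterization of the 4pLL model from Eq (1). Only this parameter from the 4pLL model is thus considered for the Bayesian information sharing. The random variable X corresponding to the parameter $\tilde{e}$ is assumed to follow a normal distribution for the Bayesian information-sharing approach:
$$X \mid \mu \sim N(\mu, \sigma^2),$$
where a normal distribution is assumed as prior distribution for parameter μ, specifically
$$\mu \sim N(\mu_0, \tau^2).$$
It follows that the posterior distribution for μ|x, where x is the observed value $\hat{\tilde{e}}$ for one specific gene, is a normal distribution with
$$\mathrm{E}(\mu \mid x) = \frac{\tau^2}{\sigma^2 + \tau^2}\, x + \frac{\sigma^2}{\sigma^2 + \tau^2}\, \mu_0 \quad \text{and} \quad \mathrm{Var}(\mu \mid x) = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}. \tag{2}$$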
Thus, the resulting mean of the posterior distribution is a weighted mean of the original observation x and the prior mean μ0. (1 − α)-credible intervals are obtained via the α/2 and the 1 − α/2 quantiles of the posterior distribution.
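The normal-normal update and the resulting credible interval can be sketched as follows. The function names are hypothetical, and the formulas are the standard conjugate results for a normal likelihood with known variance σ2 and a normal prior with mean μ0 and variance τ2.

```python
from statistics import NormalDist

def normal_normal_posterior(x, sigma2, mu0, tau2):
    """Posterior mean and variance of mu given one observation x."""
    weight = tau2 / (sigma2 + tau2)            # weight of the observation
    post_mean = weight * x + (1.0 - weight) * mu0
    post_var = sigma2 * tau2 / (sigma2 + tau2)
    return post_mean, post_var

def credible_interval(post_mean, post_var, alpha=0.05):
    """Equal-tailed (1 - alpha) interval from the posterior quantiles."""
    posterior = NormalDist(post_mean, post_var ** 0.5)
    return posterior.inv_cdf(alpha / 2), posterior.inv_cdf(1 - alpha / 2)

# Equal variances: the posterior mean lies halfway between x and mu0.
post_mean, post_var = normal_normal_posterior(x=8.0, sigma2=1.0, mu0=6.0, tau2=1.0)
lower, upper = credible_interval(post_mean, post_var)
```

A gene with a large standard error (large σ2) receives a small weight on its own estimate and is pulled strongly towards μ0, while a precisely estimated gene is left almost unchanged.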
An empirical Bayes approach is chosen, where the prior parameters μ0 and τ2 are also estimated from the data. A total number of n genes is assumed to be considered simultaneously, yielding individual estimates $\hat{\tilde{e}}_1, \dots, \hat{\tilde{e}}_n$ of the log EC50 parameter $\tilde{e} = \log(e)$, which then can be used for estimating the prior parameters. Specifically, two approaches for estimating the prior parameters are considered here: In the maximum-likelihood approach (ML-approach), μ0 and τ2 are estimated via the empirical mean and the empirical variance of all estimates $\hat{\tilde{e}}_1, \dots, \hat{\tilde{e}}_n$, respectively. The second approach considers a robust estimation of the prior parameters (robust approach): μ0 is estimated as the median of $\hat{\tilde{e}}_1, \dots, \hat{\tilde{e}}_n$. For the robust estimation of τ2, the median absolute deviation (MAD) of $\hat{\tilde{e}}_1, \dots, \hat{\tilde{e}}_n$ is calculated and multiplied by the factor 1.4826 to ensure consistency for the normal distribution assumed here. The result of this multiplication is then squared. The parameter σ2 is calculated individually for each gene as the squared standard error of its estimate $\hat{\tilde{e}}_j$, j = 1, …, n.
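The two ways of estimating the prior parameters can be sketched as below. The function names are hypothetical, and using the sample variance with denominator n − 1 for the "empirical variance" is an assumption of this sketch.

```python
from statistics import mean, median, variance

def ml_prior(estimates):
    """Prior mean and variance from the empirical mean and variance."""
    # Assumption: sample variance with denominator n - 1.
    return mean(estimates), variance(estimates)

def robust_prior(estimates):
    """Prior mean from the median, prior variance from the scaled MAD."""
    med = median(estimates)
    mad = median([abs(value - med) for value in estimates])
    scaled_mad = 1.4826 * mad      # consistency factor for the normal distribution
    return med, scaled_mad ** 2

# A single extreme estimate inflates the ML prior but hardly affects the robust one.
log_ec50_estimates = [5.0, 6.0, 6.0, 7.0, 16.0]
mu0_ml, tau2_ml = ml_prior(log_ec50_estimates)
mu0_robust, tau2_robust = robust_prior(log_ec50_estimates)
```

In this toy example, the outlier 16 shifts the ML prior mean to 8 and inflates the prior variance, whereas the robust prior stays centered at 6 with a much smaller variance, which in turn implies stronger shrinkage.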
Additionally, a more complex but also more flexible prior is considered: It is assumed that the empirical prior follows a mixing distribution of five normal distributions. For the estimation of the prior parameters, therefore the estimation of a mixing model of the form
$$g(y_j \mid \Psi) = \sum_{i=1}^{5} \lambda_i f_i(y_j \mid \theta_i)$$
with
$$\sum_{i=1}^{5} \lambda_i = 1, \quad \lambda_i \geq 0,$$
is required. Here, yj, j = 1, …, n, denote the observed values and Ψ = (θ1, …, θ5, λ1, …, λ5)⊤ the parameter vector. Each fi denotes the density function of a normal distribution with parameter vector θi, which contains the mean and the variance of the i-th component. It is necessary to estimate both the parameters θi of the individual distributions and the mixing parameters λi. This is achieved by employing the expectation maximization algorithm (EM algorithm), which alternates between the assignment of the observations to the classes, which here are distributions, and the estimation of the parameters of the distribution [18], see [19] for a detailed description of the algorithm.
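The alternation between assignment and estimation can be illustrated with a bare-bones EM loop for a mixture of normal distributions. This is a didactic sketch with a fixed number of iterations and no convergence check or safeguards against empty components; it is not the implementation used for the analyses, and all names are hypothetical.

```python
import math

def normal_pdf(y, mu, sd):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_normal_mixture(ys, start_means, start_sds, n_iter=100):
    """EM for a normal mixture; returns (mixing weights, means, sds)."""
    k = len(start_means)
    lambdas = [1.0 / k] * k                 # equal initial mixing proportions
    means, sds = list(start_means), list(start_sds)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for y in ys:
            dens = [lam * normal_pdf(y, m, s) for lam, m, s in zip(lambdas, means, sds)]
            total = sum(dens)
            resp.append([value / total for value in dens])
        # M-step: weighted updates of mixing weights, means, and sds.
        for i in range(k):
            weight = sum(r[i] for r in resp)
            lambdas[i] = weight / len(ys)
            means[i] = sum(r[i] * y for r, y in zip(resp, ys)) / weight
            var = sum(r[i] * (y - means[i]) ** 2 for r, y in zip(resp, ys)) / weight
            sds[i] = max(math.sqrt(var), 1e-6)   # guard against a collapsing component
    return lambdas, means, sds

# Two well-separated clusters on a deterministic grid, centered at 0 and 10.
ys = [-1 + 0.02 * i for i in range(101)] + [9 + 0.02 * i for i in range(101)]
lambdas, means, sds = em_normal_mixture(ys, start_means=[1.0, 9.0], start_sds=[1.0, 1.0])
```

With real estimates, the result depends on the starting values, which is why these are reported explicitly for the five-component fit.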
With a mixture of 5 normal distributions as prior distribution for μ in the normal-normal model, the posterior is again a mixture of 5 normal distributions. It has a closed form, in which each component is a normal distribution as in Eq (2), and where additionally posterior values for the mixing parameters need to be calculated.
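A minimal sketch of this closed form is given below, using the standard conjugate result for a normal likelihood with a mixture-of-normals prior: each component is updated as in the single-normal case, and each mixing weight is multiplied by the marginal density of the observation under that component and then renormalized. The function names and the argument layout are assumptions of this sketch.

```python
import math

def mixture_posterior(x, sigma2, lambdas, prior_means, prior_vars):
    """Posterior components as a list of (weight, mean, variance) tuples."""
    components = []
    for lam, mu0, tau2 in zip(lambdas, prior_means, prior_vars):
        marginal_var = sigma2 + tau2
        # Marginal density of x under this component: N(mu0, sigma2 + tau2).
        marginal = math.exp(-0.5 * (x - mu0) ** 2 / marginal_var) / math.sqrt(
            2 * math.pi * marginal_var
        )
        post_mean = (tau2 * x + sigma2 * mu0) / marginal_var
        post_var = sigma2 * tau2 / marginal_var
        components.append((lam * marginal, post_mean, post_var))
    total = sum(weight for weight, _, _ in components)
    return [(weight / total, m, v) for weight, m, v in components]

def mixture_mean(components):
    """Mean of the posterior mixture, used as the Bayesian point estimate."""
    return sum(weight * m for weight, m, _ in components)

# Symmetric two-component prior and an observation in the middle.
posterior = mixture_posterior(0.0, 1.0, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
single = mixture_posterior(8.0, 1.0, [1.0], [6.0], [1.0])
```

With a single component, the result reduces to the ordinary normal-normal posterior.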
The three proposed approaches, with respective assumed prior distributions and estimation of the prior parameter values, are summarized in Table 1.
Real data case study
The real data example, which is also the basis for the plasmode simulation study, is a case study that was conducted to investigate the development of human embryonic stem cells (hESC) to neuroectoderm [15]. Cells were treated in vitro with valproic acid (VPA) at seven different concentrations (25, 150, 350, 450, 550, 800, and 1000 μM), where each concentration was assessed in three replicate experiments. Additionally, six replicates for the negative control (untreated) were measured.
The study was carried out within the ESNATS (Embryonic Stem cell-based Novel Alternative Testing Strategies) project, which was funded by the European Commission. ESNATS targeted the prediction of toxicity of drug candidates. Gene expression data was obtained with Affymetrix Microarray technology, using the GeneChip® Human Genome U133 Plus 2.0 [20]. This resulted in measurements of 54675 probe sets for each experiment. Preprocessing was performed with the robust multi array analysis (RMA) algorithm [21], which includes the three steps background correction, normalisation and summarising the data to one value. The same parameters as in the original case study [15] are used for preprocessing.
Simulation study
The empirical Bayes method for information sharing across genes was assessed in a controlled simulation study. In order to include real biological correlation structures in the simulated datasets, a so-called plasmode simulation study was conducted. The basic idea is that, in addition to retaining the true structure of an underlying dataset, the data is manipulated in such a way that true effects are known [22].
The simulation study was based on the VPA dataset from the real data case study. From all 54675 probe sets measured, those fulfilling the following two conditions were selected:
- Statistical significance: When performing a one-way analysis of variance (ANOVA) for each probe set separately, only those are considered further where the unadjusted p-value is smaller than 0.001.
- Biological relevance: The range covered by the expression values needs to be at least log2(1.5) ≈ 0.585, and the direction of the profile needs to be unambiguous. The first constraint means that for at least one concentration, the absolute value of the difference in mean between the expression value for this concentration and the expression value for the control (i.e. the log2-fold change) needs to exceed log2(1.5). The second constraint means that the log2-fold change must not be larger than log2(1.5) for one concentration and, at the same time, smaller than −log2(1.5) for another concentration.
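The biological relevance criterion can be sketched as a small filter function. The data layout (a list of control replicates and a dictionary mapping each concentration to its replicates, all on log2 scale) and the function name are assumptions of this sketch; the ANOVA-based significance criterion is not covered here.

```python
import math
from statistics import mean

THRESHOLD = math.log2(1.5)   # approximately 0.585

def is_biologically_relevant(control, treated):
    """Check effect size and unambiguous direction of a profile.

    control: log2 expression values of the control replicates
    treated: dict mapping concentration -> list of replicate values
    """
    control_mean = mean(control)
    fold_changes = [mean(values) - control_mean for values in treated.values()]
    up = any(lfc > THRESHOLD for lfc in fold_changes)
    down = any(lfc < -THRESHOLD for lfc in fold_changes)
    # Relevant if the threshold is exceeded in exactly one direction.
    return (up or down) and not (up and down)

control = [8.0] * 6
increasing = {25: [8.1] * 3, 1000: [9.0] * 3}
ambiguous = {25: [9.0] * 3, 1000: [7.0] * 3}
flat = {25: [8.1] * 3, 1000: [8.2] * 3}
```

In the toy profiles, only `increasing` passes: `ambiguous` exceeds the threshold in both directions, and `flat` never exceeds it.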
Selecting probe sets according to these criteria yields 7191 probe sets as candidates. These 7191 probe sets measured with Affymetrix technology represent genes. In the following, to avoid mixing the terms probe set and gene, and since the general concept can also be applied to other types of gene expression measurements, we use the term gene also for probe sets.
A 4pLL model was fitted to each gene, resulting in a vector of the four parameters for each gene. These parameters were then used as true, underlying parameters. The following procedure was repeated 1000 times: The individual true 4pLL models, based on the true underlying parameters, were evaluated at the concentrations 0, 25, 150, 350, 450, 550, 800, and 1000, according to the concentrations of the real VPA dataset. Normally distributed noise with mean 0 and standard deviation 0.1 was added in six replicates to the control and in three replicates to all non-control concentrations, yielding a simulated expression dataset with 27 observations for each gene.
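One simulation run for a single gene can be sketched as follows. The 4pLL form, the helper names, and the use of Python's `random.Random` as noise generator are assumptions of this sketch; the design (six control replicates, three replicates per non-control concentration, noise standard deviation 0.1) follows the description above.

```python
import math
import random

CONCENTRATIONS = [0, 25, 150, 350, 450, 550, 800, 1000]

def four_pll(x, b, c, d, e):
    """4pLL curve; x = 0 is handled via the limit of the curve."""
    if x == 0:
        return d if b > 0 else (c if b < 0 else c + (d - c) / 2.0)
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))

def simulate_gene(true_params, rng, noise_sd=0.1):
    """Simulate one expression profile with 27 observations."""
    b, c, d, e = true_params
    xs, ys = [], []
    for conc in CONCENTRATIONS:
        n_replicates = 6 if conc == 0 else 3
        for _ in range(n_replicates):
            xs.append(conc)
            ys.append(four_pll(conc, b, c, d, e) + rng.gauss(0.0, noise_sd))
    return xs, ys

rng = random.Random(2023)     # fixed seed for reproducibility of the sketch
xs, ys = simulate_gene((2.0, 8.0, 10.0, 500.0), rng)
```

Repeating this for every gene and every run, and refitting the model each time, yields the simulated estimates that are compared with the known true parameters.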
For each of these simulated datasets, again a 4pLL model was fitted. The corresponding fitted parameter was considered as the direct estimate of parameter $\tilde{e} = \log(e)$ for each gene, respectively. For each simulated gene, the three approaches of the Bayes procedure (ML estimation, robust estimation, and mixing estimation, see Table 1) were applied, yielding three posterior distributions for each gene. The means of the respective posterior distributions were then considered as the Bayesian estimates of parameter $\tilde{e}$, corresponding to the three approaches.
Software
All analyses were performed in the statistical programming language R, version 4.1.2 [23]. For fitting dose-response models, the package drc, version 3.0–1, [24] was used. The Bayesian analyses were conducted using the package LearnBayes, version 2.15.1, [25], and mixing distributions were estimated using the package mixtools, version 2.0.0, [26]. For graphical display, the package ggplot2, version 3.4.0, [27] was used.
The R code and the data needed for reproducing the simulation study are available via the GitHub repository https://github.com/FKappenberg/Paper-InformationSharingAcrossGenes.
Results
Descriptive analysis of the VPA dataset
First, 4pLL models were fitted to the 7191 original genes. Histograms of the resulting parameter estimates are shown in S1 Fig. Estimates for the parameter $\tilde{e} = \log(e)$, together with a normal distribution fitted to these values, are of particular interest with respect to the following analyses of the Bayes method. The estimated normal distributions, once based on the maximum-likelihood (ML) estimation via empirical mean and empirical variance (approach 1) and once based on the robust estimation via empirical median and empirical MAD (approach 2), together with a histogram of the parameter values of $\tilde{e}$, are shown in Fig 1(A). The ML estimation yields a far larger variance of the density function, with relatively heavy tails, while the robust estimation is narrower and thus has higher density values in the middle range of the curve. However, neither curve fits the data particularly well.
Different distributions are fitted to the set of estimates for parameter $\tilde{e} = \log(e)$ for the 7191 genes. (A) A univariate normal distribution is fitted, once estimating mean and variance (red curve), and once estimating the robust counterparts median and MAD (blue curve). (B) A mixture of 5 normal distributions is fitted with the EM algorithm, the curve shows the resulting (mixture) density function.
Via the EM algorithm, a mixture of 5 normal distributions (approach 3) was fitted to the values of parameter $\tilde{e} = \log(e)$ for the 7191 selected genes. Equal initial values for the mixing proportions were used, with starting values for the means given by the vector (6.2, 5, 8, 12, 8) and for the standard deviations by the corresponding vector (0.3, 0.5, 0.5, 1, 0.3). Using exactly five normal distributions is motivated as follows. One distribution is used for modelling the middle range of the distribution of $\tilde{e}$, two distributions are responsible for the heavy tails, respectively, and the remaining two distributions can represent any artefacts in high and low values that may be observed.
The resulting parameter values are summarized in Table 2. A visual display of the resulting density function is given in Fig 1(B), where an overall good fit of the mixture normal distribution to the histogram of the estimates of $\tilde{e}$ can be observed, a clear improvement compared to the other two approaches. The individual density functions are displayed in S2 Fig.
Results of the simulation study
The simulation study was conducted as described above. Due to numerical problems, sometimes no 4pLL model could be fitted to a simulated expression data set, or missing values were obtained in the Bayes method for the mixture normal distributions due to non-convergence of the EM algorithm. For the analyses, only those genes were considered for which missing values occurred in at most 200 out of the 1000 simulation runs, leaving 6891 genes in the analysis.
In order to compare the results from the Bayes approaches to the direct estimation of parameter $\tilde{e} = \log(e)$, mean squared errors (MSE) across the simulation runs were calculated. For this, the respective direct or Bayesian estimates were compared to the true underlying parameter $\tilde{e}$ used for the simulation. For each gene, only those simulation runs were considered in which an estimate was obtained for the respective compared method.
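The handling of missing values in the MSE calculation can be sketched as below. Representing a failed run by `None` and the function names are assumptions of this sketch; the limit of 200 missing values per gene is the one stated above.

```python
def keep_gene(estimates, max_missing=200):
    """Keep a gene if at most max_missing runs yielded no estimate."""
    n_missing = sum(1 for est in estimates if est is None)
    return n_missing <= max_missing

def mse_over_runs(estimates, true_value):
    """MSE over the runs in which an estimate was obtained."""
    valid = [est for est in estimates if est is not None]
    if not valid:
        return None
    return sum((est - true_value) ** 2 for est in valid) / len(valid)

# Three runs for one gene, one of which failed; the true log EC50 is 7.
example_mse = mse_over_runs([6.0, None, 8.0], true_value=7.0)
```

Since the set of valid runs can differ between methods, the MSEs of two methods for the same gene are not necessarily based on exactly the same simulated datasets.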
The resulting MSEs for the comparison of the direct estimation and the Bayes approach based on ML estimation are shown in Fig 2(A). Each point represents one gene, and the red diagonal line represents the case where the MSEs for both methods are equal. Points are colored in orange, if both MSEs are smaller than 0.1, i.e. the MSEs are negligibly small. Points are colored in green, if the MSE based on the new Bayes approach is smaller than the MSE based on the direct approach by a factor of at least 1.1, and colored in black in the opposite case. The remaining points, i.e. when none of the approaches performs notably better than the other, are colored in blue.
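The coloring rule can be written as a small classification function. The function name and the order in which the conditions are checked (the "both MSEs small" category first) are assumptions of this sketch; the thresholds 0.1 and 1.1 are those given above.

```python
def classify_gene(mse_direct, mse_bayes, small=0.1, factor=1.1):
    """Assign the color category used in the MSE comparison plots."""
    if mse_direct < small and mse_bayes < small:
        return "orange"                      # both MSEs negligibly small
    if mse_bayes * factor <= mse_direct:
        return "green"                       # Bayes approach notably better
    if mse_direct * factor <= mse_bayes:
        return "black"                       # direct approach notably better
    return "blue"                            # comparable performance

categories = [
    classify_gene(0.05, 0.02),
    classify_gene(1.0, 0.5),
    classify_gene(0.5, 1.0),
    classify_gene(1.0, 1.05),
]
```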
Points represent genes, and they are colored according to the comparison of the performance of the two approaches (orange: MSEs very small, blue: MSEs comparable, green: MSE smaller for Bayes approach, black: MSE smaller for direct approach). The underlying parameter values are colored in the same way.
To understand which factors influence the result of the MSE comparison, in Fig 2(B) the underlying, shape-defining parameters b and $\tilde{e} = \log(e)$ of the 4pLL model used for the simulation are colored according to the results of the MSE comparison. The Bayes method performs worse than the direct method for genes with a comparatively large value of parameter $\tilde{e}$ (black points), and better for genes with a value of parameter b close to zero, i.e. for genes with a rather flat slope (green points).
Corresponding results for the robust estimation of the prior distribution, and for the flexible estimation of the prior distribution with a mixture model are shown in S3 and S4 Figs.
The effect of the three Bayes approaches in comparison to the direct approach is quantified by the number of genes with low, better, similar, or worse MSE results, see Table 3. The ML approach yields a slightly larger number of improvements, compared to the robust and the mixing distribution approach, whose results are very similar to each other.
The parameters for the corresponding prior distributions in all 1000 simulation runs are shown in S5 Fig for the ML and the robust approach and in S6 Fig for the mixing distribution approach. Briefly, as observed for the original data set, the ML estimates are larger than the corresponding robust estimates, both for the mean value and for the standard deviation.
To assess the strength of the effect of the Bayes procedures, MA-type plots for all three different versions are shown in Fig 3 ([28], adapted from [29]). In these plots, on the x-axis, the product of the resulting MSEs for the direct estimation and the respective Bayes approach is displayed, and on the y-axis, the ratio of these MSEs is plotted. Both the product and the ratio are shown on log-scale. Colors are obtained from the comparison of the MSEs, i.e. as in Table 3. An MSE improved by the Bayes approach corresponds to a negative value of the log-ratio, and more extreme values of the log-ratio correspond to comparatively far better results.
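The coordinates of one gene in such an MA-type plot can be computed as in the following sketch. The use of the natural logarithm, the orientation of the ratio (Bayes MSE divided by direct MSE, so that improvements are negative), and the function name are assumptions of this sketch.

```python
import math

def ma_coordinates(mse_direct, mse_bayes):
    """Return (log product, log ratio) of the two MSEs for one gene."""
    log_product = math.log(mse_direct * mse_bayes)
    log_ratio = math.log(mse_bayes / mse_direct)   # negative if Bayes improves
    return log_product, log_ratio

# The Bayes MSE is a quarter of the direct MSE in this toy example.
log_product, log_ratio = ma_coordinates(mse_direct=2.0, mse_bayes=0.5)
```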
On the x-axis, the product of the resulting MSEs for the direct estimation and the respective Bayes approach is displayed, and on the y-axis, the ratio of these MSEs is plotted, both on log-scale. Colors are obtained from the comparison of the MSEs, i.e. as in Table 3.
Above it was reported (Table 3) that the ML-approach leads to a larger number of genes with improved MSE. However, the plots in Fig 3 demonstrate that for many genes the robust and the mixing distribution approach lead to a more extreme improvement. Especially for the robust Bayes approach, the ratio of the MSEs becomes very small for some genes, indicating that in terms of the MSE, the Bayes approach yields a very strong improvement.
In the data set analyzed here, the maximum concentration value for which gene expression values were measured is 1000. Since log(1000) ≈ 6.91, all values of $\tilde{e} = \log(e)$ that are larger than 6.91 correspond to curves where the inflection point is estimated at a larger concentration than the maximum tested concentration. This indicates an overall infeasible and unreasonable curve fit; thus, these cases should be interpreted with caution anyway.
In Fig 4, the same MA-type plots as before are shown, but now restricted to those genes where the true underlying value of $\tilde{e} = \log(e)$ is smaller than 6.91. The Bayes procedure, however, is still based on the entire set of genes.
In comparison to Fig 3, here only genes with true underlying value of parameter $\tilde{e} = \log(e)$ smaller than 6.91 are considered.
Both from the plots of the distribution of the true underlying parameter (Fig 2, S3 and S4 Figs) and from the restricted MA-type plots, it can be seen that not considering genes with an unreasonably high value of $\tilde{e} = \log(e)$ leads to far fewer black and blue dots in the plot. This means that the number of genes for which the estimation of $\tilde{e}$ deteriorates is clearly reduced. However, some genes that previously showed an improvement in the Bayes method are now also no longer considered, but this applies mostly to genes with only small to moderate improvement, i.e. with a value of the log-ratio close to 0.
Next, briefly, coverage probabilities (CP) of the confidence and credible intervals for parameter $\tilde{e} = \log(e)$ are compared between the direct estimation and the Bayes approach with ML estimation. Fig 5(A) shows a histogram of the CPs for the direct estimation. A confidence level of 0.95 was used for calculating the intervals, but it turns out that for the direct estimation the CPs are as low as 0.6. Fig 5(B) shows a scatterplot comparing these CPs with those obtained with the Bayes ML approach. Only for a small subset of genes (colored black) are the CPs for the Bayes ML approach considerably lower. For those genes, the MSE was higher for the Bayes ML approach. However, very similar CP values can be observed for those genes for which the MSE was very small in both approaches (orange) or even smaller in the Bayes approach (green). Thus, an improvement of MSE does not come at a cost of lower CP.
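The coverage probability for one gene is the proportion of simulation runs in which the interval contains the true parameter value. A minimal sketch, assuming that intervals are stored as (lower, upper) tuples and failed runs as `None` (both hypothetical conventions):

```python
def coverage_probability(intervals, true_value):
    """Proportion of valid intervals that contain the true value."""
    valid = [interval for interval in intervals if interval is not None]
    if not valid:
        return None
    hits = sum(1 for lower, upper in valid if lower <= true_value <= upper)
    return hits / len(valid)

# Three valid runs, two of which cover the true log EC50 of 7.
example_cp = coverage_probability(
    [(5.0, 7.5), (6.5, 8.0), (7.2, 9.0), None], true_value=7.0
)
```

For a well-calibrated 95% interval, this proportion should be close to 0.95 across a large number of runs.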
(A) Histogram of CPs for direct estimation, the vertical red line indicates the confidence level 0.95. (B) Comparison of CPs for direct and ML Bayes estimation. Colors of points are the same as in Fig 2 (orange: MSEs very small, blue: MSEs comparable, green: MSE smaller for Bayes approach, black: MSE smaller for direct approach).
Application
The four methods for parameter estimation in 4pLL models (direct, ML (Bayes), robust (Bayes) and mixing distribution (Bayes)) were applied directly to the data from the real case study to compare the resulting estimates. Fig 6 displays scatterplots of the estimates for parameter $\tilde{e} = \log(e)$, comparing the three Bayes approaches against the direct approach, respectively. The blue horizontal line indicates the mean of the prior normal distribution, estimated via the mean (ML estimate, value 6.895) or the median (robust estimate, value 6.383) of all direct estimates, respectively. Since the mixed prior is based on five normal distributions, indicating one overall mean would not be meaningful in this case.
The dashed blue lines indicate the mean and median, respectively, used for the specification of the prior normal distribution.
The shrinkage of the direct estimates towards the prior mean values is clearly visible for all three approaches. Shrinkage is overall stronger for the robust and the mixed estimation than for the ML estimation. Larger values tend to be shrunk more than smaller values, indicating a generally larger uncertainty in the estimation of parameter $\tilde{e} = \log(e)$ when this value is large.
Discussion and conclusion
The calculation of an alert concentration as a part of general concentration-response analyses is an important aspect in toxicological research. Alert concentrations include observation-based concepts such as the LOEC (lowest observed effect concentration) or the NOEC (no observed effect concentration) [1], and model-based concepts such as ED-values [2] or benchmark doses [5]. Especially when considering high-throughput gene expression experiments, many (often thousands of) dose-response curves are considered simultaneously, and some similarity between the resulting alert concentrations is biologically plausible.
Thus, in this paper, a method is proposed to share information across genes in order to improve the estimation of the EC50, i.e. the concentration where half of the maximal effect is obtained, as alert concentration. The method is based on an empirical Bayes approach, where the estimate for the logarithmic EC50 is assumed to follow a normal distribution with mean μ which is assumed to follow a normal distribution as well. Parameters of this prior distribution are directly estimated from the data, either using the empirical mean and empirical variance, the median and the MAD, or an approach via a mixture of 5 normal distributions. For these modelling approaches, the posterior given an observed value for the logarithmic EC50 again follows a normal distribution with a mean value that is essentially a weighted mean of the prior mean and the observed value.
Results of a controlled plasmode simulation study, based on data from a real data case study [15], showed that the estimate of the log EC50 is improved in terms of MSE for a notable number of genes, for each of the three approaches to calculate the prior. The maximum likelihood prior leads to the largest number of improvements, while the individual improvements are generally larger for the other two approaches. This does not come at a cost of lower coverage probabilities for the corresponding confidence and credible intervals. When excluding genes from the analysis for which an initial fit with the chosen log-logistic model does not lead to a plausible result, the ratio of genes with clear improvement is further increased.
In this work, only the functional form of a log-logistic model was considered to describe the relationship between concentration and response. In [30] it is shown, based on the same data set as considered here, that a two-step multiple comparison and model selection procedure often chooses models other than the log-logistic model, such as the linear model and the non-monotone Beta model. Some of these models directly include the EC50 as a parameter; for others, this alert concentration needs to be derived analytically or even numerically. In principle, however, the approach for information sharing proposed here can be extended to other functional relationships in a straightforward way. Instead of selecting one specific model, model averaging approaches can lead to more accurate estimates [31]. In addition, it is possible to consider other alert concentrations that are not directly included as parameters in the model, such as other ED-values, the BMD or the LEC.
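As an illustration of the numerical route mentioned above, the following Python sketch finds the concentration at which a fitted curve reaches a given target response by bisection on the log-concentration scale. It is a hedged example, not part of the original analysis: the function names are hypothetical, the four-parameter log-logistic curve is written in the parametrisation commonly used by the R package drc (assumed here), and the parameter values are made up.

```python
import math


def four_pll(x, b, c, d, e):
    """4pLL curve (assumed drc-style parametrisation).

    c and d are the two asymptotes, e is the EC50 and b the slope parameter.
    Requires x > 0.
    """
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))


def solve_concentration(curve, target, lo, hi, tol=1e-10, max_iter=200):
    """Find x in [lo, hi] with curve(x) == target by bisection on log(x).

    The curve must cross the target exactly once between lo and hi, so that
    curve(lo) - target and curve(hi) - target have opposite signs.
    """
    f_lo = curve(lo) - target
    f_hi = curve(hi) - target
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("target response is not bracketed by [lo, hi]")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(max_iter):
        log_mid = 0.5 * (log_lo + log_hi)
        f_mid = curve(math.exp(log_mid)) - target
        if f_mid == 0 or (log_hi - log_lo) < tol:
            return math.exp(log_mid)
        if f_lo * f_mid < 0:
            log_hi = log_mid          # crossing lies in the lower half
        else:
            log_lo, f_lo = log_mid, f_mid  # crossing lies in the upper half
    return math.exp(0.5 * (log_lo + log_hi))


# Hypothetical fitted curve; half of the maximal effect is (c + d) / 2.
def fitted(x):
    return four_pll(x, b=2.0, c=1.0, d=5.0, e=30.0)


ec50_numeric = solve_concentration(fitted, target=3.0, lo=0.1, hi=1000.0)
```

For the 4pLL curve the numerical solution simply recovers the parameter e, which makes it a convenient check; for models without an explicit EC50 parameter, the same routine could be applied to the fitted curve with a suitably defined target response.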
By assuming normally distributed response data when fitting a parametric model via the explained methodology, we implicitly restricted the method to appropriately pre-processed microarray data. However, the popular RNA-seq and TempO-Seq technologies lead to counts as outcomes. These are commonly assumed to follow a negative binomial distribution, as in the R package DESeq2 [32], which is used for determining differentially expressed genes. Since fitting concentration-response curves for count data is also possible, e.g. using the R package drc [2], an extension of the approach proposed here to other types of data is possible.
One further possible extension of our approach is given by directly incorporating the sharing of information between genes in a hierarchical Bayesian model, thus avoiding the two-step procedure. However, the approach proposed here benefits from its intuitive interpretability and the easy implementation using standard packages.
Supporting information
S1 Fig. Parameter estimates for the VPA dataset.
Estimates of the four parameters of the 4pLL model, fitted to the 7191 genes selected from the VPA dataset, are shown by histograms.
https://doi.org/10.1371/journal.pone.0293180.s001
(PDF)
S2 Fig. Mixture of 5 normal distributions fitted to the values of the log EC50 parameter.
The corresponding parameter estimates are summarized in Table 2, where the first component corresponds to the red curve, the second component to the green curve, the third component to the blue curve, the fourth component corresponds to the turquoise curve, and the fifth component corresponds to the purple curve.
https://doi.org/10.1371/journal.pone.0293180.s002
(PDF)
S3 Fig. MSE for the direct estimation and the Bayes approach based on robust estimation (A), together with the true underlying parameter b and the true underlying log EC50 (B).
The resulting values of the MSE are colored according to the comparative performance of the two approaches. The underlying parameter values are colored in the same way.
https://doi.org/10.1371/journal.pone.0293180.s003
(PDF)
S4 Fig. MSE for the direct estimation and the Bayes approach based on the mixing distribution as prior (A), together with the true underlying parameter b and the true underlying log EC50 (B).
The resulting values of the MSE are colored according to the comparative performance of the two approaches. The underlying parameter values are colored in the same way.
https://doi.org/10.1371/journal.pone.0293180.s004
(PDF)
S5 Fig. Parameters from the prior normal distribution, estimated in an empirical way directly from the direct estimates for the log EC50 parameter in each simulation run separately.
The two histograms on the left (A) show the values for the prior mean of the normal distribution, the two histograms on the right (B) show the values for the prior standard deviation. The parameters are estimated via ML estimation (top) or via robust estimation (bottom).
https://doi.org/10.1371/journal.pone.0293180.s005
(PDF)
S6 Fig. Parameters from the mixed prior normal distribution, estimated in an empirical way directly from the direct estimates for the log EC50 parameter in each simulation run separately.
The five rows show the individual mixture components, and the columns the mixing parameter λ (left), the prior mean (middle) and the prior standard deviation (right) of the respective mixture component.
https://doi.org/10.1371/journal.pone.0293180.s006
(PDF)
Acknowledgments
The authors would like to thank Katja Ickstadt for the helpful discussion about the Bayesian approaches.
References
- 1. Delignette-Muller M.L., Forfait C., Billoir E., Charles S. A new perspective on the Dunnett procedure: Filling the gap between NOEC/LOEC and ECx concepts. Environ. Toxicol. Chem.; 2011, 30(12):2888–2891. pmid:21932292
- 2. Ritz C., Jensen S. M., Gerhard D., Streibig J. C. Dose-Response Analysis Using R. CRC Press; 2019
- 3. Kappenberg F., Grinberg M., Jiang X., Kopp-Schneider A., Hengstler J. G., Rahnenführer J. Comparison of observation-based and model-based identification of alert concentrations from concentration–expression data. Bioinformatics; 2021, 37(14): 1990–1996. pmid:33515236
- 4. Möllenhoff K., Schorning K., Kappenberg F. Identifying alert concentrations using a model-based bootstrap approach. Biometrics; forthcoming pmid:36385693
- 5. Jensen S.M., Kluxen F.M., Ritz C. A Review of Recent Advances in Benchmark Dose Methodology. Risk Anal.; 2019, 39(19):2295–2315. pmid:31046141
- 6. Bushel P.R., Paules R.S., Auerbach S.S. A Comparison of the TempO-Seq S1500+ Platform to RNA-Seq and Microarray Using Rat Liver Mode of Action Samples. Front. Genet.; 2018, 9:485. pmid:30420870
- 7. Feller C., Schorning K., Dette H., Bermann G., Bornkamp B. Optimal Designs for Dose Response Curves with Common Parameters. Ann. Stat.; 2017, 45(5): 2102–2132.
- 8. Altop E.K., Mennan H., Streibig J.C., Budak U., Ritz C. Detecting ALS and ACCase herbicide tolerant accession of Echinochloa oryzoides (Ard.) Fritsch. in rice (Oryza sativa L.) fields. Crop Prot.; 2014, 65: 202–206
- 9. Casella G. An Introduction to Empirical Bayes Data Analysis. Am Stat; 1985, 39(2): 83–87.
- 10. Ritchie M.E., Phipson B., Wu D., Hu Y., Law C.W., Shi W., et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res.; 2015, 43(7):e47–e47. pmid:25605792
- 11. Aryee M.J., Gutiérrez-Pabello J.A., Kramnik I., Maiti T., Quackenbush J. An improved empirical bayes approach to estimating differential gene expression in microarray time-course data: BETR (Bayesian Estimation of Temporal Regulation). BMC Bioinform; 2009, 10:409 pmid:20003283
- 12. Hennessey V.G., Rosner G.L., Bast R.C. Jr., Chen M.Y. A Bayesian approach to dose-response assessment and synergy and its application to in vitro dose-response studies. Biometrics; 2010, 66(4): 1275–83. pmid:20337630
- 13. Wheeler M.W., Blessinger T., Shao K., Allen B.C., Olszyk L., Davis J.A., et al. Quantitative Risk Assessment: Developing a Bayesian Approach to Dichotomous Dose–Response Uncertainty. Risk Anal.; 2020, 40: 1706–1722 pmid:32602232
- 14. Wheeler M. W., Cortiñas Abrahantes J., Aerts M., Gift J. S., Allen Davis J. Continuous model averaging for benchmark dose analysis: Averaging over distributional forms. Environmetrics; 2020, e2728
- 15. Krug AK, Kolde R, Gaspar JA, Rempel E, Balmer NV, Meganathan K, et al. Human embryonic stem cell-derived test systems for developmental neurotoxicity: a transcriptomics approach. Arch Toxicol. 2013 Jan;87(1):123–43. pmid:23179753
- 16. Holland-Letz T., Kopp-Schneider A. Optimal experimental designs for dose-response studies with continuous endpoints. Arch. Toxicol. 2015 Nov; 89(11): 2059–2068 pmid:25155192
- 17. Ritz C. Toward a unified approach to dose-response modeling in ecotoxicology. Environ. Toxicol. Chem. 2010 Jan; 29(1): 220–229 pmid:20821438
- 18. Dempster A.P., Laird N.M., Rubin D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Series B Stat Methodol. 1977; 39(1):1–38
- 19. McLachlan G.J., Do K., Ambroise C. Analyzing Microarray Gene Expression Data. New Jersey: Wiley; 2004
- 20. Affymetrix. Design and Performance of the GeneChip® Human Genome U133 Plus 2.0 and Human Genome U133A 2.0 Arrays. Technical Report, rev 2.0 edition. 2003
- 21. Irizarry R.A., Bolstad B.M., Collin F., Cope L.M., Hobbs B., Speed T.P. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003; 31(4):e15 pmid:12582260
- 22. Vaughan L.K., Divers J., Padilla M., Redden D.T., Tiwari H.K, Pomp D., et al. The use of plasmodes as a supplement to simulations: A simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal. 2009 Mar; 53(5):1755–1766 pmid:20161321
- 23. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/; 2021
- 24. Ritz C., Baty F., Streibig J. C., Gerhard D. Dose-Response Analysis Using R. PLOS ONE; 2015, 10(12): e0146021 pmid:26717316
- 25. Albert J. LearnBayes: Functions for Learning Bayesian Inference. 2018. R package version 2.15.1. https://CRAN.R-project.org/package=LearnBayes
- 26. Benaglia T., Chauveau D., Hunter D.R., Young D. mixtools: An R Package for Analyzing Finite Mixture Models. J. Stat. Softw. 2009; 32(6): 1–29, http://www.jstatsoft.org/v32/i06/
- 27. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag; 2016
- 28. Dudoit S., Yang Y.H., Callow M.J., Speed T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin.; 2002, 12:111–139
- 29. Altman D. G., Bland J. M. Measurement in Medicine: The Analysis of Method Comparison Studies. Statistician; 1983, 32(3): 307–317.
- 30. Duda J. C., Kappenberg F., Rahnenführer J. Model selection characteristics when using MCP-Mod for dose–response gene expression data. Biom. J.; 2022, 64(5), 883–897. pmid:35187701
- 31. Schorning K., Bornkamp B., Bretz F., Dette H. Model selection versus model averaging in dose finding studies. Stat Med; 2016, 35(22):4021–4040 pmid:27226147
- 32. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol.; 2014, 15(12):550 pmid:25516281