Accounting for measurement error to assess the effect of air pollution on omic signals

Studies on the effects of air pollution and more generally environmental exposures on health require measurements of pollutants, which are affected by measurement error. This is a cause of bias in the estimation of parameters relevant to the study and can lead to inaccurate conclusions when evaluating associations among pollutants, disease risk and biomarkers. Although the presence of measurement error in such studies has been recognized as a potential problem, it is rarely considered in applications and practical solutions are still lacking. In this work, we formulate Bayesian measurement error models and apply them to study the link between air pollution and omic signals. The data we use stem from the “Oxford Street II Study”, a randomized crossover trial in which 60 volunteers walked for two hours in a traffic-free area (Hyde Park) and in a busy shopping street (Oxford Street) of London. Metabolomic measurements were made in each individual as well as air pollution measurements, in order to investigate the association between short-term exposure to traffic related air pollution and perturbation of metabolic pathways. We implemented error-corrected models in a classical framework and used the flexibility of Bayesian hierarchical models to account for dependencies among omic signals, as well as among different pollutants. Models were implemented using traditional Markov Chain Monte Carlo (MCMC) simulative methods as well as integrated Laplace approximation. The inclusion of a classical measurement error term resulted in variable estimates of the association between omic signals and traffic related air pollution measurements, where the direction of the bias was not predictable a priori. The models were successful in including and accounting for different correlation structures, both among omic signals and among different pollutant exposures. In general, more associations were identified when the correlation among omics and among pollutants were modeled, and their number increased when a measurement error term was additionally included in the multivariate models (particularly for the associations between metabolomics and NO2).


Introduction
Health effects of air pollution are a major public health issue and have received increasing attention over the past decades [1,2,3]. In this context, the reliable estimation of risk factors and associations between environmental exposures and health conditions requires the collection of a large amount of exposure data on a relatively high number of study subjects, which is often impractical and subject to several sources of error or imprecision. This can lead not only to the presence of bias in the estimation of parameters relevant to the study but also to inaccurate conclusions when evaluating associations among pollutants, disease risk and biomarkers. Although the presence of measurement error in such studies has been discussed in the recent literature and is now recognized as a potential problem [4,5], it is often not accounted for in standard analyses, as pointed out in [6,7]. Different approaches to the problem have been adopted and different methods and techniques are available in the literature, for instance [8] suggested a semi-parametric approach to different types of error in radiation data, [9] proposed a Bayesian hierarchical model to retrieve error-free estimates of the health effect, [10] worked on a quantification of measurement error effect via validation studies, and [11] proposed a method to account for the error in time-series studies on air pollution. Studies focusing on the effect of traffic-related air pollution (TRAP) are particularly challenging on this matter and often rely on surrogate measures of pollutants, as well as on the approximation of personal exposures. [12] identified three main sources of error in the TRAP exposure assessment, namely a) the difference between measured and true ambient exposure levels, b) the difference between aggregate personal exposure and the exposure of a given individual, which are mostly due to approximation and classified as Berkson errors, and c) the difference between ambient concentration and average personal exposure, classified as a classical error. In this context, different methods have been proposed and a wide set of measurement error techniques have been employed, including among others the simulation extrapolation algorithm [13], regression calibration and validation data [14] and Bayesian methods [9,15]. On the other hand, none of these studies was applied to omics data, as they focused on estimating disease risks and did not include any molecular data.
In the present study, we propose to apply measurement error techniques to correct for error in environmental exposures when considering their association with high-throughput molecular data. This is particularly challenging due to the high dimensionality of the data, as well as to the correlation among omics sampled from the same individual. We use a Bayesian framework to address the problem, which provides a very flexible way to account for measurement error and model different error types and dependency structures in the data. In particular, Bayesian hierarchical models seem ideal in these context, as they provide a straightforward way to include dependency between exposures, but also between different response variables. Moreover, the possibility to include prior knowledge on the error components can result in better models and more accurate estimations. Additionally, the possibility of modelling several fixed and random effects, as well as different link functions, adds flexibility and general applicability to the methods.
In this paper, we apply this approach to the Oxford Street II study, a randomized crossover trial where omics and air pollution measurements are employed to investigate the association between short-term exposure to traffic related air pollution and perturbation of different omic signals [16,17]. We implement error-corrected models in a classical measurement error framework and generalize such models to account for dependencies among pollutants, as well as among response omic variables. This provides a novel way of dealing with high-dimension omic data, by including them into a Bayesian hierarchical formulation. The possibility to model more omic signals at the same time also allows to account for dependency among signals. Moreover, the inclusion of a measurement error term, which is straightforward and flexible thanks to the hierarchical formulation, has not been proposed so far in the presence of high-throughput biological data. We implement our models using Monte Carlo Markov Chain (MCMC) in JAGS, but to increase the speed of the computation, we also use the integrated nested Laplace approximation approach (INLA) [18], which has recently been used to implement measurement error models, for example in [19] and [20].
The remainder of this paper is structured as follows: we first describe the study and the model to assess the association between different air pollutants and omic measurements, focusing on metabolic pathways. The paper then illustrates the Bayesian hierarchical model we formulate to account for measurement error by including a classical error (see Section 3 for definition and theoretical consideration on classical measurement error). We expand such model to a multi-response model, accounting for a dependency structure among different omic signals, and to a multi-variate model to account for dependency among different pollutants. We then show the results based on the data set from the Oxford Street II study and finally conclude with several discussion points and potential expansion of the proposed method.

The study
The data we use here stem from the Oxford Street II Study, a randomized crossover trial within the EXPOsOMICS consortium [21]. In this study, 60 volunteers walked for two hours in a traffic-free area (Hyde Park) and in a busy shopping street (Oxford Street) of London. The walking experiments were performed on non-rainy weekdays only, from November to March, to avoid confounding from rain or pollen. Participants were divided into three groups: 1) healthy volunteers (n = 20) with a normal lung function and without a history of ischemic heart disease (IHD); 2) subjects with chronic obstructive pulmonary disease (COPD) (n = 20), without a history of IHD; and 3) subjects with clinically stable IHD over the past six months (n = 20) without COPD. Healthy participants were recruited using advertising in public areas within the Royal Brompton Hospital. Individuals with COPD or IHD were recruited from existing databases or outpatient respiratory and cardiology clinics at the Royal Brompton and Harefield NHS Foundation Trust. All current smokers or former smoker for less than 12 months were excluded, as well as people with high occupational levels of TRAP. Inclusion criteria based on age and forced vital capacity were also applied (see [16]). Information on age, sex, body mass index (BMI), blood pressure, distance walked, diet and medication use was collected for each participant.
For each individual and each exposure session, three blood samples were collected: two hours before walking, two hours after walking and 24 hours after walking; at the same time TRAP measurements were taken, namely on nitrogen dioxide (NO 2 ), particulate matters PM 10 and PM 2.5 and black carbon (CBLK). Such measurements are likely to suffer from classical measurement error, as they were collected using a portable size-selective airborne particle sampler and therefore rely on each participant's precision when performing and reporting the experiment, besides being potentially subject to the sources of error in TRAP identified by [12] and reported above. Moreover, as reported in [16], "NO 2 concentrations were taken from a stationary monitoring site on Oxford Street repeatedly passed during walks on Oxford Street. Because no monitoring was available in Hyde Park, NO 2 concentrations were taken from the nearest representative location sited in a school playground", which suggests the possible presence of a combination of the two error types. Although we do not consider the presence of Berkson error for the scope of this manuscript, the proposed models can easily be generalized to such a case. Finally, real-time measurements of noise, temperature and relative humidity were obtained at each exposure session. On such samples, different untargeted omic analyses were performed, including metabolomic analyses, on which we focus here. A more detailed description of the study is given in [16], where it is also reported that the study was approved by the UK National Research Ethics Service and that informed written consent was obtained from all participants. [17] conducted analyses to assess the short term metabolomic changes due to exposure to TRAP and identified various associations between air pollution concentrations and levels of metabolites in blood.

The model
The association between metabolite levels and TRAP exposures was assessed in a mixed model framework, using a Bayesian approach and including random effects for the individual, as well as for the location and time point of each measurements. Fixed effects were sex, age, BMI and health group (defined as a categorical variable, with healthy, COPD and IHD as levels), as well as average air pollution concentrations one year before the experiment, used as background or long-term exposure, and instantaneous measurements of the exposure of interest. The four exposures reported above (CBLK, NO 2 , PM 25 and PM 10 ) were considered separately.
The model was formulated as follows: with m ij being the metabolic feature of individual i at measurement j, Z the exposure, X including sex, age, BMI and health group, as well as the annual measurement of the exposure of interest and � ij � Nð0; s 2 e Þ. Random effects were included for the individual γ and for the measurement indicator δ, depending on the location and time point of each measurement, for each individual. Normally distributed priors with mean equal to zero and variance equal to 1000 were assigned to the regression coefficients β, and vague inverse gamma priors with shape and scale equal to 0.01 were assigned to the precisions of the random effects. A graphical representation of the model is reported in Fig 1, while a similar, frequentistic approach to the same model was used in [17]. Given the extreme high dimensionality of the omics dataset we pre-selected only the signals which reported p-values smaller than 0.5 in the frequentist analysis. Our choice of threshold was motivated to reduce the high dimensionality in the data and limit the analyses to a lower number of omic signals, as it would be very impractical to consider all metabolic features at the same time. This value is large enough to avoid false negatives. On the other hand, the hierarchical nature of our analyses leads to shrinkage and reduces the issue of false positives, further justifying the use of a large pvalue. To determine whether there was evidence of an association between the signal and the exposure of interest, we used posterior predictive probabilities to calculate Bayesian p-values as tail posterior probabilities. We then corrected for multiple testing using the Bayesian False Discovery Rate (FDR) with level 0.05 [22,23]. This model, and all models reported in this paper, were formulated using JAGS and INLA and the code to implement them is reported in the supplementary material. From now on, we will denote this model as the "naive" model, to indicate the fact that it does not account for the presence of measurement error, while we will use "corrected" for the model which includes a component for measurement error.

Classical measurement error
To model the presence of measurement error, we included an additional component in the model by formulating a classical measurement error for the different pollutants, i.e. assuming that the exposure variable Z can be observed only via a proxy W, such that with U � Nð0; s 2 u Þ. In the presence of a classical measurement error, an attenuation of the effect of the error-prone variable is expected, as the presence of the additional error variance biases the estimates of the regression parameters towards zero [24]. In fact, bearing in mind and assuming that the error in W is independent of m and of any other variables, it is quite straightforward to see that the error-prone regression parameter will be given byb and that the quantity that is estimated is b ?
It is important to underline that, even if an attenuation is usually expected, upward bias is also a possible consequence of classical measurement error even in relatively simple models, due for example to a correlation between covariates [24]. It is therefore necessary to disentangle the error variance from the true variance measured by the proxy in order to obtain unbiased estimates of the regression coefficients. To do so, we formulated a Bayesian hierarchical model, as the hierarchical formulation and the possibility to include prior knowledge provide a flexible way to model measurement error [25,26]. Such formulation was included in the model by adding a latent variable for the exposure, namely a normally distributed variable with mean equal to 0 and variance equal to the error variance. The underlying level of the structure was instead the same as in model (1), resulting in the following hierarchical structure: A vague gamma prior was used for the error variance, with shape and scale parameters equal to 0.01. For all the other variance components and regression coefficients, the same priors were used as in the naive model reported in the previous section.
A graphical representation of the model is reported in Fig 2. 4 Multivariate models

Dependency among omic signals
Another source of imprecision and possible bias in the assessment of the association between omic signals and TRAP exposure is potentially given by the formulation of independent models for each omic feature. Dependency across metabolic features is very likely to occur in practice, first of all because 5749 different features are sampled and analysed from the same 60 individuals, and second because they all reflect metabolic pathways and phenomena that are highly correlated in each individual. The flexibility and the layer structure of Bayesian hierarchical models makes it straightforward to account for such dependency, namely by using a multivariate response for the omic signals and introducing dependency in their variance covariance structure. The resulting model is a generalization of model (3) and was formulated as follows: where the response variable followed a multivariate normal distribution m � N(μ, S e ), with S e denoting the covariance matrix of the omics signals and all variances were given a Wishart prior with shape and scale equal to 0.01, in consistency with the univariate model. Note that this formulation with multivariate signals evaluates all the hypoteses at the same time rather than testing each one separately. This implies that model (4) does not require any correction for multiple testing, unlike the initial univariate model, as it naturally accounts for dependency across omics. Therefore, no Bayesian FDR was used in this context and associations were simply assessed by means of posterior predictive distributions.
To model the presence of the error in the exposure variable, we used the same reasoning as before and introduced a latent variable for the exposure, namely a normally distributed variable with mean equal to 0 and variance equal to the error variance. The error structure was again given by (2), following a classical error structure and a vague gamma prior was used for the error variance, with shape and scale parameters equal to 0.01, again in consistency with the formulation and priors used in the univariate model.
A graphical representation of the model is reported in Fig 3.

Dependency among different exposures
Dependency structures are very likely to occur in such studies also between different environmental exposures, as different works in the literature recently pointed out [27,28]. In particular, all the TRAP measurements show a general common trend based on traffic conditions, as well as weather, humidity and temperature, but also based on other confounding factors registered in each particular day. Moreover, individual-specific factors as activity patterns or time spent in specific locations, as well as respiratory rates can cause a variation in all pollutants at the same time, affecting in our specific case the measurements taken before and after the walk experiment. Such dependency appears to be very frequent between particulate matters of different sizes, as they are induced by similar mechanisms and conditions, besides being measured using the same techniques and instruments, and between PMs and black carbon. The Bayesian hierarchical structures employed in the previous sections can be further generalized to account for such correlation, by modelling more TRAP exposures at the same time. The resulting model was obtained as a generalization of (4) and formulated as follows: All coefficients and variables were modeled as before and the same priors were given to all of them. Again, the flexible formulation of Bayesian hierarchical structures allows to include as many pollutants as desired and to inform the model on their correlation and variances based on our knowledge. In analogy with the first univariate model, we added a latent variable for the multivariate exposure, following a normal distribution with mean equal to 0 and variance equal to the error variance. To account for dependency among pollutants, U was multivariate and modeled as a vector with four dimensions, corresponding to the four different pollutants. The variance covariance matrix of the multivariate exposure, as well as of the multivariate error, was given an uninformative Wishart prior to account for correlation among pollutants.
A graphical representation of the model is reported in Fig 4.

Implementation
We implemented our models using JAGS, as well as INLA. In terms of implementation and computation time, INLA performed faster than traditional MCMC samplers, specifically JAGS, but it showed more issues in the matter of space and memory burdens. In particular, when considering dependency among different omic signals, it was possible to implement the errorcorrected model in INLA with only up to 4 signals, while a substantially higher number of signals were included in the JAGS model. In particular, for the single omic model INLA took about a quarter of the time employed by JAGS. Nevertheless, when including omic-specific measurement error terms, the implementation of more than 4 signals was not possible in INLA, while it was still feasible in JAGS, although taking very long computational times. These aspects should be taken into account when considering studies with larger sample sizes, where such issues will have even higher impact on the analyses. In this matter, dimension reduction techniques might be employed to select only relevant omic signals, prior to the suggested analyses. All code to reproduce the analyses reported above using both frameworks are provided in the supplementary material.

Results
The inclusion of a classical measurement error term resulted in different estimates of the association between omic signals and TRAP measurements. Note that the presence of classical measurement error in pollutant measures can cause bias in different directions, and that the effect, as well as the direction of the error correction, is therefore not clear a priori. Fig 5 shows the estimates of regression parameters β expo obtained by models (1) and (2) for the selected omics. The error corrected models reported stronger associations between TRAP measurements and metabolomic signals, although with higher uncertainty, which reflects the uncertainty on the error component. While the naive models identified after FDR correction respectively 0, 2, 5 and 24 associations for CBLK, PM 25 , PM 10 , NO 2 , the error corrected models only identified 0, 1, 3 and 5 associations respectively, reflecting the propagation of uncertainty about the error into the posterior distributions of parameters of interest. These signals included a signal from the phenylalanine cluster already found to be associated with NO 2 in [17], as well as an unknown signal whose association with PM 10 was also detected by [17]. This also shows that some associations might be affected by an opposite impact and that attenuation is not the only possible consequence of measurement error. This is also reflected by the width of credible intervals which naturally increases when accounting for measurement error, as a consequence of the uncertainty about this component. When more information is available about the prior distribution of the error, such uncertainty will most likely decrease and potentially smaller credible intervals will be identified. Such correction acquired a more defined pattern when accounting for the dependency among different omic signals. In such case, the error corrected model retrieved estimates  which were constantly lower than the estimates from the naive models for most omic signals. When comparing the two naive models, with and without accounting for the dependency among signals, the retrieved estimates did not differ substantially. Fig 6 shows the estimates of regression parameters β expo obtained by model (4) with and without including a level for the error component. In this case the naive model identified respectively 1, 2, 2 and 4 associations for CBLK, PM 25 , PM 10 , NO 2 , while error corrected model identified respectively 6, 1, 4 and 51 associations. Note that the inclusion of a dependency structure among omic signals and of an error term in the model identifies more associations than the univariate models, because the presence of such dependencies can obscure associations among omic signals and TRAP measurements.
When accounting for dependency among signals and pollutants, all signals that were associated with any of the pollutants were considered. The correction reported by the error corrected models was in most cases higher than in the previous models. Accounting for dependency among pollutants and for correlated measurement errors also retrieved more associations than the other models, leading to higher estimates also between signals and pollutants that were not associated in the first cases. Fig 7 shows the difference in estimates of regression parameters β expo obtained by model (5) with and without including a level for the  error component. In this case the naive model identified respectively 18, 0, 25 and 16 association for CBLK, PM 25 , PM 10 , NO 2 , while error corrected model identified respectively 7, 1, 22 and 52 associations. This reflects the fact that ignoring the presence of measurement error can obscure associations between signals and pollutants and lead to an underestimation of effect sizes, as well as identifying potentially false associations. On the other hand, models including correlation structures identify even stronger effect sizes when corrected for measurement error, as well as a generally higher number of significant associations between signals and pollutants, which might be obscured when ignoring the presence of bias. In general, more associations were identified when the correlation among omics and among pollutants were modeled, and their number increased when a measurement error term was additionally included in the multivariate models.

Discussion
We implemented Bayesian hierarchical models to account for the presence of error in measurements of traffic related air pollution. This kind of formulation allows to model several dependency structures in a very flexible way, as well as to include an additional component for measurement error. In our work, we applied such methodology to the study of how TRAP measurements are associated with high-throughput molecular data, namely metabolic features sampled from the exposed individuals in a randomized crossover trial. Our application to the Oxford Street II study showed that the inclusion of a classical error term in the models resulted in corrections of the regression estimates whose extent and direction was not clear a priori, which underlines the importance of explicitly modelling the error component rather than predicting its effect based on prior beliefs. Regression estimates were also corrected by the inclusion of dependency structures in the models, namely dependency among different omic signals and among pollutants. The explicit formulation of such models was possible thanks to the flexible structure of Bayesian hierarchical models, and it was relatively straightforward to embed dependency and measurement error correction in the same hierarchical structure. This is certainly a major advantage of using Bayesian hierarchical models, which provide a general adaptable way to formulate a broad range of models and structures, as in our case measurement error or dependency structures, and more generally any additional random effect which might be needed in the analysis. Moreover, the use of a Bayesian framework allows to incorporate prior knowledge in the analysis, for example about the error component and parameters, as well as to reflect the prior uncertainty in the posterior distributions of parameters of interest. This requires some knowledge about the error component, in order to properly formulate the measurement error level and to assign reasonable priors. Note that this requirement is not specific to the Bayesian framework, but rather to any error modelling strategies. In fact, it is always necessary to know the error structure (i.e. classical measurement error and its distribution), as well as the error variance, in order to formulate an identifiable error model [29]. In practice, it is not always straightforward to obtain such information, and often assumptions about the error distribution and parameters are vague or potentially incorrect. The advantage of Bayesian models is that the uncertainty in such assumptions can easily be accounted for and propagated to posterior distributions of corrected estimates.
In comparison to general traditional analyses of the association between metabolite levels and air pollution, and specifically to the work of [17], our models allow to account for the presence of measurement error. As a consequence, the estimates of associations between metabolite levels and TRAP exposure are corrected and their absolute values are substantially higher., i.e. they indicate a stronger effect of TRAP on metabolomic features compared with no error adjustment. This practice, and the difference in the results when compared to traditional analyses, show also that ignoring the error can lead to a loss of reproducibility of the studies and to potentially inconsistent results. Moreover, -instead of considering each omic signal separately, we propose to model different signals jointly, so that their correlation is accounted for explicitly. The inclusion of a correlation among exposures is also new with respect to [17]. While traditional analyses are useful to select candidate signals associated with TRAP exposures, the models we suggest in this work explore the texture of this association in more detail, and correct for biases and correlation structures that might distort their quantification and direction. The models we propose use a Bayesian framework to model measurement error and correlation structures based on the data at hand and to incorporate prior beliefs with the evidence given by the observed data. Doing so, it is possible to account for dependencies among different omics and among pollutants, which can sometimes obscure associations between them and which were not taken into consideration in the traditional model, where associations were assessed separately for each omic signal with each pollutant. Moreover, we included a term for measurement error, which was not accounted for in the traditional model and which can also lead to a biased estimation of these associations.
With such framework, it is also possible to treat more general models, for example in terms of error structures. It is indeed sufficient to modify the error level in the hierarchical structure, for instance into a Berkson model. Moreover, given the general formulation of Bayesian hierarchical models, it is possible to include more complex dependency structures, for example spatial correlation between pollutants, which can be easily included in the Bayesian structure, especially with INLA [30]. Another possible extension of the method can also be the inclusion of different types of molecular data, which would require a specific and more complex dependency structure among omic signals.
Supporting information S1 Appendix. Code for implementation of all models. All models are implemented in JAGS in the following files, where "naive" stands for the model with no error correction, "multiomics" for the multivariate model with dependency among omic signals and "multiexpo" for the multivariate model with dependency among exposures.