Model choice for estimating the association between exposure to chemical mixtures and health outcomes: A simulation study

Challenges arise in researching health effects associated with chemical mixtures. Several methods have recently been proposed for estimating the association between health outcomes and exposure to chemical mixtures, but a formal simulation study comparing broad-ranging methods is lacking. We select five recently developed methods and evaluate their performance in estimating the exposure-response function, identifying active mixture components, and identifying interactions in a simulation study. Bayesian kernel machine regression (BKMR) and nonparametric Bayes shrinkage (NPB) were top-performing methods in our simulation study. BKMR and NPB outperformed other contemporary methods and traditional linear models in estimating the exposure-response function and identifying active mixture components. BKMR and NPB produced similar results in a data analysis of the effects of multipollutant exposure on lung function in children with asthma.


Introduction
Individuals are continuously exposed to complex mixtures of environmental chemicals. Mounting evidence from epidemiological studies links environmental exposures to increased morbidity and mortality [1][2][3][4][5]. Traditional epidemiological studies have focused on a single pollutant and additive models with a small number of exposures; however, studying pollutants in isolation can lead to biased estimates [6,7] and does not reflect the reality that people are jointly exposed to mixtures of pollutants. Hence, interest is rapidly growing in studying health outcomes associated with simultaneous exposure to mixtures of pollutants [8,9]. The National Institute for Environmental Health Sciences (NIEHS) identified the study of mixtures as a goal in its 2012-2017 strategic plan while noting that this will require novel quantitative approaches [10]. As such, numerous statistical methods have been proposed. There is a need to identify the most appropriate statistical methods currently available for estimating health outcomes associated with exposure to mixtures [11,12]. Studying health outcomes associated with exposure to mixtures is complicated by small effect sizes, highly correlated exposures, possible nonlinear and interaction effects, and often small sample sizes. In this context, traditional regression methods are often inadequate as they may yield biased or unstable estimates [13] and have low power to detect effects, especially in the case of nonlinear associations and interactions. Common methods designed for variable selection tend to incorrectly select predictors when many predictors are highly correlated [14] and classical model selection techniques ignore uncertainty in both the selected model and selected mixture components when estimating the exposure-response function [15,16].
In a broad literature review, Davalos et al. [17] identified five classes of methods currently used in mixtures analyses: additive main effects (AME), effect measure modification (EMM), unsupervised dimension reduction (UDR), supervised dimension reduction (SDR), and nonparametric (NP). AME and EMM methods are typically regression based. AME allows only additive effects, while EMM includes multiplicative interactions. Hierarchical and penalized regression methods have been applied to AME and EMM models to identify important mixture components and improve precision [18][19][20][21][22]. The next two groups are dimension reduction techniques (UDR and SDR) that transform exposure data to reduce the dimension of the predictor and, therefore, the required parameter space. UDR methods such as k-means [23,24] transform exposure data without regard to the health outcome [25][26][27][28]. SDR methods, including supervised principle components analysis [29], let the outcome inform exposure data transformation [30][31][32][33][34][35]. Finally, NP methods like Bayesian kernel machine regression [36] are flexible data-driven techniques for estimating a complex exposure-response function that may include interactions and nonlinear effects [37,38].
Choosing an appropriate statistical model depends on the research objectives [11,39] and requires understanding the empirical performance of methods. Recent studies have compared several methods in subsets of the model classes proposed. Among those evaluated include linear regression AME [40] and EMM methods [14], principle components analysis [34], structural equation models [41], Bayesian kernel machine regression [41], and Bayesian semiparametric regression [42]. These studies highlight challenges induced by highly correlated data in estimating complex exposure-response functions and characterizing uncertainty. To our knowledge, there has been no formal evaluation of methods from all five classes identified by Davalos et al. [17] in a single simulation study.
Evaluating the empirical performance of methods across a wide spectrum of model classes is important as it guides researchers in choosing across classes of models and aids in interpreting results and understanding the limitations of epidemiological studies using these methods. In addition, the existing literature is sparse with regards to a comparison among Bayesian methods, which are favorable in the multipollutant setting as they can incorporate prior information and fully characterize uncertainty [12,34,43]. To this end, we focus on a comparison of Bayesian methods across a variety of model classes in this paper. By comparing performance across classes of models, researchers can also gain insight into promising future directions for statistical methods development.
Motivated by research linking mixtures of air pollutant and pesticide exposures to child respiratory health, we conducted a simulation study to compare contemporary methods developed for estimating the association between health outcomes and exposure to mixtures. We considered one method from each of the five classes identified by Davalos et al. [17] and evaluated each method in three data-generating scenarios. The data-generating scenarios cover a range of linear to nonlinear functions of multiple pollutants with synergistic effects on the response in order to test each method in its ability to estimate both simple and complex exposure-response functions that may be encountered in practice.
In contrast to many recent studies that have compared methods from a conceptual standpoint or compared their performance in the analysis of a single data set, the primary contribution of our work is to compare diverse methods in a simulation study addressing a variety of research questions. Specifically, we quantified four aspects of model performance corresponding to previously identified epidemiological questions of interest: 1) how well does the model estimate the exposure-response function, 2) can the model identify important mixture components, 3) can the model identify components not associated with the outcome, and 4) can the model identify interactions among exposures [39].
A secondary contribution of our work is to provide software for the tested methods that currently lack software. Our simulation study describes the strengths and weaknesses of each method and available software encourages practitioners to use the most appropriate methods in a given application. Software is available in the form of the R package mmpack [44] to reproduce the simulation. Further, the software allows researchers to easily conduct a simulation study using the same methods and simulated exposure-response functions but substituting in their own exposure data which will have a different correlation structure and may result in different model performance. Hence, researchers can determine which methods are most appropriate for their own study. Finally, we applied each method to a data analysis of a cohort study investigating the relationship between air pollutant and pesticide exposures and lung function in children with asthma. We describe the differences in results among the methods, highlighting the importance of model choice.

Data
Health data. This study was approved by the Institutional Review Board of Colorado State University, Protocol Number 19-9437H. This was a secondary data analysis from a closed cohort with all personal identifying information stripped from the database. We used data from Fresno Asthmatic Children's Environment Study (FACES). The study design, including recruitment, eligibility criteria, and measurement procedures, is described elsewhere [45][46][47][48][49][50]. FACES includes data for children aged 6-11 years with asthma symptoms at the time of enrollment and living within a 20 kilometer radius of one of Fresno's EPA air quality monitoring sites. The health outcome of interest was baseline forced expiratory volume in the first second (FEV 1 ) measured via spirometry. We regressed FEV 1 on age, sex, height and ethnicity and used the residuals as the outcome in our data analysis [51][52][53]. Age, sex, height, and ethnicity are well-known predictors of FEV 1 so we remove all variation from these predictors before looking into the effects of air pollution and pesticide exposure on FEV 1 . Other covariates have not been as well studied regarding their association with FEV 1 and are including the model as potential confounding variables. Complete exposure, health, and covariate data were available for 153 children.
The data contain information on covariates and potential confounding variables (S1 Table in S1 File). We included average temperature and precipitation over three months, the temporal scale of the pesticide exposure data, prior to baseline as covariates. Subject-specific covariates include body mass index (BMI, kg/m 2 ) and indicators for: self-reported residence within one block of a freeway, any smoking in the home, positive atopy or allergy test, modified Global Initiative for Asthma (GINA) score � 3 at baseline, household income greater than $30K/year, mother having post-secondary education, child not covered by insurance, and season of baseline spirometry test. Temperature, precipitation, and BMI were scaled to have mean 0 and variance 1. Approximately 1% of the covariate data was missing, including any smoking in the home (16%), household income (3%), and mother having post-secondary education (1%). As all covariates with missing data were binary variables, we singly imputed the missing values with 0 and then added a dummy variable for each covariate with any missing data that indicated which values of that covariate were missing.
Air pollution and pesticide data. We obtained air pollution data from the EPA Air Quality System Data Mart. Air pollutant concentrations were calculated as 24-hour averages for particles � 2.5 μm in aerodynamic diameter (PM 2.5 ) and particles � 10 μm in aerodynamic diameter (PM 10 ), 8-hour daily maximum levels for ozone (O 3 ) and one-hour daily maximum levels for nitrogen dioxide (NO 2 ) [47]. Concentrations were taken from the air monitoring site closest to each child's residence and exposure levels were summarized as averages over three months prior to baseline spirometry tests to be consistent with available pesticide exposure data. Due to right-skewed distributions, air pollutant exposures were square-root transformed and then scaled to have mean 0 and variance 1.
We obtained data on the date, location, and amount (kilograms) of applied agricultural pesticides from the California Pesticide Use Report (PUR) [54]. Based on previous evidence linking pesticide exposure to respiratory illness [55,56], we considered three pesticide classes: carbamates (C), methyl bromide (MeBr), and organophosphates (OP). Pesticide exposures were estimated using the purexposure [57] package in R. We applied inverse distance weighting to the total reported pesticide amount over three months prior to baseline spirometry tests (as PUR reports are aggregated quarterly) to estimate exposures within a 3km buffer of each child's residence. Pesticide exposures were also highly skewed and so were square-root transformed and then scaled to have mean 0 and variance 1.

Statistical methods
Our primary interest was to estimate the association between exposures to p pollutants x i = (x i1 , . . ., x ip ) T and a continuous outcome y i , while controlling for q potential confounders w i = (w i1 , . . ., w iq ) T in a sample i = 1, . . ., n. We considered five recently proposed methods. The first two are the AME model nonparametric Bayes shrinkage with main effects only (NPBr) and the EMM model nonparametric Bayes shrinkage with main effects and all pairwise multiplicative interactions (NPB) as proposed by Herring [19]. The next two models are unsupervised (UPR) and supervised Bayesian profile regression (SPR) as proposed by Molitor et al. [58]. The fifth is the NP model Bayesian kernel machine regression (BKMR) [36]. We chose these methods since they represent the five classes identified by Davalos et al. [17] and are recently developed Bayesian methods for estimating health outcomes associated with exposure to mixtures. These five methods cover a variety of exposure-response function shapes, handle multicollinearity in various ways, and include options for variable selection. BKMR is presented exactly as proposed by Bobb et al. [36]; NPB and SPR have been modified to accommodate the continuous outcome with normal residuals rather than the logistic model originally proposed by Herring [19] and Molitor et al. [58], respectively; and NPBr and UPR are further modifications of those previously introduced methods. For a baseline comparison, we also included a normal linear model with main effects only (LM) and with all pairwise interactions (LM-int), both estimated with least squares. All models considered in this paper have the form where � i are independent N(0, σ 2 ) and h(x i ) represents the exposure-response function. All models were fit in R version 3.6.0 [59]. Nonparametric Bayes shrinkage. Nonparametric Bayes shrinkage [19] was originally introduced as a logistic regression EMM model and was adapted to the linear regression setting used here. We consider two variations. NPB, originally proposed by Herring [19] is an EMM model including main effects and all pairwise interactions, where NPBr is a reduced AME model not originally proposed in Herring [19] that includes only main effects: Both models place a Dirichlet Process (DP) prior on regression coefficients. The base distribution of the DP is a finite mixture of a normal distribution and a point mass at 0 to induce sparsity in the model. Hence, some coefficients are set exactly to 0, effectively selecting out variables that do not contribute to the health outcome. Correlated exposures can be clustered and assigned equal regression coefficients to reduce variance [19,60]. This effectively reparameterizes the model to have a single effect for the sum of two correlated predictors and is particularly advantageous in situations where it is difficult to differentiate the effects of two highly correlated predictors. The DP prior for main effects is constructed as: where δ 0 represents the Dirac delta function at 0. The model is completed with standard hyperpriors The DP prior for interactions is similarly constructed. Specifically, The hyperpriors on α 2 , π 02 , μ 2 , and � À 2 2 come from the same families specified for the main effects. The distributions on the main effects and interactions are independent a priori.
Posterior inclusion probabilities (PIPs) are calculated for each mixture component as the posterior probability of the regression coefficient being assigned a non-zero value. Both NPBr and NPB were fit using the R package mmpack [44].
Bayesian profile regression. Bayesian profile regression is a dimension reduction approach that classifies pollutant exposure profiles, x i , into a parsimonious set of clusters using a DP mixture model (DPMM) [58,61]. Each cluster represents a group of observations with similar exposure levels across the vector of pollutants. The health outcome is regressed on cluster indicators to estimate if profile x i is assigned to cluster c. We introduce a latent variable z i = c if exposure profile i is assigned to cluster c. Conditional on cluster assignment, the model for an individual exposure profile is The DPMM for cluster assignment places a truncated stick-breaking prior on the assignment probabilities to each cluster. The stick-breaking process and cluster assignment model are Subject to a maximum of C clusters, the DPMM allows the number of non-empty clusters to be estimated from the data. To identify the most optimal partitioning of the data, we follow the approach described in Dahl [62] and Molitor et al [58]. First, we construct an n × n score matrix at each iteration with a 1 in the i, j location if individuals i and j belong to the same cluster and a 0 otherwise. Then we calculate a probability matrix S by averaging the score matrices.
The most optimal clustering is the clustering from the MCMC iteration that has a score matrix with minimum least squared distance to the probability matrix S. We calculate model averaged estimates of the cluster-specific parameters θ c to incorporate the uncertainty present in the best clustering [58]. The model has been extended to include variable selection to identify mixture components actively contributing to cluster assignment [63][64][65]. Briefly, binary random variables are introduced that indicate whether the mean for a mixture component within a cluster is unique to that cluster or common among all clusters. Hence, mixture components that are selected into the model are interpreted as being informative in partitioning the exposure data into clusters, but are not necessarily related to the health outcome.
We consider two variations of profile regression. The first, supervised profile regression (SPR), originally introduced by Molitor et al. [58] belongs to the SDR class of methods since cluster assignments are influenced by the health outcome. The second is an unsupervised adaptation (UPR) not originally proposed by Molitor et al. [58] that belongs to the UDR class. The difference between the two variations manifests when the latent cluster assignment variable z i is updated. In the supervised case, we jointly model the response and estimate cluster assignments. Hence, there is feedback between the health outcome model and the profile assignment model where the health outcomes can influence cluster assignment. The full conditional for z i depends on both the likelihood of exposures x i and the likelihood of the response y i : Hence, in SPR, individuals with similar exposure profiles but different health outcomes may be assigned to different clusters depending on their responses.
The unsupervised case involves a two-step procedure where we first estimate cluster assignments independently of the response and then model the response conditional on cluster assignment. Here, z i depends only on the exposure likelihood: Since the response does not inform cluster assignment in UPR, there may be high uncertainty in the estimates of the cluster indicators θ c if individuals with similar exposure profiles have very different health outcomes. We fit SPR using the R package PReMiuM [65] and UPR using the R package mmpack developed for this paper [44].
Bayesian kernel machine regression. Bayesian kernel machine regression (BKMR) [36] belongs to the NP class of methods and flexibly models the exposure-response function to allow for nonlinear associations and higher order interactions. In BKMR, h(x) is a smooth function represented using a Gaussian kernel. The response is modeled as hyperparameter, and r = (r 1 , . . ., r p ) T are variable selection parameters. Estimated health outcomes for individuals with similar exposure levels across the p predictors are shrunk towards each other, resulting in a smooth but flexible exposure-response function. BKMR allows for both component-wise and hierarchical variable selection (HVS) to identify important mixture components. In our simulation and data analysis, we implemented component-wise variable selection and calculated PIPs for each exposure. We also implemented HVS in our data analysis to address sensitivity of results. We partitioned the mixture components into groups of air pollutants (PM 2.5 , PM 10 , NO 2 , and O 3 ) and pesticides (C, MeBr, and OP) and calculated PIPs for each group (group PIPs) and each component within a group, conditional on group inclusion (conditional PIPs). We fit BKMR using the R package bkmr [66].

Simulation study design
We evaluated the proposed methods in a simulation study. We ensure a realistic correlation structure among the pollutants by using the observed exposure data from 153 individuals in the FACES data set in our simulation study. We also use the observed covariate data in our simulation study. Health responses were simulated for three exposure-response scenarios, denoted h k , k = 1, 2, 3, as y i ¼ h k ðx i Þ þ w T i g þ ε i , with ε i * N(0, 1). The covariate coefficients γ 1 , . . ., γ q were simulated as independent N(0, 1).
The first scenario, h 1 (linear), is an EMM model. For exposures x j , j = 1, . . ., 4, the exposure-response function is Second, h 2 (nonlinear) includes nonlinear sigmoidal functions of three pollutants and a multiplicative interaction between two of those pollutants: Last, h 3 (fixed profiles) groups individuals into four distinct clusters based on dichotomous cut-offs for two pollutants. We assign a constant health effect to individuals in the same cluster: We selected these three exposure-response scenarios to cater to different methods in our simulation study. The linear scenario plays to NPBr and NPB, the nonlinear scenario plays to BKMR, and the fixed profiles scenario plays to UPR and SPR. We hypothesize that the methods to which each scenario caters will perform best in that scenario. We are interested in evaluating how methods perform in exposure-response scenarios for which they were not explicitly developed.
We simulated 200 data sets for each scenario and fit all five Bayesian methods plus LM and LM-int. As results can be sensitive to which pollutants, x j , j = 1, . . ., 4, are included in h(x), we randomly selected pollutants to be the active components in each simulated data set. All pollutants, even those not selected as one of the active components, are included as inputs in the estimated models. By randomly selecting which exposures are the active components of the mixture, each simulated data set has a different correlation structure among the active exposures, which adds robustness to our simulation study results. We calculated the Calinski-Harabasz index [67], the silhouette statistic [68], and the number of clusters to maximize the gap width [69] to measure the grouping structure of the data generated by the fixed profiles scenario. Although the exposure data remains the same for each data set, the exposures used in the exposure-response function differ for each data set; hence the clustering in the fixed profiles scenario, which is based on the response, differs for each data set. Across the 200 data sets used in our simulation study, the median Calinski-Harabasz index was 22.54, the median silhouette was 0.15, and the median number of clusters to maximize the gap width was 6. The distribution of each of these statistics can be found in S2 Table in S1 File. In general, the fixed profiles scenario did not always generate a strong grouping structure with this data, but instead represents a wide variety of clustering schemes.
We evaluated exposure-response function estimation using root mean squared error (RMSE) and interval coverage (Cvg). RMSE was calculated as ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 n P n i¼1 ½hðx i Þ Àĥðx i Þ� 2 q and coverage was calculated as the percent of h(x i )'s covered by 95% credible or confidence intervals. RMSE measures the variation between estimated and true values of the exposure-response function. Coverage measures how often the 95% credible or confidence intervals for the estimated exposure-response function capture the true mixture effect. A method with high RMSE and low coverage fails to capture the overall mixture effect. In this way, RMSE and coverage measure the ability of each method to capture the overall mixture effect. We summarized variable selection through true and false selection rates. In the Bayesian methods, we consider a variable with a PIP above 0.5 as selected into the model [70]. In LM and LM-int, a variable is selected if the 95% confidence interval for the respective regression coefficient does not contain 0. We calculated true selection rate (TSR) as the proportion of mixture components active in the exposure-response function as main effects that were selected into the model as main effects, and false selection rate (FSR) as the proportion of mixture components not in the exposure-response function as main effects that were selected into the model as main effects. All seven exposures are included in the models as inputs, but the active mixture components are those that define the exposure-response function for each simulated data set. For scenario 1, the active main effects are the randomly selected exposures denoted by x 1 , x 2 , x 3 , and x 4 ; for scenario 2, the active main effects are x 1 , x 2 , and x 3 ; and for scenario 3 the active main effects are x 1 and x 2 . In most methods (NPBr, UPR, SPR, BKMR, and LM), TSR and FSR are calculated only for main effects. In NPB we can calculate PIPs for interactions and in LM-int we can calculate confidence intervals for the interaction effects. Hence, we also evaluate variable selection rates for interactions in NPB and LM-int. We calculate true selection rate for interactions (TSR int ) as the proportion of the exact pairwise interactions active in the exposure response function that were selected into the model as interactions, and false selection rate for interactions (FSR int ) as the proportion of interactions that were not active in the exposure-response function that were selected into the model as interactions. In scenario 1, the true active interactions are x 1 x 2 and x 3 x 4 and in scenarios 2 and 3 the only active interaction is x 1 x 2 .
We assessed convergence for a few simulated data sets by visualizing trace plots and comparing results from multiple chains. We found evidence of convergence by 20,000 iterations for all methods. To ensure convergence across all simulated data sets, we based inference on 25,000 samples after a burn-in of 25,000 samples.
We conducted three additional simulation studies to further assess method performance. First, we considered a null scenario, h 4 (x), where none of the exposures are associated with the response. Second, we considered a complex mixture scenario, h 5 (x), where we simulated data for seven additional pollutants to have a total of 14 mixture components. Third, we applied our original simulation study design to a larger sample of size n = 1000 for each data set. Details on the additional simulations can be found in S1 Appendix in S1 File.

Data analysis
We conducted a data analysis on 153 individuals with complete data in the FACES data set. We used regression-adjusted FEV 1 as the outcome. S1 Table in S1 File summarizes the characteristics of the sample. Pesticide and air pollutant exposures and covariate data were identical to that in our simulation study (Tables 1, 2, and S1 Table in S1 File). We fit the same models as in the simulation study. Prior specification for the Bayesian models is listed in S2 Appendix in S1 File.

Simulation study results
Simulation results are shown in Table 3. Standard errors are shown in S3-S5 Tables in S1 File. We show the computational time for each method to run for 5000 iterations in Table 4.
Overall BKMR and NPB were the best performing methods with BKMR performing slightly better in the nonlinear and fixed profiles scenarios. Regarding RMSE for the exposure- The story is more complex when it comes to variable selection. While BKMR had the highest TSR in all three scenarios, it also had the highest FSR. Again, NPB performed very well in the linear scenario but not as well in the other scenarios, while UPR and SPR had consistently poor selection rates. Regarding TSR, BKMR (TSR = 1.00) and NPB (TSR = 0.92) performed best in the linear scenario. BKMR had the highest TSR in the nonlinear scenario (TSR = 0.96), where the next best methods, NPBr, NPB, and LM, all had mean TSR just under 0.80. BKMR is singled out with the best TSR in the fixed profiles scenario (TSR = 0.97). UPR, SPR, and LM-int tended to have low TSR in all three scenarios. A low false selection rate indicates a model does not erroneously classify exposures as associated with the outcome when they are not. Here, BKMR had some of the highest FSR across the three scenarios. In the linear scenario, LM-int (FSR = 0.04) and NPB (FSR = 0.10) had the lowest FSR. LM-int also had the lowest FSR in the nonlinear scenario (FSR = 0.08). In the fixed profiles scenario, NPBr, NPB, LM, LM-int all had similar FSR at or below 0.14. Along with BKMR, SPR had high FSR in all three scenarios.
When considering overall variable selection performance, NPB takes the top spot in the linear scenario, with high TSR and low FSR. No method was able to simultaneously achieve dominant TSR and FSR in the nonlinear or fixed profiles scenarios.
Only NPB and LM-int directly parameterized variable selection for interactions in an easily interpretable manner. Interpretable variable selection for interactions is itself an advantage of these approaches over the other methods. In the linear scenario, NPB (TSR int = 0.59) had higher TSR int than LM-int (TSR int = 0.32). Both methods had poor TSR int in the nonlinear and fixed profiles scenarios, with values at or below 0.25. Regarding FSR int , both methods performed well in all three scenarios, with FSR int consistently at or below 0.11.
The additional simulations produced similar results, with NPB and BKMR being consistently top-performing methods in terms of estimating the exposure-response function and identifying active mixture components. In the null scenario, NPBr and NPB had lowest FSR, meaning these methods were the best at not selecting any mixture components into the model when none were associated with the response (S6 Table in S1 File). Results from the complex mixture scenario largely mirrored those from the linear scenario (S7 Table in S1 File). BKMR and NPB remained top-performing in the larger sample size simulation and TSR improved for all methods. Here, UPR and SPR had high TSR and FSR, meaning they often selected all of the mixture components into the model (S8 Table in S1 File).

Data analysis results
The results from our analysis of the FACES data set varied across the methods. First we con- 0.08, 0.93) and an interaction between C and PM 2.5 (b: 0.28, CI: 0.01, 0.54) ( Table 5). The Table 5. Results from analysis of FACES data set using LM and LM-int. Table includes effect estimates (b), 95% confidence intervals, and associated p-values for all main effects in LM and LM-int plus the interaction effects in LM-int with p-values � 0.10. The regression coefficientb is the expected change in FEV 1 for a 1 standard deviation increase in the square root transformed exposures.

PLOS ONE
results from the linear models indicating a protective effect of PM 10 are counter-intuitive as there is an extensive literature on the deleterious health effects of PM on lung function. None of the other methods found evidence of protective effects for any of the exposures. Next we consider the five contemporary methods. NPBr did not identify any exposures with PIPs above 0.5. The exposure with the highest PIP was NO 2 (PIP = 0.47), which was estimated to be negatively associated with FEV 1 (b: -.08, CI: -0.35, 0.00). In NPB, NO 2 was selected (PIP = 0.60) and was also negatively associated with FEV 1 (b: -0.12, CI: -0.36, 0.00) ( Table 6). No other main effects or interactions were selected by either method (S10 Table in S1 File).
In BKMR, NO 2 was selected as an important mixture component with a PIP of 0.96 (S11 Table in S1 File). No other exposures had PIPs above 0.5. Results were similar using the HVS formulation (S12 Table in S1 File). NO 2 had a negative and nonlinear association with FEV 1 (Fig 1). To identify interactions, we plot the posterior distribution of the exposure-response function for each pair of exposures, holding all other exposures constant at their median values, and visually inspect changes in the response as both exposures change. In doing so we found no notable interactions among exposures (S1 Fig in S1 File).
As clustering algorithms, UPR and SPR reveal a different kind of story. UPR revealed four clusters as the best partitioning of the data. Each cluster had similar estimated health effects (Fig 2a); hence, despite partitioning the exposure space there was no meaningful association between the exposure profiles and the health outcome. Fig 2b-2e shows the empirical exposure means for individuals assigned to each cluster. The first cluster of n = 25 individuals was distinguished by higher than average exposure to MeBr. Cluster 2 (n = 33) had low exposure to OP and O 3 and high exposure to NO 2 and PM 2.5 relative to the average. The third cluster (n = 9) was characterized by relatively high exposure to OP and low exposure to O 3 . Individuals in cluster 4 (n = 86) had nearly average exposure to most pollutants except MeBR, which was notably below average; in addition, O 3 exposure was slightly above and PM 2.5 exposure was slightly below average. UPR selected OP (PIP = 0.57), O 3 (PIP = 0.54), NO 2 (PIP = 0.61), and PM 2.5 (PIP = 0.56) as important mixture components (S13 Table in S1 File).
SPR also revealed four clusters as the best partitioning of the data. The estimated exposureresponse function for cluster 3, the smallest cluster (n = 9), had a 0.97 posterior probability of being greater than the overall mean estimated exposure-response function (Fig 3a). The cluster sample sizes and associated empirical exposure means were very similar to those in UPR (Fig  3b-3e), with the labels switched for clusters 1 and 4. In both UPR and SPR, cluster 3 was the smallest cluster and had an estimated mean health effect higher than average, but there was more uncertainty around the health effect in UPR likely due to the two-stage approach for    Table in S1 File). We found the clustering and PIPs in UPR and SPR to be sensitive to prior choice particularly for the cluster-specific precision matrix and error precision.

Discussion
Interest is rapidly growing in estimating the association between exposure to mixtures of environmental chemicals and health outcomes. As a result, new statistical approaches have been developed for studying health outcomes associated with exposure to mixtures. The purpose of this paper was to evaluate and compare recently developed methods for mixtures and determine which research questions they answer well and in which scenarios. We limited our study to contemporary Bayesian methods since they are under-studied, under-utilized, and may have the ability to answer multiple research questions. Our results highlight the advantages of the flexible modeling and Bayesian framework of BKMR and NPB in estimating the exposureresponse function precisely and identifying mixture components most strongly associated with the health outcome.
Overall, BKMR was a top-performing method. In each of the scenarios, BKMR estimated the exposure-response function with coverage closest to the nominal level (0.95) and lowest RMSE. Despite being a more flexible approach based on Gaussian processes, BKMR had lower RMSE in the linear scenario than NPBr, LM, and LM-int, all of which assume linearity. This is likely because NPBr and LM do not account for interactions and LM-int can result in inflated standard errors in the presence of correlated data. BKMR identified active mixture components with the greatest frequency, but also included inactive components more often than other methods. Although we did not evaluate variable selection rates for interactions in BKMR in our simulation, BKMR can identify linear or nonlinear interactions among exposures through visualization or summarizing the posterior distribution of the exposure-response function. A drawback to BKMR is that results are not as easily interpreted as in NPB or the linear models, though there are currently efforts to enhance interpretation and a suite of visualization approaches that aid in interpretation. BKMR is an appealing choice for mixtures because it makes minimal assumptions on the shape of the exposure-response function and includes a sophisticated variable selection algorithm for identifying important mixture components.
NPB was top-performing in the linear scenario regarding estimating the exposure-response function, identifying both active and inactive mixture components, and identifying interactions. NPB performed well even when the exposure-response function was mildly nonlinear, but lacks the flexibility of BKMR for the fixed profiles scenario, which is highly nonlinear. The AME method NPBr poorly estimated the exposure-response function in the linear scenario, likely from not accounting for interactions. An advantage of NPB is its ease of interpretation, which is similar to interpreting a linear regression model. NPB estimates PIPs and effect sizes for all main effect and interaction terms, providing precise information regarding the contribution of each exposure to the mixture and its effect on the health outcome.
The profile regression methods, UPR and SPR, poorly addressed the research questions of interest in all three scenarios. Two explanations for this include lack of a clustering structure in the exposure data and a weak signal, both of which inhibit these methods from accurately estimating the multipollutant exposure-response function. Further, UPR and SPR do not have the ability to identify or estimate interactions or tease out individual effects of the pollutants within a mixture. These methods may not be appropriate for studies in which the primary objectives are to estimate the multipollutant exposure-response function and identify driving mixture components. As clustering methods, UPR and SPR are likely to perform better on data that has a strong grouping structure. Since we used a single data set in our simulation study, the results of our simulation should not be interpreted as representative of performance on all data structures. A particular advantage of UPR and SPR is that the number of clusters need not be pre-specified.
The linear model with interactions, LM-int, had coverage above 0.91 in all three scenarios, but had higher RMSE and lower TSR than BKMR and NPB. LM-int and NPB are both EMM methods, and NPB outperformed LM-int in the linear EMM scenario. LM and LM-int have the advantage of being easy to implement and interpret, but these methods estimated the exposure-response function with more uncertainty than the top-performing methods and generally lacked the ability to select truly active mixture components, likely due to high correlation among exposures.
In our application to the FACES data set, four methods (LM, LM-int, NPB, and BKMR) identified NO 2 as an important mixture component negatively associated with the health outcome. In addition, LM and LM-int estimated PM 10 to have a positive association with FEV 1 , and PM 10 was positively correlated with NO 2 . Further, the magnitude of the effect estimate for NO 2 in LM and LM-int was several times larger than that estimated in NPB, and the confidence intervals were also larger, reflecting more uncertainty. UPR and SPR also identified NO 2 as an important mixture component, but we cannot determine the sign of effect using these methods. Instead, UPR and SPR have the ability to estimate how the overall mixture is associated with the health outcome. UPR revealed four clusters with similar estimated health effects; hence, patterns in the exposure data were not associated with FEV 1 . In SPR, the smallest cluster was associated with higher average FEV 1 than the other clusters, suggesting an association between a relatively rare mixture of exposures and the health outcome. Alternatively, this small cluster may reflect a strong influence from the health outcome in the clustering using a supervised learner. Meanwhile, BKMR was able to describe a nonlinear association between NO 2 and FEV 1 .
Using missing indicators may have introduced some bias in the effect estimates. Additionally, all Bayesian methods are sensitive to prior specification and results may vary with more or less informative priors. PIPs are particularly sensitive to prior specification in all methods, so changing prior hyperparameters may lead to changes in TSR and FSR. We implemented all models using the default priors as specified by the authors to obtain an objective comparison of these methods.
Along with the primary research question, the best performing method is likely to depend on the exposure data. We used observed exposure data so our results are highly relevant to realistic settings. Our simulation results can be generalized to small data sets with a limited number of localized exposures, which is a frequent scenario in epidemiological studies.
In analyses of environmental mixtures and human health, model choice depends on the assumed exposure-response relationship and the primary questions of interest. NPB and BKMR are recently proposed methods that outperformed traditional regression models and offer promising tools for mixtures analyses. We recommend NPB when the exposure-response function is assumed to be approximately linear and a primary goal is accurately identifying which are the active and inactive components of the mixture. NPB is also highly interpretable and explicitly tests for interactions. We recommend BKMR if the exposure-response function is assumed to take on a complex form and the primary goal is estimating the form of the exposure-response function while at the same time identifying important mixture components. Our results suggest that UPR and SPR do not reliably answer our specified research questions, but may be applicable for different research questions such as pattern recognition. We further encourage users to take advantage of our R package mmpack [44] to replicate the simulation and determine how each method performs on their own data. Results will likely be different on different data sets. In particular, the profile regression methods may perform better on data that exhibits a stronger clustering structure in the fixed profiles scenario. We include the clustering statistics as part of the summary of the fixed profile scenario output so users can see how much grouping structure is in their own data. Replicating the simulation on their own data will enable users to choose the best method for their data and specific research question.