Application of ensemble methods to analyse the decline of organochlorine pesticides in relation to the interactions between age, gender and time

Organochlorine pesticides (OCPs) are toxic chemicals that persist in human tissue. Short and long term exposure to OCPs have been shown to have adverse effects on human health. This motivates studies into the concentrations of pesticides in humans. However these studies typically emphasise the analysis of the main effects of age group, gender and time of sample collection. The interactions between main effects can distinguish variation in OCP concentration such as the difference in concentrations between genders of the same age group as well as age groups over time. These are less studied but may be equally or more important in understanding effects of OCPs in a population. The aim of this study was to identify interactions relevant to understanding OCP concentrations and utilise them appropriately in models. We propose a two stage analysis comprising of boosted regression trees (BRTs) and hierarchical modelling to study OCP concentrations. BRTs are used to discover influential interactions between age group, gender and time of sampling. Hierarchical models are then employed to test and infer the effect of the interactions on OCP concentrations. Results of our analysis show that the best fitting model of an interaction effect varied between OCPs. The interaction between age group and gender was most influential for hexachlorobenzene (HCB) concentrations. There was strong evidence of an interaction effect between age group and time for β-hexachlorocyclohexane (β-HCH) concentrations in >60 year olds as well as an interaction effect between age group and gender for HCB concentrations for adults aged >45 years. This study highlights the need to consider appropriate interaction effects in the analysis of OCP concentrations and provides further insight into the interplay of main effects on OCP concentration trends.


Introduction
Organochlorine pesticides (OCPs) are a class of lipophilic persistent organic pollutants (POPs) that exist in the environment and accumulate in human tissue [1][2][3][4]. POPs are subject to long There are more pools for children aged 5-15 years in 2006/07 as these samples were originally collected for a different study as pools for the age groups 5-9, 9-12 ad 12-15 years [29].
OCP concentrations in serum are broadly representative of a person's body burden of OCPs. There are factors which can affect an individual's body burden such as exposure to OCP through occupation or lifestyle as well as factors affecting elimination such as breastfeeding or lipid mobilization as occurs during weight gain and loss [30]. An advantage of using pooled samples is that is has the effect of diluting any high or low concentrations to estimate the 'average' OCP concentration in the pool and eliminate variability.
Further details on data collection are provided in the Supporting Information (S1 File) and a summary of the data are provided in Thomas et al. [22]. The number of pools sampled at each time for each gender and age group is summarised in Table 1 below.
Ethics approval for this study was granted by The University of Queensland (UQ) Medical Research Ethics Committee and Queensland University of Technology (QUT) Ethics Committee (Ethics Approval Number: 2013000317). As per the ethics approval, the samples were deidentified and pooled therefore no consent was required from the participants.

Boosted regression trees
Boosted regression trees (BRTs) offer a non-parametric approach to predictive modelling by combining Classification and Regression trees (CART) [31] with a boosting algorithm [32] to refine model predictions. A CART is a decision tree that determines covariate-based rules to recursively split the data into subgroups of observations with a similar response [33]. Boosting is a stage wise process whereby multiple simple models are combined to enhance predictive performance [34]. In a BRT, the loss in predictive performance is represented by a loss function. For continuous responses, changes in deviance is a commonly chosen loss function [35]. The initial regression tree has the lowest possible loss function. Using the boosting algorithm, regression trees are sequentially applied to the residuals of the previous tree until there is no further improvement in predictive performance [36]. BRTs were adopted in this paper to determine the relative importance of age group, gender and time, and their interaction, for predicting OCP concentrations. For the data described in the earlier section, BRTs were applied separately to concentration measurements for HCB, β-HCH, trans-nonachlor, p,p'-DDT and p,p'-DDE. For the first inference, BRTs were used to determine the relative importance of predictor variables and possible interactions for each OCP.
The relative importance of a predictor variable in each BRT was calculated by weighting the number of times the variable was selected for a split by the squared improvement to the model caused by the split [36]. This improvement was determined by the least squares improvement criterion described in Friedman et al. [37]. The weighted value was averaged over all trees and scaled so that the sum of the relative importance of all predictor variables added up to 100 [36]. On this scale, larger values indicated stronger influence of the predictor variable on the response. The partial dependence plot of the categorical variables on the response illustrates the marginal effect of each fitted factor; a marginal effect greater than zero indicates higher concentrations than expected and values below zero indicate lower concentrations.
To evaluate the relative importance of interaction effects, each two-way interaction between predictor variables is related to BRT predictions using a linear model [36]. The residual variance of the linear model is then used as a measure of relative strength of the interaction fitted by the BRT [36].
All BRT models were fit in R v3.3.2 using the gbm package and brt extension [35,36] with 50% of the data used for each training and test set. Results of the BRT analysis are detailed in the Results section. Important interactions identified by BRTs motivated the development of models for the second stage of the analysis, the main results of the BRT analysis are also referenced in the next section.

Hierarchical model
BRTs are well suited for modelling the influence of predictors and possible interactions but cannot be used to test differences between sub-groups or assess the effect sizes. Therefore, hierarchical models were used to infer the effects of predictors and interactions on OCP concentrations. Hierarchical linear models expand upon the linear modelling framework for the analysis of nested data structures [38].
Three hierarchical models are outlined in this section, labelled Models 1-3. Each model assumed a common response variable, y agt , defined as the observed OCP concentration for age group a (a = 1, . . ., 5), gender g (g = 1, 2) and time t (t = 1, . . ., 5). To reduce notation, each model is defined for a single OCP.
Model 1 was motivated by the interaction between age group and time, which was identified as the most influential interaction for β-HCH, Trans-nonachlor, p,p'-DDE and p,p'-DDT. This model took the following form for each y agt : where the random effect, β at , modelled the interaction between age group and time. These effects were assumed to be temporally correlated, such that β a0 , . . ., β at was the predicted temporal trend in concentrations for age group a. The choice of hyperparameter values for α g , σ and ω were chosen to indicate a lack of prior knowledge about the nature of the relationships between the response and explanatory variables. Differences between gender were not included in this hierarchical model. Hence, Model 1 specified that the changes in OCP concentrations over time were the same for males and females but began at different concentrations depending on the age group.
For HCB concentrations, the BRT analysis revealed a key interaction between age group and gender. This finding motivated a second hierarchical model to account for this interaction, labelled Model 2 and defined in Eq (2).
In Model 2, the random effect, α ag was included to account for interactions between age group and gender. An intercept term for each age group, δ a , was also included. Model 2 specifies that the changes in OCP concentrations can be described by the differences between age groups and genders and are not affected by the time.
Model 3 accounted for all interactions with age group, therefore combining BRT findings across all OCPs: where each observation is described by random effects α ag and β at and an overall variance. α ag was defined by an overall effect of each gender. The definition of β at was as per Model 1.
Note that in Model 2, α ag was described by a normal distribution with a mean of δ a and an overall variance, whereas in Model 3, it was described by a normal distribution with a mean of δ g and an overall variance. As in Model 1, the priors for β a0 , δ a and δ g were non-informative normal distributions with a mean of zero and large variance. All variances were also given non-informative priors.
The second inference focused on the interaction effects on OCP concentrations. The posterior means and 95% credible intervals of the OCP concentrations were calculated from the posterior distribution of the parameter estimates to compare interactions and visualise the temporal trends.
The third inference summarised the changes in OCP concentrations over time. The percent change and 95% credible interval of OCP concentrations between the years of sample collection was calculated from the posterior distribution of parameter estimates generated by the hierarchical model. For an overall view of the change, the percent change in OCP concentrations between 2002/03 and 2012/13 was calculated. The change between each year of sample collection was also calculated to distinguish any specific trend. To evaluate cohort effects, estimated OCP concentrations for each age group in 2002/03 were compared with their extrapolated age group in 2012/13.
The proposed hierarchical models were estimated by Markov Chain Monte Carlo (MCMC) using 'JAGS' in R version v3.3.2 [39][40][41]. Parameter estimates were based on two MCMC chains each comprising 10,000 iterations following a burn-in period of 10,000 iterations. The MCMC chains reached convergence, as assessed by the Gelman-Rubin diagnostic (S1 Fig).

Model assessment
The BRT model fit was graphically evaluated by comparing the observed OCP concentrations to the model's predicted OCP concentrations. In a model with a good fit, the difference between the observed and predicted OCP concentrations would be small. On a graph with identical axes, the plotted values would be close to or on the diagonal line; larger differences would be plotted farther from the diagonal line.
The hierarchical models were assessed with two approaches, namely the predictive intervals (PI) [42] and deviance information criterion (DIC) [43].
For each hierarchical model, the (1 − α)% predictive intervals (PI) were determined for each observation, y i . In the study, three significance levels were considered, namely α = 0.2, 0.1 and 0.05. The proportion of observed values that fell into the respective PIs was calculated. For an adequate model, approximately (1 − α) of the observations should fall in their respective intervals [42].
The DIC [43] for each hierarchical model based on data y with parameters θ has the following form, where D(θ � ) is a measure of the goodness of fit of the model as represented by the posterior mean deviance, given an estimate of θ, and pD is the complexity of model measured by an estimate of the effective number of parameters defined by the difference between the average deviance and the deviance given an estimate of the model parameters, All models were also evaluated with root mean squared error (RMSE). RMSE is a measure of the differences between observed values and predicted values from a model. This measurement of goodness of fit was calculated by Eq 7.

RMSE ¼
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 n where, in a sample of size n, y j andŷ j refer to observed and predicted values, respectively, for values j = 1, . . ., n.

Sensitivity assessment
Given the discrepancy in the number of samples for the 5-15 age group sampled in 2006/07 (Table 1), it was possible that the unbalanced nature of the data could distort the modelling results of the analysis. The sensitivity of the models with respect to the number of samples was therefore assessed as follows. Four pooled samples of the 5-15 age group in 2006/07 were randomly selected and analysed with the remaining age groups under Models 1, 2 and 3. Subsequent parameter estimates were compared with estimates obtained using the original number of samples.

Model assessment
BRTs. Model assessment demonstrated larger RMSE values for β-HCH and p,p'-DDT concentrations compared to other OCPs, indicative of larger residuals for the former ( Table 2). Graphical evaluation of the observed and predicted BRTs values indicated an acceptable goodness of fit for all OCPs (Fig 1). The predictions of β-HCH and p,p'-DDT concentrations from the BRT models were less accurate than those of HCB, Trans-nonachlor and p,p'-DDE concentrations.
Hierarchical model. The best fitting model varied between OCPs ( Table 3). The PIs obtained for Models 2 and 3 for Trans-nonachlor, p,p'-DDE and p,p'-DDT were similar. Based on this metric, Model 1 was best suited for β-HCH, Model 2 for HCB, Trans-nonachlor and p, p'-DDE, and Model 3 for p,p'-DDT. The DIC and RMSE values were smallest for Model 3 indicating that the most complex model had the best fit for all OCPs both with and without adjustment to the number of variables. The standard deviation of the posterior mean was also smallest for Model 3.

Variable importance
The relative importance of variables and factors was consistent for all OCPs while the relative contribution of interactions differed between OCPs.  HCB, β-HCH, Trans-nonachlor, p,p'-DDE and p,p'- Results of the BRT analysis (Fig 2) showed that the variable with the highest relative importance was age group followed by time and gender for all OCPs. The importance of age group in the analysis reflected the known persistence and accumulation of OCPs in the body [24,44]. The relative importance scores for gender was lowest among the predictors. The tree analysis also described the relative importance of interactions between the variables (Fig 3). Age group x gender had the largest relative importance among interaction terms for HCB concentrations; age group x time was the most important interaction term for β-HCH, Trans-nonachlor, p,p'-DDE and p,p' -DDT concentrations. These terms were later fed into the hierarchical model.
Factors that strongly influenced the concentration of HCB and p,p'-DDE included the older age groups 46-60 and >60 years, the earlier years 2002/03 and 2006/07 and gender (Fig 4). β-  , for HCB, β-HCH, Trans-nonachlor, p,p'-DDE and p,p'-DDT by 80%, 90% and 95% predictive intervals (PI), deviance information criteria (DIC), root mean squared error (RMSE) and standard deviation (σ) of the posterior mean (Post. mean) with a 95% credible interval. Percentages closer to their respective PIs and lower DIC, RMSE and σ values indicate model adequacy.  Ensemble method to analyse the effect of interactions between age, gender and time on OCP concentrations

Interaction effect on OCP concentrations
Hierarchical models demonstrated evidence of interaction effects between age group and time for β-HCH concentrations as well as age group and gender for HCB concentrations. Older age groups were also found to have higher HCB, Trans-nonchlor and p,p'-DDE concentrations than their younger counterparts.
Results of the hierarchical model are illustrated in Fig 5. There was strong evidence of an interaction effect between age group and time in β-HCH concentrations for adults aged >60 years; this age group had higher concentrations compared to children aged <17 years across all years of sample collection. The effect of an interaction between age group and gender was strongly supported in HCB concentrations for adults aged >45 years where women had higher concentrations than men; there was no evidence of the interaction for Trans-nonachlor and p.p'-DDE. The concentration of HCB in adults aged >60 years was higher than other age groups. The p,p'-DDE concentrations were higher in adults aged >60 years compared to 45-60 years and both age groups had higher concentrations than the younger age groups. The concentrations of Trans-nonachlor were higher in >60, 45-60 and 31-45 year olds compared to Ensemble method to analyse the effect of interactions between age, gender and time on OCP concentrations the younger age groups. There was no strong support for interaction effects between age group and time as well as age group and gender on p,p'-DDT concentrations.
Results of the sensitivity assessment with respect to the age group 5-15 years sampled in 2006/07 showed minimal impact on the posterior mean estimates. Therefore the discrepancy in the number of pooled samples did not distort the modelling results of the analysis. Results can be found in the Supporting Information (S2 Fig).

Changes in OCP concentrations
All age groups for β-HCH and p,p'-DDT showed a decrease in concentrations between 2002/ 03 and 2012/13 (Fig 6). There was strong evidence of a decrease in β-HCH concentration between 2002/03 and 2012/13 for adults aged 45-60 years. There were no distinct changes in β-HCH and p,p'-DDT concentrations between each consecutive time of sample collection Ensemble method to analyse the effect of interactions between age, gender and time on OCP concentrations (Fig 7). There was no strong evidence of a change in β-HCH and p,p'-DDT concentrations between an age group in 2002/03 and its extrapolated age group in 2012/13 (Fig 8). As time was not specified in Model 2, HCB, Trans-nonachlor and p,p'-DDE were excluded from this analysis. The interpreted effect of age on the concentrations of HCB, Trans-nonachlor and p, p'-DDE could also be related to birth cohort effects, taking into account the historical use of OCPs prior to 1985. Ensemble method to analyse the effect of interactions between age, gender and time on OCP concentrations

Discussion
This paper has proposed a two-stage analysis for the exploration of age group, gender and temporal effects, and their interaction, on OCP concentrations. Using BRTs and hierarchical Ensemble method to analyse the effect of interactions between age, gender and time on OCP concentrations modelling, three statistical inferences were presented, using data on serum OCP concentrations. These inferences focused on the relative importance of variables and interactions, the effects of interactions on OCP concentrations and the change in OCP concentrations over time. The inferences highlighted strong evidence for an interaction effect between age group and time for β-HCH concentrations in adults aged >60 years as well as age group and gender for HCB concentrations in adults aged >45 years.
The application of BRTs for identifying key interactions was motivated by its underlying non-parametric approach that allowed the tree to model non-linear associations as well as the model's robustness to outliers and missing data [32,36,45]. Information on the relative importance of all possible interactions, without needing to specify these terms prior to model implementation, is a key benefit of BRTs. Important interaction terms from the BRT were utilised in a hierarchical model which allowed multiple levels of the data to be included in a single model [46].
Previous analyses of OCP concentrations over multiple years have demonstrated a decline in OCP concentrations over time [18,19,47,48]. A major cause of the declining trend has been attributed to the ban of the commercial or agricultural use of OCPs in various countries [18,19].
In this paper, there was a similar negative trend for both β-HCH and p,p'-DDT concentrations, but the analysis showed no strong evidence that there was a substantive decrease in these OCP concentrations. This is not unexpected, since the study period spans years, long after major historical emissions [8] and the smaller ongoing emissions during the study period may result in lower and more diffuse exposure in humans, based on latent pathways into human serum [18,49,50].
In studies where OCP concentrations were sampled from various age groups, lower concentration of chemicals have been found in younger age groups [27,[51][52][53]. The same trend was observed in this study with the two-stage analysis providing strong evidence for larger OCP concentrations in older age groups. The difference in concentrations between age groups has been attributed to the implementation of bans on OCPs; older age groups had a longer exposure history to OCPs as they had lived through the agricultural use of the chemicals and hence had accumulated higher concentrations [54,55] compared to their younger counterparts.
A number of studies had also found differences in OCP concentrations between genders; some studies found higher concentrations in women [26,56,57] while other studies found higher concentrations in men [27,58,59]. In the current study, the analysis showed higher concentrations of HCB, β-HCH, p,p'-DDE and p,p'-DDT in women, and strong evidence of this in HCB concentrations. Hypotheses on why there is a difference in concentrations between genders differ depending on age of the females. For parous (corresponding to our 15-30 and 31-45 year olds) females there is excretion due to placental transfer and breastfeeding [60]. In older females, this factor as discussed by Salihovic et al. (2012) is no longer relevant as the time since breastfeeding has long since past. While Salihovic et al. [59] suggest that females have higher body fat than males, Porta et al. [26] investigated OCPs and BMI and could not explain the higher concentrations in older females compared to males. Further work is required on gender differences in regard to OCP concentrations if there are likely to be any clinical repercussions.
Further to the main effects of age group, gender and time, most studies on OCP have not analysed the effect of interactions on the concentrations [21]. Thomas et al. [22] found that the interaction between age group and time had a significant effect on the HCB concentration and was best implemented for modelling β-HCH and p,p'-DDT. Thomas et al. [22] also found a significant interaction effect between age group and gender on p,p'-DDT concentrations. In this study, HCB, Trans-nonachlor and p,p'-DDE concentrations were best modelled by the interaction between age group and gender, without a temporal effect. This suggested that there was negligible change in OCP concentrations over the times of sample collection that were studied. This is possibly related to the indirect pathways of HCB, Trans-nonachlor and p,p'-DDE [18,49,50] into human serum long after its historical use [8]. Gender did not have a strong effect on the β-HCH concentrations which were modelled as a function of the interaction between age group and time. p,p'-DDT concentrations were modelled by interactions between age group and gender as well age group and time demonstrating the effect of age group, gender and time on the change in concentrations.

Limitations
Results did not provide a complete overview of temporal trends across all age groups as no data was collected for the years 2002/03 for 0-4 year olds, and 2004/05 for all age groups.
The analysis was unable to account for confounding factors such as metabolism rate, diet, place of residence and weight loss due to the pooled sampling method adopted to collect data.

Extensions
The hierarchical model utilised in this paper could be further extended to incorporate prior knowledge, elicited from experts or meta-analyses of published studies. It may be more appropriate for prior information to come from experts in the field as there are no standardised methods for laboratory analysis, detection limits, number of people sampled, age ranges, years sampled, number of years sampled and whether samples were pooled to lower laboratory costs [26,61]. Prior information can be included as information on the coefficient values for the interactions between age group and gender, and age group and time.
In the current study, each OCP was modelled individually. This is useful as each chemical has a different history of production and use. However, OCPs could be modelled in a multivariate analysis that accounts for any positive correlations between OCPs such as that between p, p'-DDT and p,p'-DDE.
The model can also be applied to other POPs such as perfluorinated chemicals and polybrominated diphenyl ethers in order to investigate the effects of age group, gender and time as well as any interactions between these variables.