Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of prediction model parsimony in integrated empirical geographic regression

National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants during 1979–2015; (2) explore systematically the impact on model performance of the number of variables selected for inclusion in a model; and (3) provide publicly available model predictions. We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979–2015. We also use ~350 geographic characteristics at each location including measures of traffic, land use, land cover, and satellite-based estimates of air pollution. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of geographic variables. For all pollutants and years, we compare three approaches for choosing variables to include in the PLS model: (1) no variables, (2) a limited number of variables selected from the full set by forward selection, and (3) all variables. We evaluate model performance using 10-fold cross-validation (CV) using conventional and spatially-clustered test data. Models using 3 to 30 variables selected from the full set generally have the best performance across all pollutants and years (median R2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no (0.37 [0]) or all variables (0.64 [0.27]). Concentration estimates for all Census Blocks reveal generally decreasing concentrations over several decades with local heterogeneity. Our findings suggest that national prediction models can be built by empirically selecting only a small number of important variables to provide robust concentration estimates. Model estimates are freely available online.

Introduction Regulatory monitors and project-based monitoring campaigns typically provide air pollution measurements that are limited in space and time. Empirical models are a cost-effective approach to estimate fine-scale exposures to air pollution. Recent population-level studies of air pollution have relied on empirical models to estimate long-term concentrations of outdoor air pollution based largely on observation-driven geostatistical approaches [1][2][3] or hybrid approaches that incorporate satellite-based observations of air quality and theory-based mechanistic models with geostatistical approaches [4][5][6][7]. These model predictions are used to assess population-level characteristics of air pollution, such as health effects [8][9][10][11], the burden of disease [12,13], and exposure disparities [14,15].
Many empirical models of air pollution are developed using a large suite of input data often including hundreds of geographic covariates (e.g., traffic, population, land use) with the goal of predicting concentrations at locations lacking monitoring data [16]. More recently, studies have included estimates of air pollution from mechanistic models [17,18] and satellite-based air pollution measurements such as tropospheric nitrogen dioxide (NO 2 ) column abundance and Aerosol Optical Depth (AOD) [19,20]. These regional air pollution estimates are particularly useful for national-or global-scale prediction where air pollution measurements are sparse over large areas [4,7,[21][22][23][24][25]. To incorporate and prioritize information from the many hundreds of predictor variables, studies typically employ conventional statistical techniques such as variable selection, shrinkage, and dimensional-reduction [1,26,27]; more recently some studies have applied machine learning techniques such as neural network [4,28].
Computational demands of applying models that include hundreds of variables are large, especially for the models aiming for extended prediction areas such as national scale. Yet, there is little guidance in the literature regarding the added benefit of using hundreds of variables versus parsimonious models based on a few or a couple dozen empirically-selected variables. Furthermore, although some national-scale models exist for specific years and pollutants for particulate matter less than or equal to 2.5 or 10 microns in diameter (PM 2.5 or PM 10 ), NO 2 , or ozone [25,27,29], empirical models developed under a unified framework do not currently exist for most criteria pollutants across all years with regulatory monitoring data in the U.S. This article aims to address both of those gaps. Specifically, we develop, test, and compare full versus parsimonious national models that predict annual average concentrations of six criteria pollutants and for all years with available monitoring data during 1979-2015. We test the hypothesis that model performance is better with more variables than with a smaller number of empirically selected variables from the full dataset. We compare the performance from the models using different numbers of variables within an identical modeling framework to focus on the comparison across different sizes of subsets. Then, we select the best-performing models to generate concentration estimates for all residential Census Block centroids in the contiguous U.S. for all modeling years with the goal of making our model predictions available freely online.
We refer to our models as "Integrated Empirical Geographic" (IEG) regression models to indicate key characteristics of the model and to be effectively acknowledgeable in other planned analyses: "integrated" because they include many datasets (land use, satellite-derived measures of air pollution, and emission estimates); "empirical" because the relationship derived is empirical (rather than based on theory [physics, chemistry]) and because the model is based on measured concentrations; and, "geographic" because the model is based on geography and geographic variables, and also because it includes kriging, a geostatistical method for spatial prediction.

Regulatory monitoring data for criteria pollutants
We downloaded measurements of six criteria pollutants including PM 10 , PM 2.5 , NO 2 , sulfur dioxide (SO 2 ), carbon monoxide (CO), and ozone (O 3 ) at all Air Quality System (AQS) monitoring sites for all available years from 1979 through 2015 via the U.S. Environmental Protection Agency (EPA) AQS data repository (https://www.epa.gov/outdoor-air-quality-data) (S1 Fig). Criteria pollutants are a list of air pollutants that are known as their harmful effects on health, and monitored and managed on a national level to achieve the compliance with the air quality standards. Whereas gaseous pollutants such as NO 2 , SO 2 , CO, and O 3 are measured every hour, PM is collected on the daily basis. NO 2 , SO 2 , and ozone are available for the entire period ; CO, PM 10 , and PM 2.5 are available starting in 1990, 1988, and 1999, respectively. For PM 10 and PM 2.5 , we used data from the Federal Reference Method (FRM) and Integrated Monitoring of Protected Visual Environments (IMProVE) networks (http:// vista.cira.colostate.edu/Improve/).
We computed annual averages for all pollutants (except ozone) at sites that meet our inclusion criteria, as follows. We computed 24-hour averages for monitors with 18 or more valid hourly measurements in that day, and then computed annual averages at sites with a minimum number of operating days (244 days for daily/hourly measurements, 61 days for 1-in-3 day measurements, and 41 days for 1-in-6 day measurements) during a year and no more than 45 consecutive days without a measurement. For ozone, we computed the daily maximum of the 8-hour moving average from hourly measurements for monitors with 18 or more operating hours during the day and computed an ozone season average from May through September. We selected these summer-season averages of daily 8-hour maximum for ozone because ozone production is predominant in summer through photochemical reaction catalyzed by heat and sunlight and it is likely that its health effect is mostly affected by summer time ozone concentrations. The IEG regression modeling was done after applying square root transformation to all pollutant concentrations to meet the normality assumption.

Geographic variables
We considered > 900 geographic variables as independent variables for our IEG models, in eleven categories: traffic, population, urban land-use or land-cover, rural land-use or landcover, elevation, vegetation, imperviousness, industrial emissions, position, source, and satellite air pollution estimates (S1 Table). S2 Fig shows the diagram of the procedures including data preprocessing and variable computation (detailed information is also available in https:// www.uwchscc.org/MESAAP/Documents/MESAAirDOOP.pdf ). To reflect changes of land use characteristics over time, we obtained the two types of land use variables from groundbased datasets generated in 1970s and 1980s, and satellite and aerial imagery in 2006. The variables were computed as summaries within buffer areas between 50 meters and 15 kilometers (0.05, 0.1, 0.15, 0.3, 0.4, 0.5, 0.75, 1, 1.5, 3, 5, 10, and 15 km) that were applied differently by the variables depending on their local or regional impacts on air pollution (S1 Table). From > 900 variables, we excluded variables with little spatial variability (e.g., same values at the 10 th and 90 th percentiles) or few unique values. This exclusion resulted in reduction of the number of variables to an average of~350 as the full set of variables for a given pollutant and year (S3 Fig).
Traffic variables are distance to the nearest road and sum of road lengths within eleven circular buffers (0.05, 0.1, 0.15, 0.3, 0.4, 0.5, 0.75, 1, 1.5, 3, and 5 km) based on TeleAtlas data (http://www.teleatlas.com/OurProducts/MapData/Dynamap/index.htm). Population variables are the number of people in eight circular buffers (0.5, 0.75, 1, 1.5, 3, 5, 10, and 15 km), based on year-2000 U.S. Census population (http://arcdata.esri.com/data/tiger2000/tiger_download. cfm). Land use variables in 1970s and 1980s are percent of areas in circular buffers for various land use characteristics such as residential, industrial, commercial, and agricultural land use identified by the U.S. Geological Survey (http://water.usgs.gov/GIS/dsdl/ds240/index.html). Land cover variables based on satellite imagery in 2006 are percent of areas in circular buffers for land use characteristics such as developed high and low density obtained from the Multi-Resolution Land Cover Characteristics (MRLC) Consortium (http://www.mrlc.gov/index. php). Elevation is the absolute elevation measurement at a given location and relative elevation compared to elevation at grid points in a circular buffer area, calculated from national elevation dataset based on satellite imagery (http://nationalmap.gov/elevation.htm). Vegetation variables are normalized difference vegetation index computed from satellite imagery (http://glcf. umd.edu/data/ndvi/) computed in circular buffer areas. Emission variables are the total amount emission estimates in circular buffer areas based on national emission inventory data (http://www.epa.gov/ttn/chief/net/2002inventory.html).

Modeling approach
Our approach builds on a universal kriging framework, as described in our previous studies [25,27,35], that partitions annual average concentrations into two components [36]: variance and mean. The variance component is modeled with variogram using exponential covariance function and three covariance parameters: range (the distance at which spatial correlation exists), partial sill (spatial variability), and nugget (non-spatial variability) (see the S1 File). The mean component includes a few dimension-reduced summary predictors estimated using partial least squares (PLS) from the geographic variables offered. We estimated two and three PLS predictors from conventional and clustered cross-validation, respectively, (see the next section, "Model Evaluation") based on the previous study that showed the best model performance using two to three PLS predictors in the same modeling approach for the NO 2 national model in U.S. [25]. The mean component is equivalent to the linear regression model often referred to as land use regression (LUR) with PLS data-reduction. Whereas other dimension reduction approaches such as principal component analysis solely rely on correlation of covariates, PLS predictors are estimated based on the correlation between covariates and the outcome; PLS was adopted in several previous prediction models [2,3,18,25,27,35]. The summary predictors not only incorporate various geographic characteristics, they also avoid producing extreme predictions [37]. All regression parameters of PLS predictors and covariance parameters were estimated by maximum likelihood method. The model was applied by each pollutant and each year. We implemented all IEG models in R (ver. 3.5.1; R Development Core Team, Vienna, Austria, https://www.r-project.org/).
To investigate the role of model parsimony, we empirically selected via forward selection a specific number of variables from the full set of~350 to offer the PLS; we investigated how model performance varies depending on the number of variables selected. The number of variables selected ranges from zero (i.e., no variables-this is ordinary kriging that assumes a constant mean without any variables) to the full covariate database, with several intermediate values up to approximately one third of the all variables (3-, 5-, 7-, 10-, 13-, 16-, 20-, 25-, 30-, 60-, 90-, and 120-variable models). For example, the 20-variable model would involve forward selection to select the best 20 variables from the full set, followed by PLS data-reduction to identify two or three PLS components comprised of those 20 variables, and regression modeling based on those two or three PLS components. PLS has been shown to be a useful tool for variable selection in previous studies where the set of variables being considered is limited and there is a clear understanding of their scientific importance [38]. However, we restricted our application of PLS to estimation of summary predictors from selected subsets of variables because we had no a priori reason to limit the full set of~350 variables.
We hypothesized that adding more variables will always improve the model somewhat, but perhaps by diminishing amounts as more variables are added. In that case, there may be a "point of diminishing returns": a number of variables for which adding more variables yields little additional benefit.

Model evaluation
We evaluated models using two types of 10-fold cross-validation (CV): conventional and spatially clustered [25]. Whereas conventional CV randomly generates groups of sites as training and test data sets, spatially clustered CV is based on those groups constructed as specific spatial clusters. For conventional CV, we randomly divided all monitoring sites into 10 groups. Then, we selected one group as the hold-out sites, developed models using the remaining data, and predicted air pollution concentrations at hold-out sites. This process was repeated separately for each of the 10 groups to create a pseudo-independent test data set. Spatially clustered CV is similar except that the 10 groups are spatial clusters identified using k-means clustering (S4 Fig) [25]. Conventional CV reflects model performance at a random location, whereas clustered CV reflects model performance far from a monitor. For dense monitoring networks, such as PM 2.5 in the U.S., conventional CV may be more representative of model performance where most people live.
CV statistics include root-mean-square error (RMSE) and the MSE-based R-squared statistic (R 2 ). The MSE-based R 2 is calculated as 1 minus the ratio of MSE to data variability, whereas a conventional R 2 is calculated as the squared correlation coefficient. Conventional R 2 assesses agreement between predictions and observations about the regression line; MSEbased R 2 instead assesses agreement about the 1:1 line [2,37]. To allow for comparison across different pollutants, we also computed standardized RMSE (i.e., RMSE divided by the mean concentration across all sites). For each pollutant and year, the "best" and "worst" models are identified based on R 2 and standardized RMSE from both conventional and clustered CV.

Sensitivity analyses
As described next, we conducted sensitivity analyses to investigate the contribution of a specific category of geographic variables, the impact of model evaluation choices, the performance of an alternative modeling approach, and the impact of different pollutant metrics.
To investigate whether a specific category of geographic variables gives higher contribution than others to prediction, we developed the model by separately excluding each category of variables such as traffic, land use, and satellite air pollution estimates (see above and S1 Table) in turn. Then, we investigated which category shows the largest decline in model performance when excluded.
To shed light on whether the selected best and worst models are sensitive to the type of CV approach, we used the best models chosen by one CV and recomputed CV statistics by the other CV (i.e., re-compute conventional CV for the best models chosen by clustered CV, and re-compute clustered CV for the best models chosen by conventional CV). To assess the impact of forward selection during model-building, we replaced forward selection with random selection of the same number of variables and compared model performance. To more completely evaluate our models by addressing model selection as well as estimation of regression and covariance parameters in universal kriging, we constructed our CV to include forward selection and estimation of PLS predictors in addition to parameter estimation as a more complete model evaluation. In addition, we performed an additional CV in which we take out all sites within a certain radius of a buffer instead of one site in conventional CV [39]. This conventional buffer-out CV intends to avoid possible overestimation of model performance resulting from monitoring data collected at the neighboring sites and correlated with those at hold-out sites. For buffer size, we used 50, 100, 200, and 300 km radii based on our investigation of histogram and variogram for monitoring data. As an alternative modeling approach, we applied least absolute shrinkage and selection operator (lasso) [40] to select subsets of variables for PLS and compared the performance to that of our original approach using forward selection. We applied the sensitivity analyses for categories of geographic variables, model evaluation, and modeling approaches to limited examples: two pollutants for NO 2 and PM 2.5 , and one year in 2000.
Lastly, we tested the robustness of ozone models to other ozone averaging approaches: annual and summer season (May-September) summaries of ozone using 24-hour means, 8-hour means, and 1-hour maximum.

Prediction
Using the best models for each pollutant and year, we predicted annual average concentrations for the~7 million residential Census Block centroids in the contiguous U.S. with nonzero population. Then, we computed population-weighted averages at various geographic scales (Census Block Groups, Census Tracts, Counties, States, and contiguous U.S.) based on 2010 Census boundaries, and explored the national distribution of pollutant concentrations over space and time.

Summary of monitoring data
Means and standard deviations of annual average concentrations at AQS monitoring sites decrease over time for all pollutants (S3 Table, Fig 1). During 1980 to 2010, average concentrations decrease almost 6-fold for SO 2 (from 12.7 to 2.2 ppb) but only 14% for ozone (from 52.0 to 45.8 ppb). For ozone, the 10 th percentile concentration decreases less than 2% over 30 years (from 7.8 to 37.2 ppb). From 2000 to 2010, reductions for PM 2.5 and PM 10 are 39% and 28%, respectively.

IEG model performance by number of variables
Different from our hypothesis, adding more variables does not consistently improve model performance, especially for clustered CV (S5 Fig). For all pollutants and for both CV approaches, models using 3-30 variables generally show higher R 2 and lower standardized RMSE than models using no or all variables (Table 1, Fig 2 and S6 Fig).
The no-variable (i.e., ordinary kriging) models generally perform worst (S7 Fig). Selecting best-performing models generally is consistent among metrics (MSE-R 2 , standardized RMSE), and is typically robust to using the two types of CV (S8 Fig). S4 Table shows the medians of three estimated covariance parameters across 10 CV groups and the contribution of PLS predictors to CV-ed predictions by the models including different numbers of variables for NO 2 and PM 2.5 in 2000. The contribution of PLS predictors to CV-ed predictions was computed as the proportion of CV-ed predictions by PLS predictors to CV-ed predictions by PLS predictors and kriging. For NO 2 , median covariance parameters are similar between some and all variable models but notably different from those in no-variable models particularly for the range parameter. However, parameters for PM 2.5 were similar between some variable models compared to no or all variable models. Overall, the contribution of PLS predictors to CV-ed predictions is dominant with small contributions of spatial variability as shown in small partial sill. These patterns were consistent between some-variable and all-variable models for both pollutants.

IEG model performance by CV
The patterns of good model performance with small numbers of variables are generally similar between the two CV approaches. However, CV results consistently indicate better model performance using conventional CV than using clustered CV (Table 1

IEG model performance by pollutant
Parsimonious models for PM 2.5 and NO 2 show generally good performance using conventional CV: median R 2 (standardized RMSE) of the best models are 0.86 (0.13) for PM 2.5 and 0.87 (0.21) for NO 2 (Table 1, Fig 2). Analogous results using clustered CV are 0.65 (0.20) for PM 2.5 , and 0.80 (0.24) for NO 2 . For NO 2 , differences in model performance between "best" and "no variable" models are large particularly for clustered CV (median R 2 for the best/novariable model: 0.87/0.61 [conventional CV] and 0.80/0.00 [clustered CV]). That finding indicates the substantial benefit of having variables in the model when there are no monitors nearby and indicates that the ordinary kriging NO 2 model including no variables offers nearly zero information far from monitors. In contrast, for SO 2 , ozone, and PM 10 , differences between "best" and "no variable" models are modest for conventional CV (median R 2 for best/ no-variable models: 0.59/0.57 [SO 2 ], 0.75/0.72 [ozone], 0.59/0.49 [PM 10 ]) (Fig 2 and S6 Fig) Fig). Overall, for both CV approaches, NO 2 and PM 2.5 yield better models than other pollutants (Fig 3). Over time, model performance tends to improve for ozone and PM 10 , decline for SO 2 and CO, and remain relatively unchanged for PM 2.5 and NO 2 .

Selected variables
Investigation of covariates by category chosen via forward selection (Fig 4 and S9 Fig)

Sensitivity analyses
In our sensitivity analysis of re-computing CV statistics using conventional or clustered CV for the best and worst models selected based on the other CV approach, both CV approaches mostly give the identical best and worst models. The three sensitivity analyses conducted on NO 2 and PM 2.5 for 2000 indicate the following. First, model performance is highly degraded when satellite variables are not included (S10 Fig), especially for clustered CV. The inclusion of land use variables becomes more important for models with larger numbers of variables. Second, when variable selection is random rather than via forward selection, model performance is noticeably reduced (S11 Fig). The performance gradually improves as more variables are added. However, even with random selection of variables, the improvement in performance for models with all variables relative to models with~30 variables is small when using conventional CV. Thus, we find that even using a subset of randomly selected variables can yield models that are comparable to the "all variable" models. Third, when we shift the CV procedure to make it broader to include the entire modelbuilding endeavor including forward selection rather than PLS and universal kriging, clustered CV results for NO 2 show consistent patterns as with the core results of better performance with subsets of variables than all variables (S12 Fig). With clustered CV for PM 2.5 and conventional CV for NO 2 and PM 2.5 , expanding the aspects of modeling included in the CV procedure reduces the difference in model performance between "some variables" models and "all variables" models. CV statistics from conventional buffer-out CV give the similar pattern of better model performance using 3-30 variables but with mid-range CV statistics between National prediction of concentrations of criteria pollutants using a parsimonious approach conventional and clustered CVs (S13 Fig). In the comparison to the models using lasso for PLS, we found similar model performance with our approach using forward selection (S5 Table). Sensitivity analyses involving alternative metrics of ozone concentration reveal that our original approach using 8-hour moving averages during the summer season shows the best performance (S6 Table).

Model application
Predicted annual-average concentrations throughout the U.S. (Fig 5, and S14 and S15 Figs), generated using "best" models, reflect decreasing concentrations over 10-30 years depending on the pollutant. The extent of temporal change and the spatial patterns vary by pollutant. Population-weighted averages of annual average concentrations for Census Block Groups based on the predictions at Census Block centroids show similar means and narrow variability compared to those at monitoring sites (Table 2). Predicted concentrations and their uncertainties National prediction of concentrations of criteria pollutants using a parsimonious approach for all Block Groups, Tracts, Counties, and States in the contiguous U.S. are publicly and freely available online at https://www.caces.us.

Discussion
We built and tested IEG models for six pollutants for all years with national monitoring data during 1979-2015 in the contiguous U.S.; predictions for "best-performing" models are publicly available online. We explore systematically the possibility of building parsimonious prediction models: how model performance changes when models are built by empirically selecting more or fewer variables from a large set to be incorporated in PLS dimension reduction.
Our findings indicate that models based on empirically selecting subsets of variables outperform or perform as well as "all variable" models: we find good model performance using a relatively small numbers of variables between 3 and 30 variables across pollutants and years. These findings suggest the good prediction ability of parsimonious prediction models when applied to prediction at unmeasured locations. In addition, satellite-derived estimates of air pollution and of land use were commonly-selected variables in the IEG models, indicating the importance of including these estimates to parsimonious models.
An important motivator for this research question is the considerable effort and computational intensity of tabulating hundreds of geographic variables at all prediction locations; that effort and computational intensity is a barrier to widespread usage of national IEG models. For example, in the U.S., the number of prediction locations could approach several thousand of people's homes in a project-based cohort and more than million Block centroids in Census.

Pollutant
Year National prediction of concentrations of criteria pollutants using a parsimonious approach This limitation impacts the feasibility of subsequent analyses in epidemiology, exposure assessment, environmental justice, and other fields. As the spatial domain for air pollution exposure models and health analyses is expanded to national or global scales [4,7,21,22,23,25], data and processing requirements will grow as additional input data are needed to improve prediction ability. Our approach reveals which predictive variables are most important for generating parsimonious prediction models that outperform or perform as well as all-variable models. Model performance varied by pollutant, with better performance for PM 2.5 , NO 2 , and ozone than for CO, SO 2 , and PM 10 . All models benefited from introducing at least a small number of geographic covariates. Model performance is similar for ordinary kriging (with no variables) and IEG "best models" (mostly including 3-30 variables) for SO 2 and to some extent for ozone when using conventional CV; using clustered CV, none of the pollutants exhibit similar performance between the "best" IEG and ordinary kriging models (although the gap is smallest for ozone). In general, ordinary kriging models deliver zero or near-zero value with clustered CV. This is sensible since kriging cannot predict to areas not incorporated in the model, as is required in the clustered CV.
Differences in model performance may reflect differences in chemistry and physics of the pollutants, spatial patterns of emissions, quality of input data, correlation with land uses, availability of relevant satellite data, and/or a design of monitoring networks (number of monitors and their placement). For example, the gap between ordinary kriging and "best" IEG is larger for NO 2 than for PM 2.5 , reflecting spatial patterns are more homogeneous for PM 2.5 than for NO 2 ; and, the number of monitors is~3× larger for PM 2.5 than NO 2 . The extant monitoring network is designed for regulatory purposes: mainly, to test for compliance with National Ambient Air Quality Standards (NAAQS). As the use of IEG models grows, EPA or others could consider utility to IEG models (e.g., monitoring in locations with a variety of land uses) as an additional goal.
Some of our findings could have been sensitive to our methodological choices. However, results of our sensitivity analyses support our main findings. The slightly worse performance of the models using all variables as compared to models using some variables could potentially be driven by the fact that we treat the selected variables as fixed and do not include the selection process in our model evaluation. However, when we additionally include forward selection in our evaluation in one of our sensitivity analyses, the results are consistent with the original findings at least in clustered CV. Conventional CV shows similar performance across models using small versus all sets of variables. When we apply random selection instead of forward selection, more variables are needed; models including 50-150 variables, as opposed to 3-30 with forward selection, have performance similar to the all variable models. However, the minimal change in R 2 suggests that some variable models still perform as adequately as the all variable models. In addition, good performance with reduced numbers of variables, consistently shown both in conventional and clustered CV, also supports good prediction ability of using subsets of variables, indicating the possibility of applying a parsimonious prediction approach to predict in areas without monitors. This finding also highlights the importance of clustered CV when evaluating observation-driven models. The relatively poor performance of clustered CV for some pollutants also suggests investigators should be careful in their application of prediction models to locally-affected pollutants when the monitoring network is spatially sparse.
Our results highlight the importance of satellite-derived air pollution data for IEG [20]; satellite data were selected as one or more of the top five variables consistently across all pollutants and years. The most commonly selected satellite estimates were satellite PM 2.5 derived from AOD data for PM 2.5 models and HCHO for ozone models. Considering IEG model performance when a category of variables is excluded, the performance decline was greatest excluding satellite data in comparison to excluding other covariate data, especially with clustered CV.
A common concern for IEG models such as those generated here is whether regulatory monitoring data can allow our model to provide accurate predictions for people. The distribution of values for geographic variables such as land use characteristics might differ at monitoring locations relative to prediction locations where people live [41,42]. Monitors may be located in areas where few people live and may not be able to represent people's exposures. When we compared the distribution of geographic variables between all available regulatory monitoring sites and Census Block centroids, for 95% and 98% of~350 variables, the standard deviation for Census Block centroids is 2.5 and 5 times standard deviation for monitoring sites. This finding indicates that the range of values at Block centroids exceeds the range at monitoring sites, suggesting that there could be some spatial misalignment between monitoring and prediction locations in our work. However, because our models use estimated PLS predictors instead of direct measures of variables, extreme values of a few variables are less likely to impact model predictions [37].
Although we focus on investigating the impact of model parsimony on model performance within the IEG framework instead of between IEG and other modeling approaches, our IEG models provide consistent model performance when compared to previous studies. The IEG models using some to all geographic variables for 2000-2010 shows cross-validated R 2 s of 0.84-0.86 for PM 2.5 and 0.81-0.87 for NO 2 , which are similar to estimates from other studies in the contiguous U.S. for the overlapping periods: 0.78-0.88 for PM 2.5 in neural network and 0.78-0.82 for NO 2 in land use regression [4,21,23]. We did not use direct measures of variables. In a previous study [37], using selected variables alone gave extreme predictions at a few subject locations although the model performance based on monitoring data was similar. These extreme predictions occurred because of widely different geographic characteristics at subject homes from those at monitoring sites. This earlier finding suggested a caution when we use a few selected geographic variables instead of summary predictors estimated by dimension reduction methods. In the current study (and as would be expected in most studies), the distribution of values for geographic variables at the census tract centroids is generally wider than that of monitoring sites, indicating the possibility of producing inaccurate health effect estimates when direct measures of variables are used. As another alternative approach, our future work will compare this IEG model and a machine learning approach based on the same input data.
The incorporation of variable selection into PLS regression was widely applied in previous studies [38] but rarely recognized in environmental epidemiology. For example, genetic studies used these approaches to identify the list of genetic characteristics related to health outcomes from numerous genetic information. These combined approaches of variable selection to PLS select the variables based on PLS properties such as PLS loadings and scores, or iterative procedures between variable selection and PLS model fitting. Although there is a similarity with respect to combining variable selection and PLS between those approaches and ours, there are two major distinctions. First, the combined approaches test subsets of variables in PLS based on good understanding of model parsimony. In genetic analyses where most genetic variables are irrelevant to outcome such as disease status, there is a strong consensus that adding noisy variables impairs model performance. In contrast, hundreds of geographic variables computed for air pollution prediction are considered as potential pollution sources and being related to air pollution, and previous studies have expanded lists of variables. Secondly, based on little knowledge of benefit of including a large set of geographic variables, our aim of using the combined approach was the investigation of model parsimony for prediction rather than the selection of a small subset of variables. Our finding of better model performance using a small subset of variables was not previously investigated in air pollution studies.
Our study has several limitations to motivate future research. We consider only spatial aspects of IEG models and use many temporally-fixed geographic variables (exceptions include satellite-derived estimates of air pollution concentrations, and land use variables for the 1970s and 2006). Future studies could also add variables that represent geographic characteristics changing over time. In addition, temporal correlation over years also can contribute to prediction when we develop a prediction model in a spatio-temporal framework. Future work could build national, publicly available models with finer temporal resolution than here (i.e., better than annual-averages) and could test model parsimony with respect to temporal models or spatiotemporal models. As alternative modeling approaches, non-linear PLS and additive models could be considered. Other modeling approaches particularly for various machine learning approaches applied in recent studies should be investigated [4,29]. Satellite air pollution estimates employed here for NO 2 , SO 2 , and HCHO are tropospheric column abundance, rather than ground-level estimates. Previous studies have shown that NO 2 model improvements from satellite-derived estimates of air pollution are similar for column-total as for ground-level estimates [21,43]; future work could test that finding for SO 2 and HCHO. The present research employed emission estimates, which are an input into chemical transport models (CTMs), and prior research has included CTM as an input to IEG model-building [4,44]. Future research could test the role of model parsimony in IEGs that incorporate CTM output. Future IEG models could also potentially include national datasets on traffic volumes, vehicle fleet composition, enhanced urban form estimates from Landsat imagery or point-of-interest data, and recentlylaunched satellites. We hypothesize that such datasets would improve IEG model performance, though recognizing that because the IEG models already have many inputs (including satellitebased estimates of air pollution concentrations), new datasets may or may not improve model performance appreciably. For the variables characterized as summaries within circular buffers of several distances, transport network distance can be an alternative approach for the current Euclidian distance. Lastly, we found that the performance of the models with the randomly selected subsets of variables, although we need more variables than in forward selection, was as good as the models using the full set. This may indicate high correlation between many variables. Future studies need further investigation to confirm this finding.
In summary, this study provides important findings on cost-effective approaches for national-scale air pollution prediction. Results indicate that national IEG model performance can be similar or better, depending on the pollutant and evaluation approach, if built on only a small number of empirically selected covariates from hundreds, relative to models built using all of those variables. This finding suggests good applicability of parsimonious models to predicting at any locations in a country. Our model predictions for the contiguous U.S. are freely available online, at https://www.caces.us.