## Abstract

To increase power and minimize bias in statistical analyses, quantitative outcomes are often adjusted for precision and confounding variables using standard regression approaches. The outcome is modeled as a linear function of the precision variables and confounders; however, for many complex phenotypes, the assumptions of the linear regression models are not always met. As an alternative, we used neural networks for the modeling of complex phenotypes and covariate adjustments. We compared the prediction accuracy of the neural network models to that of classical approaches based on linear regression. Using data from the UK Biobank, COPDGene study, and Childhood Asthma Management Program (CAMP), we examined the features of neural networks in this context and compared them with traditional regression approaches for prediction of three outcomes: forced expiratory volume in one second (FEV_{1}), age at smoking cessation, and log transformation of age at smoking cessation (due to age at smoking cessation being right-skewed). We used mean squared error to compare neural network and regression models, and found the models performed similarly unless the observed distribution of the phenotype was skewed, in which case the neural network had smaller mean squared error. Our results suggest neural network models have an advantage over standard regression approaches when the phenotypic distribution is skewed. However, when the distribution is not skewed, the approaches performed similarly. Our findings are relevant to studies that analyze phenotypes that are skewed by nature or where the phenotype of interest is skewed as a result of the ascertainment condition.

**Citation:** Voorhies K, Bie R, Hokanson JE, Weiss ST, Chen Wu A, Hecker J, et al. (2022) Covariate adjustment of spirometric and smoking phenotypes: The potential of neural network models. PLoS ONE 17(5): e0266752. https://doi.org/10.1371/journal.pone.0266752

**Editor:** So Young Ryu, University of Nevada Reno, UNITED STATES

**Received:** October 5, 2021; **Accepted:** March 27, 2022; **Published:** May 11, 2022

**Copyright:** © 2022 Voorhies et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** Data are publicly available for the COPDGene study (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000179.v1.p1) and UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access).

**Funding:** This research was funded by National Heart, Lung, & Blood Institute grant numbers K01HL125858, U01HL089897, U01HL089856, and P01HL132825, the Eunice Kennedy Shriver National Institute of Child Health and Human Development grant number R01HD085993, and the National Institute of Mental Health grant number R01MH129337. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** Michael H. Cho has received grant funding from GSK and Bayer, and consulting or speaking fees from AstraZeneca, Illumina, and Genentech. Edwin Silverman has received grant funding from Bayer and GSK. Ann Chen Wu has received grant funding from GSK. Dawn L. Demeo has received grant funding from Bayer and honoraria from Novartis.

## Introduction

In epidemiological studies of respiratory diseases and smoking phenotypes, prediction models are often fit using standard linear regression. However, a linear regression model assumes there is a linear relationship between the mean of the phenotype and the covariates. While this might be a reasonable assumption for some parts of the phenotypic range, it is questionable whether linearity holds in the tails of the distribution, especially when diseased populations are analyzed and the majority of study subjects have phenotypic values that are in the tails of the distribution.

Neural networks, a well-developed deep learning approach [1], can describe non-linear relationships between predictors and outcomes and often achieve more accurate predictions than models based on linear regression, making them potentially useful for predicting complex respiratory phenotypes and smoking traits. Two important tasks in epidemiology are hypothesis testing and prediction. Hypothesis testing focuses on whether a variable X is associated with an outcome Y, and whether other variables are confounders or precision variables. Prediction focuses on improving predictive accuracy by including, in an appropriate form, all covariates that improve the prediction and excluding covariates that do not. Machine learning methods can provide a tool to investigate which covariates to include and in what form.

Previous work found that machine learning methods can predict smoking cessation and forced expiratory volume in one second (FEV_{1}), a spirometric measure used to determine COPD severity [2–4]. In particular, radial basis neural networks predicted FEV_{1} using spirometry data [5] and using spirometry and demographic data [6], and the predicted and actual FEV_{1} values were highly correlated. However, prediction accuracy was better for normal subjects than for those with restrictive or obstructive conditions [5, 6]. Therefore, there is evidence that machine learning and deep learning methods can be used to predict these outcomes and can offer advantages over other models in some circumstances.

We evaluated the prediction properties of neural network models as compared to standard regression models. We used data from the UK Biobank [7], the COPDGene study [8], and the Childhood Asthma Management Program (CAMP) [9] to assess the performance of both approaches by comparing the test mean squared error (MSE) of each approach and each data set. For each study we predicted FEV_{1}, and using the UK Biobank and COPDGene study, we also predicted age at smoking cessation and log age at smoking cessation.

## Methodology

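For the linear regression model, let *y*_{i} denote the outcome for the *i*^{th} study subject, and let *k* be the number of covariates. To simplify, we denoted the covariate matrix as *X*, and **x**_{i} is the row of *X* for the *i*^{th} subject. We assumed a linear relationship and used the training set to estimate the parameters in the following equation:

$$y_i = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta} + \epsilon_i \tag{1}$$

where $\beta_0$ is the intercept, $\boldsymbol{\beta}$ is the vector of *k* regression coefficients, and $\epsilon_i$ is the error term.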
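As an illustration of the regression step, the following is a minimal sketch of estimating the parameters of a linear model by least squares on a training set and then predicting for new subjects. It is not the code used in the study (the analyses were done in R); the function names, the toy data, and the choice to solve the normal equations directly are assumptions made for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(X, y):
    """Least-squares estimates (intercept first) from the normal equations."""
    design = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_linear(beta, X):
    """Predicted outcome for each row of covariates."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]


# Hypothetical noise-free training data generated from y = 1 + 2*x1 - 0.5*x2.
X_train = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 5.0]]
y_train = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X_train]
beta_hat = fit_linear_regression(X_train, y_train)
```

Because the toy outcome is an exact linear function of the covariates, the estimated coefficients recover the generating values up to rounding error. With real data the estimates would also reflect noise.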

Neural networks are made up of layers of neurons, and the number of neurons and layers can vary depending on the data. The input layer of the neural network has one neuron for each predictor in the data set, each hidden layer has a number of neurons specified by the user, and the output layer has one neuron when predicting a single continuous outcome [10]. The number of hidden layers and the number of neurons in each hidden layer are typically determined by trial and error. For this study, we used two hidden layers. Each neuron has an associated weight, and the sum of the neurons in a layer multiplied by their weights is passed to an activation function, whose output feeds the next layer. Activation functions are specified for each hidden layer and the output layer.

For the neural network model, suppose there are *p* layers in the model, denoted *L*_{1}, *L*_{2}, ⋯, *L*_{p}. The *i*^{th} layer has *n*_{i} neurons, denoted $L_{i,1}, \dots, L_{i,n_i}$, and uses activation function *ϕ*_{i}. The activation function works as a link function and converts the input signal to the output signal on a node. For example, a linear activation function is *g*(*x*) = *x*, which is commonly used in linear regression models, while a non-linear activation function, such as the sigmoid function $g(x) = 1/(1 + e^{-x})$, can be used in a neural network model. Karlik and Olgac (2011) and Sibi et al. (2013) provide more details and a comparison of activation functions [11, 12]. The following equation is used for calculating $L_{i+1,j}$, the *j*^{th} neuron in the (*i* + 1)^{th} layer:

$$L_{i+1,j} = \phi_{i+1}\left(\sum_{k=1}^{n_i} w_{i,k}^{(j)} L_{i,k}\right) \tag{2}$$

where $w_{i,k}^{(j)}$ is the weight for the *k*^{th} neuron in the *i*^{th} layer in the sum for neuron *j*.
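The layer-to-layer computation described above can be sketched as a short forward pass. This is an illustrative example only: the weights, the network size (two inputs, two hidden layers of two neurons, one output), and the linear output activation are hypothetical choices, and no bias terms are included because the text describes only weighted sums.

```python
import math


def sigmoid(x):
    """Non-linear activation g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """Linear activation g(x) = x."""
    return x


def next_layer(values, weights, activation):
    """Each neuron in the next layer is the activation of a weighted sum
    of the neurons in the current layer (one row of weights per neuron)."""
    return [activation(sum(w * v for w, v in zip(row, values))) for row in weights]


def forward(inputs, weight_matrices, activations):
    """Propagate the input values through every layer in turn."""
    values = list(inputs)
    for weights, activation in zip(weight_matrices, activations):
        values = next_layer(values, weights, activation)
    return values


# Hypothetical weights: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
weights = [
    [[0.5, -0.5], [1.0, 1.0]],
    [[1.0, -1.0], [0.0, 2.0]],
    [[2.0, 1.0]],
]
activations = [sigmoid, sigmoid, identity]
prediction = forward([1.0, 1.0], weights, activations)
```

With identity activations in every layer, the same function reduces to a chain of linear maps, which is one way to see how linear regression fits into this framework as a special case.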

To evaluate prediction accuracy, we applied the trained models on the test data to predict FEV_{1}, age at smoking cessation, and log age at smoking cessation. We used data from the UK Biobank, COPDGene study, and CAMP. The UK Biobank is a large prospective study [7], COPDGene is a study of smokers in which participants were enrolled based on COPD affection status [8], and CAMP is a study of children with asthma [9]. For the UK Biobank and CAMP, we included subjects of European ancestry. For the COPDGene study, we included African American and non-Hispanic white participants in separate models. Ethnicity was based on self-report. To predict FEV_{1}, the models included age, sex, BMI, centered height, and squared centered height as covariates. According to previous literature, these are common factors that may be associated with FEV_{1} [13, 14]. Height and height squared were centered to reduce correlation between these two covariates. We considered two samples for prediction of FEV_{1} using the UK Biobank data, one sample which included all subjects, and another sample which only included a subset of subjects with the lowest 20% of FEV_{1} measurements to create ascertainment bias. To predict age at smoking cessation and log age at smoking cessation, we included former smokers, and the models included age, sex, age started smoking, education (attended college or university), pack years of cigarettes, and smoker in household. Age at smoking cessation was measured in the UK Biobank by asking participants who had stopped smoking “At what age did you give up?”, and in the COPDGene study by asking participants “How old were you when you completely stopped smoking?”. Characteristics of subjects are shown in Table 1.
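The reason for centering height before squaring it can be shown with a small numerical sketch. The height values below are made up for illustration and are not study data; the helper `pearson` is a plain implementation of the sample correlation coefficient.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical heights in centimetres (illustrative values only).
heights_cm = [150.0, 158.0, 163.0, 167.0, 170.0, 174.0, 179.0, 185.0, 192.0]
mean_height = sum(heights_cm) / len(heights_cm)
centered = [h - mean_height for h in heights_cm]

# Correlation between the linear and squared terms, before and after centering.
raw_corr = pearson(heights_cm, [h ** 2 for h in heights_cm])
centered_corr = pearson(centered, [c ** 2 for c in centered])
```

For these values the uncentered height and its square are almost perfectly correlated, while the centered height and its square are only weakly correlated. How small the centered correlation becomes depends on how symmetric the height distribution is.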

**Table 1.** Characteristics of subjects. For continuous variables, we give the mean and standard deviation (i.e. mean (sd)). Sample 1 is for FEV_{1} as the outcome. Sample 2 is for age at smoking cessation as the outcome and includes former smokers. Sample 3 is for FEV_{1} as the outcome for the subjects with the lowest 20% of FEV_{1}.

We randomly selected 1,000 subsets from the data sets to compare the mean test MSE of the neural network and linear regression models, with 50%, 25%, or 10% of the sampled data used as the test data. Each model was trained on the remaining 50%, 75%, or 90% of the sampled data. The activation functions and number of neurons for each model are given in Table 2, and the architecture of the models is shown in S4 and S5 Figs in S1 Appendix. As seen in Table 2, we used sigmoid functions for FEV_{1}, hard sigmoid and rectified linear unit (ReLU) functions for smoking cessation, and sigmoid functions for log smoking cessation. Analyses were done in R; we used the package ‘keras’ for the neural network analyses [15] and the package ‘caret’ for partitioning the data into test and training sets [16].
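For readers unfamiliar with the three activation functions named above, the sketch below writes them out in plain Python. The hard sigmoid shown is the piecewise-linear form used by Keras 2 (0.2x + 0.5, clipped to the interval [0, 1]); this is an assumption about the version in use, since other libraries and later Keras releases use a different slope.

```python
import math


def sigmoid(x):
    """Smooth S-shaped function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid.
    Assumes the Keras 2 definition: 0.2 * x + 0.5, clipped to [0, 1]."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


# Evaluate each function on a small grid of inputs for comparison.
grid = [-3.0, -1.0, 0.0, 1.0, 3.0]
table = [(x, sigmoid(x), hard_sigmoid(x), relu(x)) for x in grid]
```

The sigmoid and hard sigmoid are bounded, while ReLU is unbounded above, which is one practical difference to keep in mind when the outcome is on its original scale.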

## Data analysis

We applied the neural network and linear regression models to predict FEV_{1} in five samples: UK Biobank subjects of European ancestry (N = 151,879); the subset of UK Biobank subjects of European ancestry with the lowest 20% of FEV_{1} measurements (N = 29,805); COPDGene non-Hispanic white subjects (N = 6,764); COPDGene African American subjects (N = 3,365); and CAMP subjects of European ancestry (N = 698). We predicted age at smoking cessation and log age at smoking cessation in three samples: UK Biobank subjects of European ancestry (N = 21,142); COPDGene non-Hispanic white subjects (N = 4,104); and COPDGene African American subjects (N = 673). Note that all COPDGene data are from phase 1 of the study.

Density plots of the outcomes revealed FEV_{1} was normally distributed, but age at smoking cessation was right-skewed and could benefit from a log transformation. Density plots of the distributions are shown in Fig 1.
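The effect of a log transformation on a right-skewed variable can be checked numerically with the sample skewness. The sketch below uses made-up ages, not study data, and the moment-based skewness estimator is one common choice among several.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed ages at smoking cessation (illustrative only).
ages = [22, 25, 27, 28, 30, 31, 33, 35, 38, 42, 47, 55, 68]

skew_raw = sample_skewness(ages)
skew_log = sample_skewness([math.log(a) for a in ages])
```

For these values the skewness stays positive after the transformation but is clearly smaller, so the log scale reduces the asymmetry without necessarily removing it.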

**Fig 1.** Density plots of the outcomes. The plot in the top right shows the density plot of log smoking cessation (age). The plot in the bottom left shows the density plot of FEV_{1}.

We evaluated the predictive performance of the models by calculating the test MSE for each model. For every data set, we separated 50%, 75%, or 90% of the sample as the training data, and the remaining 50%, 25%, or 10% was used as the test data. Using the training data, the neural network models and the linear regression models were fit, and then these models predicted the outcome *y* for the test data.
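The evaluation procedure above (split, fit on the training part, compute MSE on the test part, repeat) can be sketched as follows. This is a simplified illustration rather than the study's code: it uses a single covariate, simulated data, a closed-form simple regression in place of the full models, and 200 repeats instead of 1,000. All names are hypothetical, and the spread reported is the standard deviation of the test MSE across repeats.

```python
import math
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_simple_regression(x, y):
    """Closed-form least squares for one covariate: (intercept, slope)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def repeated_test_mse(x, y, test_fraction, n_repeats, seed=0):
    """Mean and standard deviation of test MSE over random train/test splits."""
    rng = random.Random(seed)
    n = len(x)
    n_test = max(1, round(n * test_fraction))
    mses = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        intercept, slope = fit_simple_regression(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        preds = [intercept + slope * x[i] for i in test_idx]
        mses.append(mean_squared_error([y[i] for i in test_idx], preds))
    mean_mse = sum(mses) / n_repeats
    sd_mse = math.sqrt(sum((m - mean_mse) ** 2 for m in mses) / (n_repeats - 1))
    return mean_mse, sd_mse


# Simulated data: y = 2 + 0.5 * x + standard normal noise.
data_rng = random.Random(1)
x_all = [i / 10 for i in range(200)]
y_all = [2 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x_all]

# Test fractions mirror the 50%, 25%, and 10% splits described in the text.
results = {f: repeated_test_mse(x_all, y_all, f, n_repeats=200) for f in (0.5, 0.25, 0.1)}
```

In this toy setting the mean test MSE stays close to the noise variance for every split, while the spread across repeats grows as the test set shrinks, because each MSE is then averaged over fewer subjects.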

## Results

The MSE of the test data for the linear regression and neural network models for the different data sets, sample sizes, and different proportions of data used for the test and training data are shown in Fig 2 and in the S1-S3 Figs and S1-S3 Tables in S1 Appendix. As we decreased the test data size, the standard error of the MSE increased, while the MSE was either similar for all three test data size percentages (50%, 25%, and 10%) or decreased as the percent test data decreased.

For the prediction of FEV_{1} for all subjects, the MSE was similar for the neural network and linear regression models across all data sets, sample sizes, and proportions of test data, except for CAMP, where the MSE for linear regression was smaller than for the neural network. For the prediction of FEV_{1} for UK Biobank subjects with the lowest 20% of FEV_{1} measurements, the MSE was similar for the two models for all sample sizes and proportions of test data.

For the prediction of age at smoking cessation, the MSE was smaller for the neural network models for all data sets, sample sizes, and proportions of test data; thus, the neural network models showed an advantage in prediction over linear regression. This advantage was largest in the COPDGene study among non-Hispanic white subjects. In the COPDGene study among African American subjects, the neural network models still had a smaller MSE when predicting age at smoking cessation, but the difference was smaller than in the other data sets.

For the prediction of log age at smoking cessation, the MSE was smaller for the neural network than for linear regression across all data sets, sample sizes, and proportions of test data, except for the COPDGene study among African American subjects, for which linear regression had a slightly smaller MSE when 50% of the data was used for testing. Again, the neural network models had the largest advantage over the linear regression models in the COPDGene study among non-Hispanic white subjects.

## Discussion

We used multiple permutations of subsets of the data to compare the prediction accuracy of linear regression and neural networks for three continuous outcomes: FEV_{1}, age at smoking cessation, and log age at smoking cessation. The linear regression and neural network models had similar MSE when the outcome was normally distributed (FEV_{1}), but the neural network model generally had smaller MSE than the linear regression model when the outcome was not normally distributed (age at smoking cessation) or had been transformed (log age at smoking cessation). This difference was largest for the COPDGene study among non-Hispanic white subjects and smallest for the COPDGene study among African American subjects. The subset of the COPDGene study among African American subjects had the smallest sample size for age at smoking cessation. This could be one reason we saw a smaller difference in MSE between the two models for age at smoking cessation in this subset, and it could potentially explain why the MSE was smaller for linear regression when predicting log age at smoking cessation with 50% of the data used for testing. While the neural networks had better prediction accuracy in some scenarios, regression is more interpretable, as the coefficients in the regression model have a straightforward interpretation.

Previous research used a backpropagation neural network to classify current and former smokers, with classification performance better than chance. However, compared with a logistic regression model fit to the same data, the backpropagation neural network did not improve prediction [17]. Neural networks have also been used previously to predict FEV_{1} successfully: one study examined whether neural network models could predict FEV_{1} better than previously published predictions based on multiple regression analysis. Using the same sample of elderly adults as the previous model, the neural network predictions were found to correlate better with the FEV_{1} values than the predictions made by the regression analysis [18].

There were some limitations of our analysis. While we considered continuous outcomes, we did not consider binary outcomes. Additionally, while the neural network models generally had lower MSE than the regression models when the phenotypic distribution was skewed, we do not know if this is specific to the data we used or a general property of neural networks. Also, it is important to note that our observations are based on only a few predictors and three data sets. We used MSE of the test data to measure and compare prediction accuracy; however, other metrics could be used to measure model fit.

While we focused on covariate adjustment of spirometric and smoking phenotypes, future research could examine whether covariate adjustment using neural networks improves the performance of genome-wide association studies (GWAS) for rare or common variants. Reducing variability in the outcome should increase power for GWAS, but it is not clear whether using neural networks to improve covariate adjustment for spirometric and smoking phenotypes could lead to the discovery of novel variants. While we considered outcomes related to smoking and lung function, it could be worth considering additional health outcomes in the future.

To summarize, we compared regression and neural network analyses based on test MSE, and found for our outcomes there were scenarios where the regression and neural network models performed similarly well. However, when the phenotypic distribution was skewed in our data, the neural network model had a lower average test MSE in our analyses.

## Supporting information

### S1 Appendix. Additional tables, plots, and COPDGene study information.

https://doi.org/10.1371/journal.pone.0266752.s001

(PDF)

## Acknowledgments

This research has been conducted using the UK Biobank Resource under application number 20915 (MHC).

## References

1. Hagan M.T., Demuth H.B., Beale M.H. (1996). *Neural network design*. PWS Publishing, Boston, MA.
2. Coughlin L.N., Tegge A.N., Sheffer C.E., Bickel W.K. (2020). A machine-learning approach to predicting smoking cessation treatment outcomes. *Nicotine and Tobacco Research*, 22(3), 415–422. pmid:30508122
3. Dumortier A., Beckjord E., Shiffman S., Sejdić E. (2016). Classifying smoking urges via machine learning. *Computer Methods and Programs in Biomedicine*, 137, 203–213. pmid:28110725
4. Arefeen M.A., Nimi S.T., Rahman M.S., Arshad S.H., Holloway J.W., Rezwan F.I. (2020). Prediction of lung function in adolescence using epigenetic aging: A machine learning approach. *Methods Protoc*, 3(4), 77. pmid:33182250
5. Manoharan S.C., Ramakrishnan S. (2009). Prediction of forced expiratory volume in pulmonary function test using radial basis neural networks and k-means clustering. *Journal of Medical Systems*, 33(5), 347–351. pmid:19827260
6. Manoharan S.C., Swaminathan R. (2009). Prediction of forced expiratory volume in normal and restrictive respiratory functions using spirometry and self-organizing map. *Journal of Medical Engineering & Technology*, 33(7), 538–543. pmid:19484651
7. *Learn more about UK Biobank*, https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank.
8. Regan E.A., Hokanson J.E., Murphy J.R., Make B., Lynch D.A., Beaty T.H., et al. (2011). Genetic epidemiology of COPD (COPDGene) study design. *COPD: Journal of Chronic Obstructive Pulmonary Disease*, 7(1), 32–43.
9. Childhood Asthma Management Program Research Group. (1999). The childhood asthma management program (CAMP): design, rationale, and methods. *Controlled Clinical Trials*, 20(1), 91–120. pmid:10027502
10. Lantz B. (2013). *Machine Learning with R*. Packt Publishing Ltd.
11. Karlik B., Olgac A.V. (2011). Performance analysis of various activation functions in generalized MLP architectures of neural networks. *International Journal of Artificial Intelligence and Expert Systems*, 1(4), 111–122.
12. Sibi P., Jones S.A., Siddarth P. (2013). Analysis of different activation functions using back propagation neural networks. *Journal of Theoretical and Applied Information Technology*, 47(3), 1264–1268.
13. Kurzius-Spencer M., Holberg C.J., Martinez F.D., Sherrill D.L. (2001). Familial correlation and segregation analysis of forced expiratory volume in one second (FEV_{1}), with and without smoking adjustments, in a Tucson population. *Annals of Human Biology*, 28(2), 222–234. pmid:11293729
14. Marcon A., Accordini S., de Marco R. (2009). Adjustment for baseline value in the analysis of change in FEV_{1} over time. *Journal of Allergy and Clinical Immunology*, 124(5), 1120. pmid:19748662
15. Allaire J.J., Chollet F. (2020). keras: R interface to ‘Keras’. R package version 2.3.0.0. https://CRAN.R-project.org/package=keras.
16. Kuhn M. (2008). Building predictive models in R using the caret package. *J Stat Softw*, 28(5), 1–26.
17. Poynton M.R., McDaniel A.M. (2006). Classification of smoking cessation status with a backpropagation neural network. *J Biomed Inform*, 39(6), 680–686. pmid:16624625
18. Botsis T., Halkiotis S. (2003). Neural networks for the prediction of spirometric reference values. *Med Inform Internet Med*, 28(4), 299–309. pmid:14668132