Median regression spline modeling of longitudinal FEV1 measurements in cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) patients

Rationale Clinical phenotyping, therapeutic investigations as well as genomic, airway secretion metabolomic and metagenomic investigations can benefit from robust, nonlinear modeling of FEV1 in individual subjects. We demonstrate the utility of measuring FEV1 dynamics in representative cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) populations. Methods Individual FEV1 data from CF and COPD subjects were modeled by estimating median regression splines and their predicted first and second derivatives. Classes were created from variables that capture the dynamics of these curves in both cohorts. Results Nine FEV1 dynamic variables were identified from the splines and their predicted derivatives in individuals with CF (n = 177) and COPD (n = 374). Three FEV1 dynamic classes (i.e. stable, intermediate and hypervariable) were generated and described using these variables from both cohorts. In the CF cohort, the FEV1 hypervariable class (HV) was associated with a clinically unstable, female-dominated phenotypes while stable FEV1 class (S) individuals were highly associated with the male-dominated milder clinical phenotype. In the COPD cohort, associations were found between the FEV1 dynamic classes, the COPD GOLD grades, with exacerbation frequency and symptoms. Conclusion Nonlinear modeling of FEV1 with splines provides new insights and is useful in characterizing CF and COPD clinical phenotypes.


Introduction
Longitudinal lung function assessments in Cystic Fibrosis (CF) and Chronic Obstructive Pulmonary Disease (COPD) are critical for monitoring disease progression and response to therapy. Population-based modeling of lung function typically uses least-squares trendlines, median quantile linear regression, or smoothing spline curves [1][2][3][4][5][6][7][8][9]. These approaches are useful in assessing long-term disease progression as well as genomic/genetic risk in both CF and COPD populations. However, some studies including airway secretion microbiome and metabolomic studies require quantification of individual short-term (i.e. weeks to months) lung function variability. In most cases, individual lung function data is modeled with leastsquares trendlines focusing on the slopes and intercepts [5,10]. At least two investigations focused on individual variability as a marker of disease risk in CF and COPD populations [10,11]. However, these studies used one or two variables and only captured some of the full dynamic complexity of a specific individual's lung function.
The purpose of this analysis was to model lung function dynamics of individuals with median regression splines in order to capture clinically relevant, subject-specific variability. Furthermore we aimed to determine if this variability provides additional insights to CF and established COPD phenotypes as well as commonly assessed clinical parameter.

Study populations
The study was reviewed and approved by the UCSD HRPP (application #081500) and the Norway Regional Ethics Committee (REK 165.08) and performed in accordance with the Declaration of Helsinki and the Good Clinical Practice guidelines. All subjects signed informed consent.
The CF population consisted of all adult (!18 years of age) patients seen at the San Diego adult CF program at least once during the calendar year 2014 (n = 177). For each CF patient, we collected basic demographic (gender, age), nutritional data (height and weight) and all available clinically-indicated spirometry results dating back to 1/1/2012. Spirometry was performed in accordance with ATS guidelines and normalized using the Hankinson equations [12]. The age Ã FEV 1 % predicted (AF) product is a derived variable calculated by multiplying the age of the individual at the most recent FEV 1 by the best FEV 1 % during the previous year and is used to assess disease risk [5,13]. Each CF subject was mapped into a specific CF multidimensional clinical phenotype (MDCP) based on the most recent clinical data and the published random forests prediction model [13]. These clinical phenotypes were generated using age, gender, FEV 1 , FVC, height and weight and two derived variables: Body Mass Index (BMI) and the AF product (S1 Table). These phenotypes represent clinically relevant groups that differ not only in the class forming variables but also in microbiology (13).
The COPD data derives from the Bergen Cohort COPD Study (BCCS) [14,15]. Only patients with a minimum of four visits were included in the present study (n = 374). BCCS evaluations were attempted every 6 months for 3 years with an additional visit 3 months after  [16]. Lung function was normalized to the ECCS 1993 equations [17]. The COPD subjects were classified into their specific GOLD grades of airway obstruction using their FEV 1 % values.
Nonlinear modeling of Forced expiratory volume 1 (FEV 1 ) We modeled each subject's FEV 1 % using the estimated median regression spline. A median spline is similar to the traditional cubic smoothing spline except that the conditional median is estimated. The estimation of this spline is reproducible, and generates statistically robust smoothing curves. The median quantile regression spline is not sensitive to technical or biological outlier values [18]. Details of fitting the median splines are found in the Supporting Information (S1 Fig).
For each subject, the time between the first and most recent spirometry was divided into 1000 equally spaced time points. The predicted spline values as well as the estimated first and second derivatives were determined at each time point. From these, the following nine summary variables were determined for each subject: a) the effective degrees of freedom of the median spline, b) the range between the 95 th and 5 th percentile of the median spline and the derivatives, c) the number of local minima and maxima of the median spline and the derivatives, d) and the median values of the first and second derivative of the splines. The effective degrees-of-freedom is a proxy for the approximate degree of the equivalent polynomial fit needed to generate the median spline and is thus a measure of the variability of the data. Similarly, the number of local maxima and minima identify the instances when the spline or derivatives change direction; another but distinct measure of variability. The range between the 5 th and 95 th percentile values of the spline and derivatives quantifies the magnitude of change occurring in these curves during the study period. Finally, the median values of the first and second derivatives identify the overall rate of change of lung function and the general curvature of the median spline, respectively (S1 Fig). In addition to these variables, the slopes of the least-squares trendlines and the median of differences from the best FEV 1 % were estimated for each patient [10].

FEV 1 dynamic class formation
Class formation used the algorithms previously described [13]. The CF and COPD cohorts were analyzed separately. The nine variables were used to generate a proximity matrix using the unsupervised Random Forests algorithm [13,19]. Partitioning Around Medoids (PAM) clustering was used to create the classes (i.e. k = 3, 4 or 5) [20]. The classification error rates (i.e. out of bag error rates) were minimized using the three-class strategy (i.e. k = 3).
Dimension reduction demonstrated that the use of four variables (the effective degrees of freedom and the ranges between the 5 th and 95 th percentile of the median regression spline and the two derivatives) created classes with out-of-bag error rates of about 8%.

Patient population
For this study, 177 adult CF patients signed consent and had complete datasets including a minimum of 4 spirometry studies (Table 1). This patient population did not differ significantly from other adult CF populations in terms of age, gender, nutritional parameters or lung physiology [13,24]. There was a median of six spirometric assessments per patient in this cohort. Each CF subject was mapped to one of five regional clinical phenotypes and as a group they closely resemble the original published phenotypes (Fig 1A) [13]. The distribution of demographic and physiological parameters between the GOLD grades for the Bergen Cohort COPD Study is shown in Fig 1B [ 14,15]. There were no clinically significant differences in age, gender,   Fig 1A) and COPD ( Fig 1B) cohorts. Each subject was mapped to either a MDCP class [13] or COPD GOLD grade. The AF product of an individual is the age at the most recent FEV 1 multiplied by the best FEV 1 % during the study period. The linear slope is the slope of the least squares trendline fitted to the FEV 1 data during the study period. https://doi.org/10.1371/journal.pone.0190061.g001 BMI or loss of lung function as assessed by the slope of the least-squares trendlines between the GOLD grades. Patients in GOLD 3 and 4 had higher MMRC scores. Lung function dynamic variables in CF MDCP classes and in the COPD GOLD grades. In the CF population, there were no differences in the dynamic variables including the slope of the least-squares trendlines between subjects mapped into the published regional clinical phenotypes except for minor differences in the ranges of the median regression splines and their estimated first derivatives (Fig 2A). Similarly, the BCCS cohort demonstrated slightly smaller ranges of the splines and estimated first derivatives in GOLD grade 4 compared to grade 2. There were no significant differences in the other dynamic variables between the GOLD grades 2-4 ( Fig 2B).

FEV 1 dynamic class formation
These modest associations of the individual FEV 1 dynamic variables with the CF clinical phenotypes and COPD GOLD grades suggested that combinations of the variables were needed to characterize the dynamics of the median regression splines. In the CF cohort, the classification error rates using the k = 3 strategy decreased slightly from 10% to 8% with the use of just four of the original nine variables i.e. effective degrees of freedom, and the ranges between the 5 th and 95 th percentile values of the spline, the first and second derivatives. Similar results were obtained with the COPD cohort data. In both cohorts, the classification error rates were the lowest with the k = 3 and increased with the k = 4,5 or 6 strategies.
The three resulting classes or lung function phenotypes differed significantly in terms of the ranges of the splines and the derivatives as well as the effective degrees of freedom. The groups with the highest and lowest ranges and effective degree of freedom were labeled as hypervariable (HV) and stable (S) respectively. The class with values that were between was labeled intermediate (I). The four classifying variables were clearly separated between the CF and COPD classes (Fig 3).
Spaghetti plots of the CF subjects show the values of the median spline and the first two derivatives as a function of time (Fig 4). These plots demonstrated that all three classes of splines had similar absolute values and general rate of decline in FEV 1 % predicted. In contrast, the derivative plots clearly demonstrate that the FEV 1 dynamic class S has little dynamic change whereas class HV was hypervariable. Class I had intermediate variability.
Spaghetti plots of the three COPD FEV 1 dynamic classes showed similar median spline values and the mean rates of lung function loss in each of the GOLD grades ( Fig 5). As with the CF cohort, the plots of the derivatives clearly showed that the COPD FEV 1 dynamic class S was very stable compared to class HV. Again, class I demonstrated intermediate variability.

FEV 1 dynamic classes versus clinical parameters
In the CF cohort, there were no differences in age, AF product, BMI, or slope of the leastsquares trendlines between subjects in the three lung dynamic classes (Fig 6). Similarly, subjects in the BCCS FEV1 dynamic classes did not differ significantly in terms of age, AF product, BMI or slopes of the least-squares trendlines. The median difference from the best FEV 1 % is associated with a faster rate of lung function decline in CF patients [10]. Compared to CF patients with the stable phenotype, the hypervariable CF subjects had a higher median difference from the best FEV 1 % (HV/S ratio of 1.71 with a p value < .001) which is compatible with the higher risk associated with this hypervariable lung function phenotype.

FEV 1 dynamic classes vs clinical phenotypes
Mosaic plots were used to compare the distribution of categorical variables between the FEV 1 dynamic classes (Fig 7). Mosaic plots assume proportional distribution of the categorical variables to be the same in the subclasses as it is in the whole population [22,23]. The size of the boxes is proportional to the number of subjects that share the attributes of the categories. Boxes that statistically deviate significantly from expected result are shaded. In the CF cohort, the stable CF FEV 1 dynamic class (S) occurred 2.1 times as often than expected (expected values calculated by the Chi-squared test of independence) in male subjects of the regional CF MDCP D, a milder clinical phenotype (Fig 7A). In contrast, the females in regional CF MDCP C were classified twice as much as expected with the hypervariable FEV 1 dynamic class HV. This clinical phenotype is female-dominated and has a high rate of phenotype transition over a three-year period [13]. Finally another strong association noted in CF data is one between the FEV 1 dynamic class I with male subjects in regional CF MDCP B (2.1 times the expected level), an older, male-dominated phenotype with poor lung function and nutritional status and a high rate of death and lung transplantation.
The COPD cohort demonstrated a strong positive association between the stable COPD FEV1 dynamic class S in GOLD grade 4 patients with the number of pulmonary exacerbations (4.0 times the expected value) and high MMRC dyspnea scores (2.4 times the expected value) (Fig 7B and 7C). In contrast, the more variable COPD FEV1 dynamic class I was associated with milder GOLD grade 2 and lower MMRC dyspnea scores (1.3 times expected value). In a

Discussion
We demonstrate the feasibility of modeling individual FEV 1 % with median regression splines in CF and COPD subjects. Variables capturing the dynamics of the splines and their predicted first and second derivatives were used to form lung function variability classes. A dimension reduction strategy that reduced the data complexity identified groups of subjects with similar spline dynamics. This strategy retained variables which capture the magnitude and variability of the changes in these curves and maintained classification error rates in the 5-10% range. In larger studies, it will likely be more useful to further refine the FEV 1 dynamic classes by using a greater number of the variables for class formation.
The dynamics of FEV 1 were not included in the formation of the regional CF phenotypes and yet the different airway lung function dynamic classes were enriched in specific clinical phenotypes. For example, males in the regional CF clinical phenotype, CF MDCP D, were positively associated with the most stable FEV 1 dynamic class S consistent with this phenotype's milder, more stable clinical course. The intermediate CF FEV 1 dynamic class I was over-represented in CF MDCP B, another older, male-dominated phenotype that is associated with the  highest rates of death and lung transplantation. Finally, the hypervariable CF lung function phenotype was seen more frequently than expected in two female-dominated CF clinical phenotypes, MDCP C and E. CF MDCP C is clinically unstable with the highest rate of phenotype transition over three years and a high frequency of sputum cultures positive for fungal species  https://doi.org/10.1371/journal.pone.0190061.g007 [13]. It is unclear if the distinctive characteristics of the airway microbiome in these two phenotypes are responsible for the lung function variability.
The modeled lung function in the COPD cohort behaved distinctly from the CF cohort. The ranges of the estimated second derivatives of the COPD cohort was approximately 10% of the CF cohort which likely resulted from differences in the pathophysiology of the disease and the indications for the spirometry assessments i.e. irregularly-timed, clinically indicated studies for the CF population versus more regularly-spaced research spirometries in the BCCS cohort. Although the COPD cohort showed no strong associations between gender, the FEV 1 dynamic classes and GOLD 2,3,or 4 grades, it surprisingly showed positive associations between the stable COPD FEV 1 dynamic class (S), the severe GOLD grade 4 and both the number of exacerbations and higher MMRC dyspnea scores. The more variable FEV 1 dynamic classes (class I and HV) were enriched in subjects with milder GOLD grades 2 and 3 and lower pulmonary exacerbations rates and MMRC dyspnea scores. In this COPD cohort lung function variability appeared inversely associated with the MMRC scores, exacerbation rates and underlying FEV 1 %. A survivor effect might explain the overrepresentation of the stable FEV 1 dynamic class in GOLD grade 4. With very low FEV 1 , even small changes in FEV1 may lead to exacerbation.
This analysis modeled individual lung function measurements over time, summarized the changes, and then used the summary statistics to generate groups of subjects with similar dynamic behavior. Depending on the modeling objective, there are many alternative statistical models that can be considered. Linear mixed models are a common approach for modeling longitudinal data, where population-level mean response is modeled as a linear function of time and subject specific random effects are used to characterize the between subject and within subject variations. Szczesniak et al. successfully extended this approach in CF patients by using semi-parametric mixed models and penalized regression splines for a more flexible mean structure in the mixed modeling approach [25]. Their modeling of FEV 1 % decline using semi-parametric nonlinear models suggested that the trends for survivors and non-survivors diverged around 12 years of age. Using a Bayesian framework, Moss et al. investigated several change point models including mixture models, for the decline of lung function in children and adolescents with CF [8]. The objective was to estimate the magnitude of lung function decline, along with the factors associated with the decline, and still account for individual changes over time. The results supported the hypothesis that there are two groups of CF adolescents, one with a change point in the decline of lung function and one without a change point. Recently, Szczesniak et al. used the statistical approach of sparse functional principal components analysis to classify patients into distinct phenotypes using longitudinal FEV1 trajectories [9]. The classes were determined by a quantile rule using the first and third tertiles of scores from the first functional principal component. This is in contrast to our study where the groups were determined by the data in an unsupervised method.
Two additional studies demonstrated the utility of characterizing individual variability of FEV 1 % in CF and COPD populations using a single measure of lung function variability [10,11]. Morgan et al. demonstrated that the median deviation from the best FEV 1 was the best predictor of lung function loss over a two-year period. Casanova et al. looked at a very different model of individual FEV 1 variability i.e. the number of spirometry studies that exceeded the annual loss of lung function and found that FEV 1 variability was not significantly associated with 2 year mortality. In contrast, this current study captures the complexity of lung function dynamics using multiple measures of the variability of the median regression splines.
There are several limitations of this study. The generalizability of the clinical associations presented in this study will be limited to the extent that the datasets may not represent the adult CF population or the general COPD population. Comparisons of basic demographic data of the CF cohort and the CF Foundation Patient Registry and also the BCCS cohort to the broader ECLIPSE dataset suggest that the findings will generally hold up in larger studies [4,24]. Another limitation involves how the number and timing of the spirometry data affects spline dynamics or FEV 1 dynamic class assignment in both clinical cohorts. We expect that stable patients with milder CF clinical phenotypes would have fewer studies and would more likely be classified into the stable FEV1 dynamic class. To the extent that CF dynamic class formation is based on real life use of spirometry, CF lung function class formation with encounter-based registry data may improve its associations with clinical phenotypes. In contrast, mapping COPD patients with clinically-indicated spirometry data into these COPD lung function dynamic classes would likely not provide valid insights.
In summary, nonlinear modeling of lung function using median regression splines provides unique opportunities to quantify short-term airway physiology variability and to associate this variability with important clinic parameters including clinical phenotypes and other dynamic factors that alter airway physiology such as the airway microbiome and metabolome or the host acute immune response.

S1 Fig. Median regression spline and derivative generation and dynamic variables.
Each subject's FEV 1 % is modeled and estimated using a median regression spline with the smoothing parameter chosen by generalized cross validation (GCV) based on the quantile function. The median smoothing spline is the solution to a minimization problem and fits a piecewise cubic polynomial with the join points at the unique set of time or x-values. The piecewise polynomials are constructed so that the entire curve has continuous first and second derivatives. The analysis and computations were performed using R and the qsreg median spline regression function with the default parameters [21,26]. Case Report. A case report demonstrates how the median regression spline was used to capture the short-term dynamics of the FEV 1 % predicted. The patient is a 43-year old, pancreatic sufficient woman with a CFTR genotype of dF508/Q372Q. The latter, synonymous mutation is a variation of a canonical splice site sequence at the exon-intron boundary at Exon 7. Although dF508 is considered not responsive to ivacaftor therapy, the responsiveness of this specific Q372Q mutation was unclear. The patient grows Escherichia coli, Aspergillus fumigatus, Scedosporium azoospermia and methicillin-sensitive Staphylococcus aureus. Therapies include inhaled 7% hypertonic saline, recombinant human DNase, and tobramycin. Oral medications include the chronic use of oral azithromycin and voriconazole. In the past, she was on chronic inhaled amphotericin because of recurrent hemoptysis. The Figure demonstrates the summary variables of the median regression spline that were used to capture the dynamics of the FEV 1 used in the CF and COPD cohorts. The measured FEV 1 % predicted values (black circles) and the median spline (black curve) are shown for the patient in the case report. Also depicted, are the periods of oral antibiotic (horizontal grey) and intravenous antibiotic (horizontal black) therapy. Off-label use of ivacaftor therapy is indicated with the horizontal red bar. The interval between the earliest and most recent FEV 1 value was subdivided into 1000 equal spaced points. From these points, the median, as well as the 5 th and 95 th percentile values were predicted. Additional derived variables include the range between the values of 5 th and 95 th percentile points (vertical grey) bar and the number of local maxima and minima (red triangles). These variables were also identified in the first and second derivatives. Finally, in addition to these variables, the slope of the least squares regression line (thin grey trendline) and the effective degrees of freedom of the median regression spline were estimated. (TIF) S1 Table. Multi-dimensional Clinical Phenotypes (MDCPs). The mean values of the FEV1% predicted (NHANES III), Brasfield chest Xray score, age, AF product, BMI and percent of male subjects in each of the regional Cystc Fibrosis clinical phenotypes. Adapted from PLOS ONE 2015;10:e0122705 (13