Assessing eligibility for lung cancer screening using parsimonious ensemble machine learning models: A development and validation study

Background Risk-based screening for lung cancer is currently being considered in several countries; however, the optimal approach to determine eligibility remains unclear. Ensemble machine learning could support the development of highly parsimonious prediction models that maintain the performance of more complex models while maximising simplicity and generalisability, supporting the widespread adoption of personalised screening. In this work, we aimed to develop and validate ensemble machine learning models to determine eligibility for risk-based lung cancer screening. Methods and findings For model development, we used data from 216,714 ever-smokers recruited between 2006 and 2010 to the UK Biobank prospective cohort and 26,616 high-risk ever-smokers recruited between 2002 and 2004 to the control arm of the US National Lung Screening Trial (NLST), a randomised controlled trial. The NLST randomised high-risk smokers from 33 US centres with at least a 30 pack-year smoking history and fewer than 15 quit-years to annual CT or chest radiography screening for lung cancer. We externally validated our models among 40,593 participants in the chest radiography arm and all 80,659 ever-smoking participants in the US Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The PLCO trial, recruiting from 1993 to 2001, analysed the impact of screening with chest radiography versus no chest radiography on lung cancer outcomes. We primarily validated in the PLCO chest radiography arm so that we could benchmark against comparator models developed within the PLCO control arm. Models were developed to predict the risk of 2 outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. We assessed model discrimination (area under the receiver operating curve, AUC), calibration (calibration curves and expected/observed ratio), overall performance (Brier scores), and net benefit with decision curve analysis.
Models predicting lung cancer death (UCL-D) and incidence (UCL-I) using 3 variables (age, smoking duration, and pack-years) achieved or exceeded parity in discrimination, overall performance, and net benefit with comparators currently in use, despite requiring only one-quarter of the predictors. In external validation in the PLCO trial, UCL-D had an AUC of 0.803 (95% CI: 0.783, 0.824) and was well calibrated, with an expected/observed (E/O) ratio of 1.05 (95% CI: 0.95, 1.19). UCL-I had an AUC of 0.787 (95% CI: 0.771, 0.802) and an E/O ratio of 1.00 (95% CI: 0.92, 1.07). At 5-year risk thresholds of 0.68% and 1.17%, respectively, the sensitivity of UCL-D was 85.5% and that of UCL-I was 83.9%: 7.9% and 6.2% higher than the USPSTF-2021 criteria at the same specificity. The main limitation of this study is that the models have not been validated outside of UK and US cohorts. Conclusions We present parsimonious ensemble machine learning models to predict the risk of lung cancer in ever-smokers, demonstrating a novel approach that could simplify the implementation of risk-based lung cancer screening in multiple settings.

• Screening for lung cancer among those at high risk could reduce lung cancer-specific mortality by 20% to 24% among those screened, but the ideal way to determine whether someone is at high risk remains uncertain, and existing approaches are resource intensive.

What did the researchers do and find?
• We used data from the UK Biobank and US National Lung Screening Trial to develop novel, parsimonious models to simplify the prediction of lung cancer risk and selection into lung cancer screening programmes.
• Using ensemble machine learning and 3 predictors (age, smoking duration, and pack-years), we found our models achieved or exceeded parity in performance with leading comparators despite requiring one-third of the variables.
• Our models were externally validated in the US Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial and benchmarked against models that are either in use or have performed strongly in previous analyses.

Introduction
Screening, early detection, and disease prevention programmes are increasingly bespoke, with risk prediction algorithms determining an individual's eligibility and management [1][2][3]. Such personalisation promises to improve the benefit-to-harm profile of such interventions and ultimately health outcomes [4][5][6]. However, the delivery of these programmes at a population scale requires 2 conditions of risk prediction models: that they generalise well to contexts where there are insufficient data for model development, retraining, or validation; and that the trade-off between model complexity and implementation feasibility is considered. Screening for lung cancer (the foremost cause of death from cancer worldwide [7]) with low-dose computed tomography (LDCT) has been associated with a 20% to 24% reduction in lung cancer-specific mortality among those who are at high risk and are screened [8,9]. However, the ideal method to identify those at high risk remains unresolved. The US Preventive Services Taskforce (USPSTF) recommends the use of dichotomous criteria (age 50 to 80, ≥20 pack-years smoked, and <15 quit-years for former smokers) to select screening participants [10]. Nevertheless, identifying individuals for lung cancer screening based on risk prediction models has been shown to have both better benefit-to-harm profiles and cost-effectiveness than using dichotomous risk factor-based criteria alone [11][12][13][14], leading to risk-model-based selection criteria in European lung cancer screening pilots [15].
To date, most externally validated prediction models for lung cancer have been developed in United States datasets [12,[16][17][18][19][20][21], reflecting the relatively limited availability of suitable cohorts with long-term follow-up for prediction modelling. This implies that most global healthcare systems that implement risk-based lung cancer screening will use prediction models developed in a US population, often using variables such as ethnicity, whose categorisation varies between countries and individual datasets, and academic qualifications that differ both over time and between jurisdictions. In the United Kingdom, existing models have been shown to underperform in specific groups, such as the more socioeconomically deprived, where underestimation of risk could lead to a screening programme systematically widening health inequalities [22].
Furthermore, the risk models currently in use are a challenge to implement. In the UK, eligibility for lung cancer screening pilots is based on the PLCOm2012 and Liverpool Lung Project risk models, requiring 17 unique variables, few of which are routinely available [23].
Collecting these variables from a potentially eligible individual and explaining the results currently averages between 5 and 10 min. To determine the screening eligibility of 1 million people would therefore require between 48 and 95 full-time staff working for a year. In the UK, there are an estimated 6.8 million ever-smokers aged 55 to 74 who are potentially eligible for lung cancer screening, with another 500,000 turning 55 on average each year [24,25]. As lung cancer screening is just one of several risk-based programmes that are either in development or in use, in their current form, these assessments present a formidable obstacle to the effective implementation of national screening programmes.
In this study, we hypothesised that by using ensemble machine learning with training data spanning different geographic regions, populations, and average risk levels, we could develop predictive models for lung cancer screening that use a minimum number of features and have broad applicability. In so doing, we aimed to combine the simplicity of risk-factor-based criteria with the improved predictive performance of risk models, while maintaining generalisability to new settings.

Ethical statement
The University College London Research Ethics Committee gave ethical approval for this study (reference: 19131/001).

Data sources and study population
Development and internal validation datasets. For model development, we first used data on 216,714 ever-smokers without a prior history of lung cancer from the UK Biobank [26] before creating a multicountry dataset that combined UK Biobank and US National Lung Screening Trial (NLST) [8] data (n = 26,616) (Fig 1; participant flow diagrams in Figs A and B in S1 Appendix). The UK Biobank is a large prospective cohort recruited between 2006 and 2010 from 22 British centres that combines phenotypical data with ongoing linkage to hospital and registry data [27]. During this timeframe, the UK did not have a systematic screening programme for lung cancer. The NLST was a randomised controlled trial of lung cancer screening comparing computed tomography (CT) against chest radiography in 33 US centres, recruiting between 2002 and 2004 with follow-up through 2009 [28]. Participation in the NLST was restricted to those considered at high risk of developing lung cancer: a 30 pack-year smoking history and, if a former smoker, having quit within 15 years of enrolment [28].
We selected the NLST because it is geographically distinct, includes a higher risk cohort, and has greater ethnic diversity than the UK Biobank.By combining NLST data with the UK Biobank, which by contrast is known to represent a cohort with lower mortality risks than the UK general population [29], our prediction models would be trained on a wider range of participants, with potentially improved model performance.
External validation datasets. For model validation, we used data from 40,593 ever-smokers without a prior history of lung cancer from the chest radiography arm of the US Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO) [30] (Fig C in S1 Appendix). This allowed benchmarking against comparator models that were developed in the control arm of the PLCO trial. Chest radiography was found to have no impact on lung cancer mortality, nor a statistically significant impact on lung cancer incidence [30]. In secondary analyses presented in S1 Appendix, we report model performance in both arms of the full PLCO dataset together (n = 80,659).

Missing data
We used multiple imputation by chained equations (MICE) with predictive mean matching to generate imputed development and validation datasets [31]. We generated 10 imputed sets of the UK Biobank, based on an average missingness among candidate predictors in the UK Biobank of 11%. As missingness was <5% for all relevant variables in the PLCO and NLST, we created 5 imputed PLCO and NLST datasets. See Table A in S1 Appendix.
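The predictive-mean-matching step at the heart of this approach can be illustrated with a short sketch. This is a hypothetical single-predictor version (the `pmm_impute` function and its arguments are ours, not the study's code); MICE chains such draws across variables and repeats the whole process to produce each imputed dataset.

```python
import random
import statistics

def pmm_impute(x_obs, y_obs, x_mis, k=3, seed=0):
    """Impute missing y values with predictive mean matching (PMM).

    Fit a simple linear regression of y on x using complete cases, then,
    for each case with missing y, borrow the observed y of one of the k
    complete cases whose predicted values are closest (the "donors").
    Illustrative single-predictor sketch only.
    """
    rng = random.Random(seed)
    # Ordinary least squares slope and intercept on complete cases.
    mx, my = statistics.fmean(x_obs), statistics.fmean(y_obs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs)) / \
            sum((xi - mx) ** 2 for xi in x_obs)
    intercept = my - slope * mx
    pred_obs = [intercept + slope * xi for xi in x_obs]
    imputed = []
    for xm in x_mis:
        pm = intercept + slope * xm
        # The k complete cases with nearest predicted values act as donors.
        donors = sorted(range(len(x_obs)),
                        key=lambda i: abs(pred_obs[i] - pm))[:k]
        # Draw one donor at random and borrow its observed value.
        imputed.append(y_obs[rng.choice(donors)])
    return imputed
```

Because imputed values are always borrowed from observed donors, PMM preserves the observed distribution of each variable, which is why it is a common default within MICE.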

Outcomes
We developed models to predict the absolute cumulative risk of 2 outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. Lung cancer status and primary cause of death in the UK Biobank were determined by linked national cancer registry and Office for National Statistics data [26]. In the NLST and PLCO, primary cause of death was confirmed by independent review of death certificates [28,30]. In the NLST, lung cancer diagnoses were ascertained through medical record abstraction, and in the PLCO, through mailings and telephone calls to participants [8,30].

Model development
We developed ensembles of machine learning pipelines using AutoPrognosis, open-source automated machine learning software [32,33]. In this analysis, AutoPrognosis was used to optimise pipelines consisting of a variable preprocessing step followed by model selection and training. These optimised pipelines were subsequently combined, and a single prediction for any individual was generated by a weighted combination of the predictions made by each of the 4 pipelines independently, with weighting by Bayesian model averaging (Fig 1) [32]. We trialled model algorithms including logistic regression, random forests, and gradient boosting approaches (see subsection Model Development in S1 Appendix for further details). Throughout, pipelines were trained and selected to maximise model discrimination, measured with the area under the receiver operating curve (AUC).
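The final combination step reduces to a weighted average of the per-pipeline predicted probabilities. A minimal sketch, assuming the Bayesian model averaging weights have already been estimated (the function name `ensemble_predict` is ours, not AutoPrognosis's API):

```python
def ensemble_predict(pipeline_probs, weights):
    """Combine per-pipeline predicted risks into one prediction per person.

    pipeline_probs: one list of predicted probabilities per pipeline.
    weights: one non-negative weight per pipeline (e.g., from Bayesian
    model averaging); they are normalised to sum to 1 before averaging.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(pipeline_probs[0])
    # Weighted average across pipelines, per individual.
    return [sum(w * probs[i] for w, probs in zip(norm, pipeline_probs))
            for i in range(n)]
```

With equal weights this is a simple mean of the pipelines' predictions; unequal weights let the ensemble lean towards the pipelines that performed best during optimisation.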

Model explanation
We used the Kernel Shapley Additive Explanations (SHAP) [34] algorithm for model explanation and analysis of predictor interactions (Fig 1). Kernel SHAP is a permutation-based method theoretically grounded in coalitional game theory. In summary, each variable is passed to a model one by one, with the resulting change in predictions attributed to that variable [35,36]. Further details are available in subsection Variable Importance and Interaction in S1 Appendix.
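For intuition, Shapley attributions can be computed exactly when the feature count is tiny by enumerating every coalition; Kernel SHAP approximates this same weighted average of marginal contributions by sampling. A sketch under the common convention of replacing "absent" features with a baseline value (function names are ours, not the SHAP library's API):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, baseline, instance):
    """Exact Shapley attributions for one prediction.

    predict: function taking a full feature vector. Features absent from
    a coalition are set to their baseline value. Exhaustive enumeration
    is feasible here because only a handful of features are involved.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight |S|!(n-|S|-1)!/n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(set(subset) | {i}) - value(set(subset)))
    return phi
```

By the efficiency property, the attributions sum to the difference between the prediction for the instance and the prediction at the baseline, which is what makes SHAP plots like those in Fig 1 additive.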

Variable selection
For pragmatic reasons, we considered candidate predictors from the UK Biobank that were also present in the NLST and PLCO (Table B in S1 Appendix). We settled on our final list of predictors based on the literature, domain expertise, variable distributions, generalisability to multiple settings, and model discrimination in the UK Biobank. This led to the development of the full models. Reviewing feature importance, we found that age, smoking duration, and pack-years were driving predictions, leading us to develop models using these 3 predictors (Fig G in S1 Appendix).

Statistical analysis
We considered a model's overall performance with the Brier score [37], discrimination with the AUC, calibration with calibration curves and the ratio of expected-to-observed cases, and clinical usefulness with decision curve analysis [38]. Calibration curves were calculated by splitting individuals into 10 risk deciles based on their predicted risk before comparing predicted probabilities against observed risk, the latter calculated using a Kaplan-Meier model. For a measure of clinical utility, we considered the net benefit of models across a range of risk thresholds [38]. We compared model discrimination with a two-tailed bootstrap test using the methods of Hanley and McNeil, modified by Robin and colleagues [39,40]. To determine potential risk thresholds for our models, we used a fixed population strategy, setting thresholds such that the number of individuals eligible for screening in the entire PLCO external validation dataset matched the number selected by the USPSTF-2021 criteria.
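These metrics have compact definitions. A sketch of three of them on binary outcome labels (function names are ours; the study's versions additionally account for censoring via Kaplan-Meier estimates):

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

def expected_observed_ratio(y_true, y_prob):
    """Calibration-in-the-large: total predicted events / observed events."""
    return sum(y_prob) / sum(y_true)

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a risk threshold, as in decision curve analysis:
    true positives are credited and false positives debited at the odds
    implied by the threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)
```

A decision curve is simply `net_benefit` evaluated over a grid of thresholds, compared against the "screen all" and "screen none" strategies.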
In both internal and external validation, we generated 1,000 bootstrap resamples with replacement for all analyses; central estimates and 95% confidence intervals were calculated with the percentile method. We used optimism-corrected metrics for internal validation. All analyses were conducted with R [41] and Python [42].
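The percentile method can be sketched in a few lines: resample the data with replacement, recompute the statistic each time, and read the confidence limits off the empirical distribution (a generic illustration, not the study's code):

```python
import random

def bootstrap_ci(values, stat, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for an arbitrary statistic.

    Draws n_boot resamples with replacement, applies `stat` to each, and
    returns the alpha/2 and 1 - alpha/2 empirical quantiles.
    """
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(stat([values[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

The same resampling machinery underlies the two-tailed bootstrap test for AUC differences: compute the difference in AUCs in each resample and check where zero falls in the resulting distribution.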

Model comparisons
For benchmark comparisons, we compared our new models to the USPSTF-2021 criteria (age 50 to 80, ≥20 pack-year smoking history, and quit within the last 15 years if a former smoker) [10], as well as existing risk models that are either in use (PLCOm2012 [18] and Liverpool Lung Project (LLP) version 2 [43]) or have been externally validated and consistently shown to outperform other risk models (the Lung Cancer Death Risk Assessment Tool [LCDRAT] and Lung Cancer Risk Assessment Tool [LCRAT] [19]) (Table C in S1 Appendix) [13,22,44,45]. All comparator models predict the five-year risk of death (LCDRAT) or of developing lung cancer (LCRAT, LLP), except for the PLCOm2012, which predicts the six-year risk of lung cancer occurrence. A third, recalibrated version of the LLP has been developed. Because it is not currently in use, we present full comparative analyses in S1 Appendix but note that, as it uses the same predictors and coefficients as LLP version 2, its discrimination is equivalent. Further, we also compared against Cox models developed using the same dataset (see Methods in S1 Appendix) and the constrained versions of the LCDRAT, LCRAT, and PLCOm2012 models.
All variables were available for comparator models except the LLP. For the LLP, in the UK Biobank, data were not available for the age at which a family member developed lung cancer. Following ten Haaf and colleagues [44], and reflecting UK lung cancer epidemiology [46], we assumed that all those with a family history of lung cancer were aged over 60. In the PLCO dataset, asbestos exposure and prior history of pneumonia were not available and were set to zero. We used the lcmodels package in R to calculate predictions for the PLCOm2012, LCRAT, and LCDRAT models [47].

Results
The descriptive characteristics of the UK Biobank and NLST development datasets and the PLCO external validation dataset are presented in Table 1. Characteristics by outcome are presented in Tables D-G in S1 Appendix. The number of cancers diagnosed and deaths from lung cancer are presented by follow-up period in Table H in S1 Appendix.
We found that age, smoking duration (years), and pack-years of smoking drove most predictions. This led us to focus our analyses on developing 2 models, UCL-D and UCL-I, that used just these 3 variables. UCL-D predicts the five-year risk of dying from lung cancer and was a weighted ensemble consisting of 4 modelling algorithms: AdaBoost [48,49], LightGBM [50], Logistic Regression, and Linear Discriminant Analysis. UCL-I predicts the five-year risk of developing lung cancer and included AdaBoost [48,49], LightGBM [50], Bagging, and CatBoost [51] algorithms. Details of the ensemble pipelines, their weightings, and algorithm hyperparameters are presented in Figs H and I and Tables I and J in S1 Appendix.

Calibration
The UCL models were well calibrated across the risk thresholds at which eligibility for screening is typically set, tending modestly towards underprediction in the highest risk decile in the PLCO radiography arm (Fig 2). By contrast, PLCOm2012 and LCRAT tended modestly towards underprediction at deciles corresponding to observed risks of 1% to 4%, which is more clinically disadvantageous than overprediction. As the PLCOm2012, LCDRAT, and LCRAT models were developed in the control arm of the PLCO trial, the strong relative performance of the UCL models is notable. All models modestly overpredicted risk in the UK Biobank cohort, with the extent of overprediction most notable for LLP version 2.
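The decile-based calibration assessment can be reproduced in outline as follows. This sketch uses the raw event proportion as the observed risk, whereas the paper uses a Kaplan-Meier estimator to handle censoring (the function name is ours):

```python
def calibration_deciles(y_true, y_prob):
    """Mean predicted vs. observed risk by decile of predicted risk.

    Returns up to 10 (mean_predicted, observed_proportion) pairs, one
    per decile of predicted risk, sorted from lowest to highest risk.
    """
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    n = len(order)
    points = []
    for d in range(10):
        idx = order[d * n // 10:(d + 1) * n // 10]
        if not idx:
            continue
        mean_pred = sum(y_prob[i] for i in idx) / len(idx)
        observed = sum(y_true[i] for i in idx) / len(idx)
        points.append((mean_pred, observed))
    return points
```

Plotting these pairs against the 45-degree line gives a calibration curve like Fig 2; the overall expected/observed ratio is the analogous single-number summary.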

Overall performance
When considering Brier scores, an overall measure of model performance comparing the closeness of predicted probabilities and observed outcomes [37], there was little or no distinction between the models in the UK Biobank and the PLCO radiography arm (Tables L and M in S1 Appendix). In the PLCO radiography arm, both models predicting the five-year risk of death, UCL-D and LCDRAT, had a Brier score of 0.0084 (95% CI: 0.0075, 0.0093). Brier scores vary with prevalence; consequently, models predicting the risk of developing lung cancer had higher scores. Nevertheless, the same pattern was observed: UCL-I had a Brier score of 0.0153 (95% CI: 0.0142, 0.0164), LCRAT a score of 0.0152 (95% CI: 0.0143, 0.0164), and LLP version 2 a score of 0.0153 (95% CI: 0.0143, 0.0165).

Risk thresholds to select individuals for screening
Using the USPSTF-2021 criteria, 34,654 (43.0%) of the entire PLCO dataset would be eligible for lung cancer screening. All UCL models had higher sensitivity than the USPSTF-2021 criteria at an equivalent specificity, with the gains in sensitivity higher when predicting five-year risk of death from lung cancer (Table O in S1 Appendix). At the aforementioned risk cut-offs, 96.2% of individuals selected by UCL-D would also have been eligible for screening with UCL-I. By 10 years of follow-up, those selected for screening with UCL-D but not UCL-I tended towards a greater risk of developing and dying from lung cancer than those selected by UCL-I but not UCL-D, though this trend was not statistically significant (Fig J in S1 Appendix; log-rank test: p = 0.15 for differences in lung cancer deaths and p = 0.41 for differences in lung cancers). Using decision curve analysis, at all risk thresholds, the net benefit of the UCL models was greater than screening using the USPSTF-2021 criteria (Fig 3 and Fig K in S1 Appendix).
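The fixed population strategy reduces to picking the risk cut-off that makes a model select exactly as many people as the reference criteria. A minimal sketch, where `n_eligible` would be the 34,654 individuals selected by USPSTF-2021 (the function name is ours):

```python
def matched_threshold(risks, n_eligible):
    """Risk threshold selecting the same number of people as a reference
    criterion: the n_eligible-th highest predicted risk. Everyone with a
    predicted risk at or above this value is offered screening, so the
    resulting model-based and criteria-based cohorts are the same size,
    allowing sensitivity to be compared at equal specificity."""
    return sorted(risks, reverse=True)[n_eligible - 1]
```

Comparing who falls above this threshold against who meets the dichotomous criteria is then what yields the sensitivity gains reported above.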
https://doi.org/10.1371/journal.pmed.1004287.t003
Ideal calibration is 1, where the expected and observed numbers of cancers are the same. The UCL models were well calibrated across most subgroups. Specifically, calibration was good across age groups, sexes, and smoking statuses. As the number of outcomes in subgroups was very small, confidence intervals in certain subgroups were wide. Notably, all models underpredicted risk in at least 1 ethnic subgroup. Of predictors used in other models but not in the UCL models, there was some underprediction of risk among those with COPD and those with a family history of lung cancer.

Discussion
We have developed parsimonious models for lung cancer screening that combine the simplicity of existing risk factor-based criteria with the predictive performance of complex risk prediction models. Furthermore, we show in benchmarking comparisons that ensemble machine learning models with 3 predictors (age, smoking duration, and smoking pack-years) have equivalent predictive performance and clinical usefulness to existing models requiring 11 predictors.
In this analysis, we used ensemble machine learning to leverage the predictions of several optimised model pipelines. Ensemble modelling is based on the concept that different models make different types of mistake, such that their errors partially cancel each other out and a combination of statistical models can be expected to improve on the performance that any one might achieve [52]. By iteratively trialling and optimising a wide range of modelling approaches before subsequently creating ensembles of these approaches, AutoPrognosis ensures that the strongest-performing model for that dataset will be derived and allows reproducibility by transparently showing how models were selected. This avoids the need to develop multiple independent models.
In the UK, eligibility for National Health Service screening pilots is based on meeting either a five-year absolute risk of lung cancer of ≥2.5% with the LLP risk score or a six-year absolute risk of ≥1.51% with the PLCOm2012 [23]. The use of 2 risk scores where eligibility differs by more than a percentage point in predicted absolute risk, and where a higher risk is tolerated over a five-year period than a six-year period, highlights the policy challenge in adopting the optimal risk-based approach for a particular setting. This approach requires the collection of 17 unique predictors, as well as the mapping of US educational levels and US ethnicity categorisations to the UK. With an estimated 7 million current smokers in the UK [25], even ignoring former smokers, the time and resource requirements to determine screening eligibility at a population scale will be challenging. Using 3 unambiguous variables but with equivalent or improved performance, the UCL models could be completed more easily online or in primary healthcare, simplifying the implementation of lung cancer screening. A potential risk of using only a few predictors is that the models will underperform in particular subgroups. Across all major subgroups, the performance of the UCL models was equivalent to existing models, and importantly this included all 4 ethnic groups available in this analysis. The UCL models were also well calibrated in different subgroups, with no sex, smoking status, or age differences. However, there was some undercalibration in 2 groups used as predictors in comparator models: those with a history of COPD and those with a family history of lung cancer. Furthermore, all models were undercalibrated in at least 1 ethnic group. As discrimination was good for most models, this suggests that lower thresholds could be considered, particularly in black populations, analogous to the relaxation in the age and smoking intensity criteria made between the USPSTF-2013 and USPSTF-2021 screening recommendations [53]. More
work is required to improve calibration among different ethnicities, although it is notable that simply including this predictor in models had limited impact.
In keeping with Katki and colleagues [19], we found that UCL-D, predicting the risk of death from lung cancer, had greater discrimination than models predicting lung cancer occurrence. In these analyses, there was >96% overlap between UCL-D and UCL-I in terms of those selected for screening, with those selected by UCL-D but not UCL-I showing a trend towards a greater risk of death from lung cancer with longer follow-up (Fig J in S1 Appendix). In microsimulation modelling, overall outcomes differed little between a model predicting death from lung cancer compared with models predicting developing lung cancer [13]. Given this, UCL-D would be the more appropriate model to consider for implementation.
This study has several strengths. We used large prospective cohorts for model development and validation. Our external validation cohort is both temporally and geographically distinct. We used robust methods for model development and internal validation, while externally validating our models extensively using multiple approaches and in a wide range of subgroups. Further, we benchmarked our models against leading comparators. Moreover, by using few, unambiguous variables, our models could be widely applied after further validation and, where necessary, recalibration. Finally, we have made our models openly available for independent assessment.
This study has several limitations. We have used retrospective data, such that findings may differ if the models are used to prospectively determine screening eligibility. However, both the PLCOm2012 and the LLP models have been studied in prospective settings, establishing the benefits of risk-model-based over risk-factor-based screening. By benchmarking against these models, we can be confident in the performance of our models in a screening programme. To confirm the generalisability of our models, validation in datasets from beyond the US and UK will be the subject of further work. Our analyses have been performed in research cohorts rather than routinely collected electronic health records, which may better reflect the broader population. In keeping with the models used as benchmark comparators in this work, the UCL models may not perform to the same extent in routinely collected electronic health records, as smoking data are not usually present in the same depth [54]. Nevertheless, screening programmes are unlikely to rely on existing electronic health records given known challenges of missing and inaccurately coded predictors [55]. Finally, our risk models exclude never-smokers. To date, no risk model has been able to identify never-smokers with sufficient risk to meet existing criteria for lung cancer screening.
In conclusion, we used ensemble machine learning to explicitly maximise model parsimony, an approach that holds promise in multiple disease areas. Our prediction models to determine lung cancer screening eligibility require only 3 variables (age, smoking duration, and pack-years) and perform at or above parity with existing risk models in use. Further validation in alternative datasets as well as prospective implementation should be considered.

Fig 1 .
Fig 1. Developing the UCL models to determine lung cancer screening eligibility. A multicountry dataset comprising the UK Biobank and NLST was used to develop new models before external validation in the PLCO Trial chest radiography arm (allowing benchmark comparison with existing models developed in the PLCO control arm) and the full PLCO cohort (a). The ensemble modelling approach involves optimising individual modelling pipelines before combining their results as a single prediction for each individual. (b) Shows details of the UCL-D model, including the weights attributed to each pipeline in generating a single prediction for the five-year risk of lung cancer for any individual. (c) Shows the contribution of different variables to overall predictions as well as interactions between predictors, analysed using Shapley Additive Explanations (SHAP) on the UK Biobank [35]. The first subfigure in (c) shows that smoking duration was the most important variable when making predictions of an individual's risk of dying from lung cancer, followed by pack-years smoked, and finally age. The 3 subsequent dependence subplots show the relationship between the predictor (x-axis) and the outcome (y-axis): the importance of knowing that predictor value when making a prediction. The vertical dispersion shows the degree of interaction effects present, while the colour corresponds to a second variable. The plots show that smoking for less than approximately 35 years had relatively little impact on model predictions, with a steep inflection and increasing interaction between smoking duration and pack-years after this point. This relationship between smoking duration and pack-years mirrors that seen in the previous subfigure, with duration trumping quantity of cigarettes smoked unless both are high. In other words, individuals who smoke for short periods of time have a lower predicted risk, even if they smoke relatively large quantities. This reflects our understanding of lung biology and the ability of the lung to repair itself if an individual stops smoking [56]. Lastly among the subfigures of (c), we see that age has relatively limited impact on the model under the age of 60. NLST, National Lung Screening Trial; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial.
At suggested risk thresholds, the net benefit of the compared risk models other than the LLP was equivalent.

Fig 2 .
Fig 2. Calibration curves. Calibration curves showing UCL and comparator models in the UK Biobank (dark blue dashed lines) and US PLCO Cancer Screening Trial chest radiography arm (light blue line). The 45-degree lines in grey indicate perfect calibration. Curves were generated by splitting individuals into 10 risk deciles based on their predicted risk. Each curve shows the mean predicted risk against the observed risk by risk decile. Observed risk was calculated using a Kaplan-Meier estimator. The UCL models showed good calibration in external validation in the PLCO intervention arm, particularly at predicted risks between 1% and 2%, at which risk thresholds are commonly set. At these thresholds, there was modest underprediction with the LCDRAT, LCRAT, and PLCOm2012 models in the PLCO intervention arm. All models modestly overpredicted risk in the UK Biobank, with the exception of the Liverpool Lung Project (LLPv2) version 2 model, which strongly overpredicted risk. LCDRAT, Lung Cancer Death Risk Assessment Tool; LCRAT, Lung Cancer Risk Assessment Tool. UCL-D predicts lung cancer death; UCL-I predicts occurrence of lung cancer. https://doi.org/10.1371/journal.pmed.1004287.g002

Fig 3 .
Fig 3. Decision curves of selected models in the PLCO validation cohort. Net benefit across a range of thresholds of models predicting five-year risk of death from lung cancer (A) and developing lung cancer (B), compared against US Preventive Services Taskforce (USPSTF) 2021 screening eligibility criteria in the PLCO Cancer Screening Trial chest radiography arm validation dataset. The PLCOm2012 model predicts six-year risk of lung cancer. As the performance of PLCOm2012 over a five-year timeframe was similar to that over six years, for comparability, predictions over a five-year timeframe are shown here. All models studied except the Liverpool Lung Project (LLPv2) version 2 had a greater net clinical benefit than using the USPSTF-2021 criteria for screening eligibility across all risk thresholds. All other risk models had a comparable net benefit to each other. LCDRAT, Lung Cancer Death Risk Assessment Tool; LCRAT, Lung Cancer Risk Assessment Tool. UCL-D predicts lung cancer death; UCL-I predicts occurrence of lung cancer. https://doi.org/10.1371/journal.pmed.1004287.g003

Table 1 . Descriptive characteristics of the development and validation cohorts.
For UCL-I at a five-year risk threshold of 1.17%, the gain in sensitivity was 6.2% relative to the USPSTF-2021 criteria (sensitivity 83.9%; Table O in S1 Appendix).

Table 3 . Discrimination of UCL-D, Cox models, and the constrained LCDRAT, LCRAT, and PLCOm2012 models.
AUC, area under the receiver operating curve; CI, confidence interval; PLCO, Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial; LCDRAT, Lung Cancer Death Risk Assessment Tool; LCRAT, Lung Cancer Risk Assessment Tool. UCL-D, the 2 Cox models, and LCDRAT-constrained predict 5-year risk of lung cancer death; LCRAT-constrained and PLCOm2012-constrained predict 5-year risk of lung cancer occurrence. Cox models were modelled with restricted cubic splines, with and without mutual interactions between age, smoking duration, and pack-years, using the same development dataset as UCL-D. The LCDRAT- and LCRAT-constrained models use age, sex, quit-years, smoking duration, cigarettes per day, and pack-years. The PLCOm2012-constrained model uses age, smoking status, smoking duration, cigarettes per day, and quit-years. Both the LCRAT/LCDRAT and PLCOm2012 models were developed in the control arm of the PLCO trial. The relatively shallow drop-off in discrimination between the various constrained models and their full versions shows the relative importance of a few smoking parameters and validates our finding that few smoking variables drive all lung cancer models in ever-smokers. The improvement of UCL-D over Cox models using the same data and variables reflects the statistical advantages of ensemble machine learning approaches.