An approach for developing a blood-based screening panel for lung cancer based on clonal hematopoietic mutations

  • Ramu Anandakrishnan,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    ramu@vt.edu

    Affiliations Edward Via College of Osteopathic Medicine, Biomedical Sciences, Blacksburg, Virginia, United States of America, Maryland-Virginia College of Veterinary Medicine, Virginia Tech, Blacksburg, Virginia, United States of America

  • Ryan Shahidi,

    Roles Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation Edward Via College of Osteopathic Medicine, Biomedical Sciences, Blacksburg, Virginia, United States of America

  • Andrew Dai,

    Roles Investigation, Software, Writing – original draft, Writing – review & editing

    Affiliation Edward Via College of Osteopathic Medicine, Biomedical Sciences, Blacksburg, Virginia, United States of America

  • Veneeth Antony,

    Roles Investigation, Software, Writing – original draft, Writing – review & editing

    Affiliation Edward Via College of Osteopathic Medicine, Biomedical Sciences, Blacksburg, Virginia, United States of America

  • Ian J. Zyvoloski

    Roles Formal analysis, Investigation, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation University of Maryland, Baltimore, Maryland, United States of America

Abstract

Early detection can significantly reduce mortality due to lung cancer. Presented here is an approach for developing a blood-based screening panel based on clonal hematopoietic mutations. Animal model studies suggest that clonal hematopoietic mutations in tumor infiltrating immune cells can modulate cancer progression, representing potential predictive biomarkers. The goal of this study was to determine if the clonal expansion of these mutations in blood samples could predict the occurrence of lung cancer. A set of 98 potentially pathogenic clonal hematopoietic mutations in tumor infiltrating immune cells were identified using sequencing data from lung cancer samples. These mutations were used as predictors to develop a logistic regression machine learning model. The model was tested on sequencing data from a separate set of 578 lung cancer and 545 non-cancer samples from 18 different cohorts. The logistic regression model correctly classified lung cancer and non-cancer blood samples with 94.12% sensitivity (95% Confidence Interval: 92.20–96.04%) and 85.96% specificity (95% Confidence Interval: 82.98–88.95%). Our results suggest that it may be possible to develop an accurate blood-based lung cancer screening panel using this approach. Unlike most other “liquid biopsies” currently under development, the approach presented here is based on standard sequencing protocols and uses a relatively small number of rationally selected mutations as predictors.

Introduction

Cancer is a leading cause of death in the US, second only to heart disease, with lung cancer accounting for an estimated 21% of these deaths [1]. Early detection of lung cancer has been shown to reduce mortality by 20% [2]. However, the recommended screening for lung cancer, low dose computed tomography [3], is restricted to high-risk individuals, with an estimated uptake of only 17.54% even within this group due to financial and other barriers [4]. Clearly, there is a need for more accessible lung cancer screening protocols. Several blood-based screening panels (“liquid biopsies”) are currently in various stages of validation, though none have yet achieved clinical utility [5], other than as companion diagnostics [6]. These include screens for antibodies [7], circulating microRNA [8], and cell-free DNA fragments [9]. For example, Galleri, a multi-cancer blood test based on the methylation state of cell-free DNA, is currently in clinical trials even though it had a poor 51.5% sensitivity (95% confidence interval (CI): 49.6–53.3%), but a high 99.5% specificity (95% CI: 99.0–99.8%) in a clinical validation study [10].

Here we present an approach for developing a blood-based lung cancer screening panel based on a set of clonal hematopoietic (CH) mutations. Clonal hematopoiesis is the clonal expansion of a single hematopoietic stem or progenitor cell (HSPC) due to specific mutations in these cells. While mutations in tumor cells that inhibit anti-tumor immune function have been extensively studied [11, 12], the potential impact of immune cell mutations on anti-tumor activity remains relatively unexplored [13]. Recent studies have shown that CH mutations can lead to the clonal expansion of potentially pathogenic mutations in a relatively large population of tumor infiltrating immune (TII) cells, affecting their anti-tumor activity. A loss-of-function TET2 mutation in tumor infiltrating myeloid cells was shown to increase angiogenesis in a lung cancer animal model, exacerbating tumor progression [14]. An ASXL1 mutation in tumor infiltrating T-cells was shown to perturb their development and function, promoting tumor growth in syngeneic animal models [15]. On the other hand, TET2 inactivation in tumor infiltrating lymphocytes and myeloid cells has also been shown to inhibit tumor growth [16, 17], indicating a dependence on other factors. Studies have also shown that mutations affecting the expression of specific genes in immune cells can modulate their anti-tumor activity [18, 19]. Although these mutations do not directly “drive” tumor growth, they may modulate it by inhibiting or enhancing anti-tumor immune response.

Clonal hematopoiesis occurs in over 10% of adults over 70 years of age [20] and has been implicated in cardiovascular and hematologic disorders, including leukemias and lymphomas [21, 22]. Over 20 different mutations have been associated with clonal hematopoiesis, with mutations in DNMT3A, TET2 and ASXL1 accounting for over 90% of age-related cases [23]. Although the associated mechanism is poorly understood, studies suggest three possible mechanisms: increased self-renewal, increased number of self-renewal cycles to become a committed progenitor, and/or increased epigenetic or transcriptional heterogeneity leading to a highly proliferative state [24–27]. However, clonal hematopoiesis by itself may not affect immune function. Instead, secondary mutations in clonally expanding cells may be required for immune dysfunction [23, 28, 29], which could affect anti-tumor immune function.

We hypothesize that the presence of pathogenic CH mutations in TII cells could impact anti-tumor immune function and therefore, could serve as predictive markers for cancer risk. In the following, we develop a machine learning model to predict the probability of cancer based on the detection of these mutations in blood samples.

Materials and methods

Exome sequencing data and variant calling

For data from the Sequence Read Archive (SRA) [30], fastq files were aligned to the GRCh38 reference genome using the Burrows-Wheeler Aligner (BWA) v0.7.15 [31]. For data from the Genomic Data Commons (GDC), we downloaded aligned read (bam) files, which had already been aligned to GRCh38 using BWA. The SRA data were therefore processed with the same aligner and reference genome as the GDC data. Variants were then called using GATK Mutect2 v3.7.0–1 [32]. The same read alignment and variant calling protocols were used for all datasets. The Clara Parabricks v4.0.0–1 [33, 34] implementation of these programs was used for parallel execution on NVIDIA GPUs. Default values were used for all parameters, including minimum variant calling accuracy (Log Odds > 3.0). Separately, and only for identifying potentially pathogenic mutations, we downloaded Mutect2 mutation annotation format (MAF) files from the GDC, which include variant allele fraction (VAF) values for matched tumor and blood samples. The MAF files were not used to train or test the ML model.

Single-cell RNA sequencing (scRNA-seq) data and cell type analysis

Cellranger v6.0.2 [35] was used to identify genes expressed in each cell, from scRNA-seq fastq files. SCSA v1.1 [36] was used to identify cell types from k-means clustering of cells based on the genes expressed by each cell. Default values were used for all parameters.

Training, validation, and test sample sets

Ten of the 18 datasets were randomly split into Training, Validation, and Test sets using the train_test_split() function in the Python scikit-learn v1.2.2 library [37]. The other eight datasets were used as the independent test set.
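The cohort-level split can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than the scikit-learn train_test_split() function used in the study; the function names, the 25% test fraction applied per cohort, and the toy cohort names are assumptions made for the example.

```python
import random

def split_cohort(sample_ids, test_fraction=0.25, seed=0):
    """Randomly split one cohort into (non_test, test) lists of sample IDs."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def build_sets(cohorts, held_out_names, seed=0):
    """Build development and test sets from a dict of cohort name -> sample IDs.

    Cohorts listed in held_out_names go entirely to the test set; every
    other cohort contributes a random fraction of its samples to the test set.
    """
    development, test = [], []
    for offset, (name, ids) in enumerate(sorted(cohorts.items())):
        if name in held_out_names:
            test.extend(ids)  # independent test cohort: never seen in training
        else:
            kept, held = split_cohort(ids, seed=seed + offset)
            development.extend(kept)
            test.extend(held)
    return development, test

# Toy cohorts (hypothetical names and sizes).
cohorts = {
    "cohort_a": [f"a{i}" for i in range(40)],
    "cohort_b": [f"b{i}" for i in range(20)],
    "cohort_c": [f"c{i}" for i in range(10)],
}
development, test = build_sets(cohorts, held_out_names={"cohort_c"})
```

The development set would then be split once more into Training and Validation sets in the same way.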

Machine learning (ML) algorithms

The ML algorithms implemented in the scikit-learn v1.2.2 library were used to build the ML models [37]. See S3 Table in S1 File for hyperparameter values.

Statistical analysis

The binomial proportion confidence interval (CI) with a normal approximation was used to calculate the CI for accuracy, using the proportion_confint() function in the Python statsmodels v0.13.15 library [38]. Standard errors for Logistic Regression (LR) coefficients were estimated using the bootstrap method, with 1000 iterations of sampling with replacement [39]. The estimated standard error was used to calculate the Z-statistic, CI, and p-value for the LR coefficients.
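The three calculations described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the statsmodels implementation used in the study: the function names are hypothetical, and the count of 544 correctly classified samples out of 578 is back-calculated from the reported 94.12% sensitivity rather than taken from the paper.

```python
import math
import random
from statistics import NormalDist, stdev

def normal_approx_ci(successes, n, confidence=0.95):
    """Binomial proportion CI using the normal approximation."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def bootstrap_se(data, statistic, n_iter=1000, seed=0):
    """Standard error of a statistic from resampling with replacement."""
    rng = random.Random(seed)
    estimates = [statistic(rng.choices(data, k=len(data))) for _ in range(n_iter)]
    return stdev(estimates)

def wald_summary(estimate, se, confidence=0.95):
    """Z-statistic, CI, and two-sided p-value from an estimate and its SE."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    z = estimate / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, (estimate - z_crit * se, estimate + z_crit * se), p_value

# Assumed count: 544 of 578 is about 94.12%.
low, high = normal_approx_ci(544, 578)
print(f"{low:.4f} to {high:.4f}")
```

With these assumed counts the interval is close to the 92.20–96.04% reported for test sensitivity. In the study, the bootstrapped statistic would be an LR coefficient obtained by refitting the model on each resample; here bootstrap_se() accepts any function of the resampled data.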

Results

In this study we first identified a set of 98 potentially pathogenic (non-passenger) CH mutations in TII cells in lung cancer tissue samples, following the approach in Ref [40] and detailed below. The variant allele fraction (VAF) for these mutations in blood samples was then used as the feature set (predictors) for developing a machine learning model for differentiating between lung cancer and non-cancer cases. A logistic regression model based on these mutations in blood samples was able to differentiate between a test set of lung cancer and non-cancer samples with 90.26% accuracy (95% CI: 88.50–92.01%).

Potentially pathogenic CH mutations in TII cells

Our approach for identifying potentially pathogenic CH mutations in TII cells consisted of three stages (Fig 1a). First, we identified protein altering variants in the TCGA dataset for lung adenocarcinoma (LUAD) [41]. These variants consisted of missense, nonsense, indel, and splice site mutations (Fig 1b). Other mutations, such as intergenic, intronic and synonymous mutations are less likely to be pathogenic and were therefore excluded. Of the 2.5 million distinct variants in the 569 TCGA-LUAD samples, 428,015 were protein-altering (Fig 1a).

Fig 1. Identifying potentially pathogenic clonal hematopoietic (CH) mutations in tumor infiltrating immune (TII) cells.

(a) Mutations were limited to protein-altering variants that were clonally expanded in TII cells and were considered to be potentially pathogenic non-passenger mutations. (b) Protein-altering mutations excluded intergenic, synonymous, and intronic variants. (c)-(e) Example of criteria used to select the S1506L variant in VWF. (c) CH mutations in TII cells were identified by variant allele fraction (VAF) in tumor and normal blood samples. (d) One of the four criteria for identifying potentially pathogenic mutations was that the mutation occurred in > 5% of tumor samples, but rarely (< 0.01%) in the genome aggregation database (gnomAD). (e) Another criterion used to identify potentially pathogenic mutations in TII cells was that the gene was expressed in tumor infiltrating immune cells based on lung cancer single-cell RNA sequencing data.

https://doi.org/10.1371/journal.pone.0307232.g001

From the above set of protein-altering mutations, we selected CH mutations in TII cells based on variant allele fraction (VAF) in matched tumor and normal blood samples. Clonal hematopoiesis is conventionally defined as somatic mutations with a VAF > 2% [20]. This lower limit excludes mutations in circulating tumor cells or cell-free DNA which have a VAF < 1% [42, 43]. To exclude potential germline mutations, we selected variants with a VAF < 25%. To identify mutations in TII cells, we further selected variants that occurred in both tumor and matched blood samples. It is highly improbable that the same somatic variant would originate in tumor and blood cells simultaneously. It is more likely that these mutations originated in blood cells and subsequently infiltrated the tumor microenvironment, leading to their detection in tumor infiltrating blood cells. For example, the S1506L mutation in the gene for von Willebrand factor (VWF) occurs with a median VAF of 7.23% in TCGA-LUAD tumor samples containing the mutation, and in 4.90% of matched blood samples (Fig 1c). Of the 428,015 protein-altering variants, 36,293 were CH mutations in TII cells (Fig 1a).
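A minimal sketch of this VAF-based selection is shown below. The function name is hypothetical, and applying the 2% to 25% window to both the tumor and the matched blood VAF is an assumption, since the text does not state which of the two the thresholds were applied to. The example values treat the 7.23% and 4.90% quoted for VWF S1506L as tumor and blood VAFs, which is also an assumption.

```python
def is_ch_tii_candidate(tumor_vaf, blood_vaf, lower=0.02, upper=0.25):
    """Return True if a variant looks like a CH mutation in TII cells.

    VAFs are fractions (0.05 means 5%); None means the variant was not
    detected in that sample. Assumed rule: the variant must be present in
    both tumor and matched blood with lower < VAF < upper in each.
    """
    if tumor_vaf is None or blood_vaf is None:
        return False  # must occur in both tumor and matched blood
    return all(lower < vaf < upper for vaf in (tumor_vaf, blood_vaf))

print(is_ch_tii_candidate(0.0723, 0.0490))  # VWF S1506L-like values
print(is_ch_tii_candidate(0.10, 0.48))      # blood VAF in the germline range
print(is_ch_tii_candidate(0.005, 0.05))     # tumor VAF below the CH threshold
```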

From the above set of CH mutations in TII cells, we selected potentially pathogenic mutations based on the following three criteria. First, the mutation occurred frequently in tumor samples (> 5% of TCGA-LUAD samples), but rarely in large population-based studies (< 0.01% of sequences in the genome aggregation database (gnomAD)), suggesting that the genomic location is highly conserved in the germline, but somatic mutations may be correlated with tumor progression. For example, the S1506L variant in VWF occurred in 6.85% of the 569 TCGA-LUAD samples (Fig 1d), whereas it was not detected in any of the gnomAD germline sequences (S2 Table in S1 File). Second, the gene was expressed in TII cells based on the analysis of single-cell RNA-sequencing data (Methods) for 87,743 cells from 21 lung tumor samples [44–46]. Genes not expressed in TII cells may not be functional in this context and were therefore excluded. For example, VWF was expressed in lung tumor infiltrating B, T, NKT, and monocyte/macrophage cells (Fig 1e). Lastly, the mutation was predicted to be damaging by two mutation-significance prediction tools, SIFT [47] and PolyPhen2 [48] (S2 Table in S1 File). Of the 36,293 protein-altering CH mutations in TII cells, 98 were potentially pathogenic mutations based on the above criteria (Fig 1a). The 98 potentially pathogenic CH mutations in TII cells are listed in S2 Table in S1 File, along with values for the criteria used to identify them. Although the selected mutations are unlikely to be passenger mutations, they are also not “driver” mutations, in the sense that they do not directly cause tumor cell growth. However, by potentially affecting immune cell function, they may modulate tumor growth and may therefore represent potential predictors of cancer risk.
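The selection criteria can be expressed as a simple filter. The sketch below is illustrative only: the Variant record and its field names are hypothetical, requiring both SIFT and PolyPhen2 to call the variant damaging is an assumption about how the two tools were combined, and the count of 39 tumor samples is back-calculated from the reported 6.85% of 569 samples.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    tumor_samples_with_variant: int
    total_tumor_samples: int
    gnomad_frequency: float       # fraction of gnomAD sequences with the variant
    expressed_in_tii_cells: bool  # from the scRNA-seq cell type analysis
    sift_damaging: bool
    polyphen2_damaging: bool

def is_potentially_pathogenic(v, min_tumor_fraction=0.05, max_gnomad_frequency=0.0001):
    """Apply the frequency, expression, and predicted-damage criteria."""
    tumor_fraction = v.tumor_samples_with_variant / v.total_tumor_samples
    return (
        tumor_fraction > min_tumor_fraction              # frequent in tumors (> 5%)
        and v.gnomad_frequency < max_gnomad_frequency    # rare in gnomAD (< 0.01%)
        and v.expressed_in_tii_cells                     # gene expressed in TII cells
        and v.sift_damaging and v.polyphen2_damaging     # assumed: both tools agree
    )

# Illustrative record modelled on the VWF S1506L example.
vwf = Variant("VWF S1506L", 39, 569, 0.0, True, True, True)
common = Variant("hypothetical common variant", 60, 569, 0.01, True, True, True)
print(is_potentially_pathogenic(vwf), is_potentially_pathogenic(common))
```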

Machine learning (ML) study design

To develop an ML model with realistic prediction potential using available sequencing data, a key consideration was mitigating potential confounding or batch effects. In general, large-scale cancer DNA sequencing studies do not include non-cancer controls [41], and large-scale non-cancer studies do not include cancer cases [49]. Although we used the same protocol for aligning sequencing reads to the reference genome and for variant calling, these studies may use different protocols for cohort selection and sequencing. Therefore, ML models may classify cases based on differences in mutations detected that are artifacts of these protocol differences rather than characteristics related to the presence of cancer. To mitigate such confounding (batch) effects, we used data from multiple different studies. We also evaluated alternative models using the average batch accuracy to avoid bias toward protocols used by the larger studies. For example, two of the non-cancer cohorts, prjna532465 and prjna790003, had average ages of 82 and 77 years, respectively, significantly higher than the 63 years for the lung cancer samples (Table 1). However, by including the non-cancer cohorts, prjna421434 and prjna342304, with average ages of 42 and 39 years, respectively, the overall average age for the non-cancer samples was 64 years, comparable to the average age for lung cancer samples, to the extent the data is available. As noted in the Discussion section below, these strategies can mitigate the confounding effects but cannot fully eliminate them.
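The difference between average (unweighted) batch accuracy and pooled accuracy can be illustrated with a toy example. The function names and the two toy batches are assumptions made for the sketch; the point is only that a small, poorly predicted batch lowers the batch average much more than it lowers the pooled figure.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_batch_accuracy(batches):
    """Unweighted mean of per-batch accuracies; each batch counts equally."""
    per_batch = [accuracy(y_true, y_pred) for y_true, y_pred in batches.values()]
    return sum(per_batch) / len(per_batch)

def pooled_accuracy(batches):
    """Accuracy over all samples combined; large batches dominate."""
    correct = sum(
        sum(t == p for t, p in zip(y_true, y_pred))
        for y_true, y_pred in batches.values()
    )
    total = sum(len(y_true) for y_true, _ in batches.values())
    return correct / total

# Toy data: a large batch predicted well and a small batch predicted poorly.
batches = {
    "large_batch": ([1] * 100, [1] * 95 + [0] * 5),  # 95% accurate
    "small_batch": ([0] * 10, [0] * 5 + [1] * 5),    # 50% accurate
}
print(average_batch_accuracy(batches), pooled_accuracy(batches))
```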

Table 1. Datasets used for model training, validation, and testing.

https://doi.org/10.1371/journal.pone.0307232.t001

Whole exome sequencing data for blood samples from 18 different cohorts were identified for this study (Fig 2, Table 1). These included a total of 1,992 lung cancer and non-cancer blood samples. On average, for the cohorts for which demographic data was available, the age and gender composition of the lung cancer and non-cancer cohorts were comparable: 63 years and 56% male for the lung cancer cases, and 60 years and 51% male for the non-cancer cases (Table 1). Although the racial compositions were significantly different, with 71% white in the lung cancer cohorts and 42% white in the non-cancer cohorts, racial information was not available for nine of the 18 cohorts and two of the non-cancer cohorts targeted Asian populations. Although the percentage of current or past smokers in lung cancer cases (63%) was comparable to non-cancer cases (66%), smoking status was only available for two of the nine non-cancer cohorts (Table 1). The proportion of samples by stage also varied by cohort, with batch averages of 24.18% (0–54.35%) for stage I, 21.91% (0–40.74%) for stage II, 21.32% (0–64.71%) for stage III, 20.60% (0–100%) for stage IV, and 11.93% (0–46.15%) for pre-cancerous cases (S1 Table in S1 File). Five of the nine lung cancer cohorts were all treatment naïve and two of the cohorts were cases with previous treatments (S1 Table in S1 File). Methods for accounting for batch effects are an active area of research [50]; however, it is unclear how effective these methods would be for somatic mutation data, or if they would result in the loss of critical information.

Fig 2. Machine learning study design.

Sequencing data from multiple different cohorts were used to mitigate batch-effect artifacts due to differences in cohort selection and sequencing protocols. Training and Validation set: 75% of the sequencing data from 10 of 18 different cohorts were used to train the machine learning models and identify a model that had the highest average batch accuracy. Test set: All data for 8 of the 18 cohorts, in addition to 25% of the data from the other 10 cohorts.

https://doi.org/10.1371/journal.pone.0307232.g002

All data from eight of the 18 cohorts were set aside as an Independent Test set, representing a total of 830 samples (Fig 2). Since there is no single large cohort containing both lung cancer and non-cancer sequencing data, the independent test set represents the best possible test of the predictive potential of the model developed below, short of a blinded clinical case-control study where we can proactively control all possible confounding variables. In addition, 25% of the data (293 samples) from each of the other 10 cohorts were also set aside for testing (Fig 2, Table 1). None of the samples in the Test set were used to develop the model. The remaining 75% of the data (869 samples) were further subdivided into a Training set (648 samples) and a Validation set (221 samples). Studies have shown that this sample size can be sufficient for >90% concordance in prediction accuracy, indicating low risk of overfitting [68–70]. The Training and Validation data sets were used to evaluate six commonly used ML algorithms: Light Gradient Boosted Trees, Random Forest, Support Vector Machine, Neural Network, k Nearest Neighbors, and Logistic Regression classifiers. Combinations of model hyperparameters were evaluated for each of these algorithms using the Training and Validation sets. In addition to model hyperparameters, we also evaluated the effect of dichotomizing the VAF based on cutoff values ranging from 0.00 to 0.30. Each of the resulting 135,811 models was evaluated using the Validation set (Fig 2). Average unweighted batch accuracy was then used to identify the best model for each of the six ML algorithms. Average batch accuracies for all six models were comparable, ranging from 91.93% (66.67–100.00%) for Logistic Regression to 96.66% (83.33–100.00%) for Light Gradient Boosted Trees (S3 Table in S1 File, Fig 3a).

Fig 3. Selecting an optimal machine learning classifier.

(a) Batch accuracy for the different machine learning algorithms, with model hyperparameters that produced the highest average batch accuracy for the validation set. Error bars show the range of values, orange lines the median and green dashed lines the mean. (b) A Logistic Regression (LR) model was selected for its interpretability of feature coefficients. Error bars show 95% confidence intervals. (c) Receiver Operating Characteristic (ROC) curve for the LR model showing 0.9472 Area Under the Curve (AUC). The red line represents a random prediction. (d) 91.66% of lung cancer and 83.20% non-cancer samples had predicted probabilities of >0.9 and <0.1 for cancer, respectively. Predicted probability of >0.5 (red line) predicts cancer and <0.5 no cancer.

https://doi.org/10.1371/journal.pone.0307232.g003

Logistic regression (LR) model

Since average batch accuracies for all six ML models were comparable (91.93–96.66%), we selected the LR model for the direct real-world interpretability of its parameters, specifically the log odds-ratio (LOR) of model predictions and feature weights. The LR model can be formulated as

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \tag{1}$$

where p is the predicted probability of cancer, p/(1 − p) is the odds-ratio for cancer prediction, β0 is the intercept, βi are the weights for each of the features (predictors) xi, and n is the number of features. xi = 1 or 0, corresponding to the presence or absence of the mutation with VAF > 0.01. All further testing was performed using the LR model. The weights βi for each of the predictors (mutations) can be interpreted as the LOR for the probability of cancer when mutation = 1 versus 0, controlling for all other predictors.
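As a worked illustration of Eq 1, the sketch below converts VAF values into binary predictors and then into a predicted probability. The function names, intercept, and weights are hypothetical and are not the fitted coefficients from S4 Table; only the 0.01 VAF cutoff comes from the text.

```python
import math

def dichotomize(vafs, cutoff=0.01):
    """x_i = 1 if the mutation is present with VAF > cutoff, else 0."""
    return [1 if vaf > cutoff else 0 for vaf in vafs]

def log_odds(x, intercept, weights):
    """Right-hand side of Eq 1: beta_0 + sum_i beta_i * x_i."""
    return intercept + sum(beta * xi for beta, xi in zip(weights, x))

def predict_probability(x, intercept, weights):
    """Invert Eq 1: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, intercept, weights)))

# Hypothetical three-mutation model.
intercept = -1.0
weights = [2.0, -3.0, 0.5]
x = dichotomize([0.03, 0.0, 0.02])  # mutations 1 and 3 present
p = predict_probability(x, intercept, weights)
print(x, round(p, 3))
```

In this toy case the log odds are −1.0 + 2.0 + 0.5 = 1.5, so the predicted probability is above the 0.5 cut-point. A mutation that is absent (xi = 0) contributes nothing to the sum, and a mutation that is present contributes exactly βi.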

For the selected LR model, VAF was dichotomized as mutation = 1 for VAF>0.01 and 0 otherwise, class_weight = {0:1, 1:4}, solver = ‘saga’, and default values for all other hyperparameters (S3 Table in S1 File). The accuracy of the LR predictions for each of the ten datasets in the Validation set ranged from a mean of 66.66% (95% CI: 13.32–100%) to 100% (Fig 3b). The overall accuracy for the combined Validation set was 94.57% (95% CI: 91.58–97.56%). For the Validation set, the receiver operating characteristic (ROC) curve had an area under the curve (AUC) of 0.9472, suggesting strong diagnostic ability (Fig 3c). With a cut-point of 0.5, sensitivity and specificity were 97.50% (95% CI: 94.71–100.00%) and 89.62% (95% CI: 83.82–95.43%), respectively. More importantly, 91.66% of the cancer cases were unambiguously predicted with probability of cancer > 0.90 and 83.20% of the non-cancer cases were predicted with probability of cancer < 0.10 (Fig 3d).
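The sensitivity, specificity, and "unambiguous prediction" fractions reported above can be computed from predicted probabilities as in the following sketch. The function names and the toy labels and probabilities are assumptions for illustration; the 0.5 cut-point and the 0.90 and 0.10 thresholds are taken from the text.

```python
def sensitivity_specificity(y_true, probabilities, cut_point=0.5):
    """Sensitivity and specificity when probability > cut_point predicts cancer."""
    tp = fn = tn = fp = 0
    for truth, prob in zip(y_true, probabilities):
        predicted = 1 if prob > cut_point else 0
        if truth == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    return tp / (tp + fn), tn / (tn + fp)

def confident_fractions(y_true, probabilities, high=0.90, low=0.10):
    """Fraction of cancer cases with p > high and of non-cancer cases with p < low."""
    cancer = [p for t, p in zip(y_true, probabilities) if t == 1]
    non_cancer = [p for t, p in zip(y_true, probabilities) if t == 0]
    return (
        sum(p > high for p in cancer) / len(cancer),
        sum(p < low for p in non_cancer) / len(non_cancer),
    )

# Toy data: 1 = lung cancer, 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.95, 0.99, 0.60, 0.30, 0.05, 0.02, 0.40, 0.70]
print(sensitivity_specificity(y_true, probabilities))
print(confident_fractions(y_true, probabilities))
```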

The LR model was refit to the combined Training and Validation set, to incorporate all available non-test data. The resulting model was then used for testing. For 87 of the 98 mutations the LOR βi ≠ 0 (Fig 4b), with the other 11 mutations (βi = 0) not contributing to the LR model prediction. Of the 87 mutations, 47 had βi > 0, suggesting that these mutations may increase the risk of cancer, while 40 had βi < 0, suggesting a reduced risk, controlling for all other mutations. Nine of the 40 mutations with βi < 0 occurred more frequently in non-cancer samples than in lung cancer samples, and 23 of the 47 mutations with βi > 0 occurred more frequently in lung cancer samples (Fig 4a, S4 Table in S1 File). The LOR for 26 of the 98 mutations had p-value < 0.05 (Fig 4b, S4 Table in S1 File). Although the other 72 mutations individually have poor predictive power (p-value > 0.05), their combined effect can be highly predictive.

Fig 4. Mutations with p-value < 0.05 for log odds ratio βi.

(a) Fraction of lung cancer and non-cancer samples with variant allele fraction (VAF) > 0.01. (b) Log odds ratio and 95% confidence interval.

https://doi.org/10.1371/journal.pone.0307232.g004

Test results

The Test set consisted of 25% of the data set aside from each of the ten studies used for Training and Validation (127 lung cancer and 166 non-cancer blood samples), and all the data from eight additional studies (451 lung cancer and 379 non-cancer blood samples) (Fig 2, Table 1). The average batch accuracy across all 18 datasets for the LR model developed above was 85.07% (27.27–100.00%) (Fig 5a). One of the 18 datasets, prjeb47088, had a very low accuracy, 27.27% (95% CI: 8.66–45.88%), due to a low read depth of 36 reads at the exome loci containing the 98 mutations, compared to the average read depth of 197 across all cohorts (Table 1). With such a low read depth, mutations with low VAF (<5%) cannot be reliably detected [71], which can affect predictive accuracy. This finding highlights the potential confounding (batch) effect associated with sequencing depth. However, the relatively small number of samples (22) in this dataset, accounting for only 1.96% of the 1,123 test samples (Table 1), did not significantly affect the overall accuracy of the model. The overall accuracy for the combined Test set was 90.25% (95% CI: 88.50–92.01%) (Fig 5a), comparable to the 94.57% accuracy for the combined Validation set (Fig 3b). In addition, the overall accuracy for the eight independent test cohorts alone was 89.88% (95% CI: 87.83–91.93%), excluding the 10 datasets where 25% of the data was set aside for testing. This was comparable to the 91.42% accuracy (95% CI: 88.06–94.77%) for the test data from the other ten cohorts where 25% of the data was set aside for testing, suggesting that there was little if any confounding (batch) effect. The ROC curve had an AUC of 0.9036, suggesting strong diagnostic ability (Fig 5b). With a cut-point of 0.5, the sensitivity and specificity were 94.12% (95% CI: 92.20–96.04%) and 85.96% (95% CI: 82.98–88.95%), respectively. More importantly, 90.48% of the cancer cases were unambiguously predicted with probability of cancer > 0.90 and 74.86% of the non-cancer cases with probability of cancer < 0.10 (Fig 5c).
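The effect of read depth on detecting low-VAF mutations can be illustrated with a simple binomial model. This is a back-of-the-envelope sketch, not a description of how Mutect2 calls variants: the function name, the requirement of at least three supporting reads, and the omission of sequencing error are all assumptions. The depths of 36 and 197 reads are the values quoted above.

```python
from math import comb

def detection_probability(depth, vaf, min_alt_reads=3):
    """P(at least min_alt_reads variant reads) when each of `depth` reads
    independently carries the variant with probability `vaf`."""
    p_too_few = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few

for depth in (36, 197):
    print(depth, round(detection_probability(depth, vaf=0.05), 3))
```

Under these assumptions a mutation at 5% VAF is detected in well under half of samples at a depth of 36, but almost always at a depth of 197, which is consistent with the explanation offered for the low accuracy of the prjeb47088 cohort.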

Fig 5. Predictive accuracy for test datasets.

(a) Batch and Combined accuracy for the LR model. Error bars show 95% CI. (b) ROC curve for the LR model showing 0.9036 AUC. The red line represents a random prediction. (c) 90.48% of lung cancer and 74.86% of non-cancer samples had predicted probabilities of >0.9 and <0.1 for cancer, respectively. Predicted probability of >0.5 (red line) predicts cancer and <0.5 no cancer.

https://doi.org/10.1371/journal.pone.0307232.g005

The contribution of each of the 98 mutations to the LOR of predicted probability of cancer for each sample is either 0 or βi (Eq 1). This contribution varies considerably from sample to sample (S1 Fig in S1 File). For some mutations the contribution is clear. The S177A mutation in SEC61A2 with the most negative βi = -10.45 occurred in 55.96% of non-cancer samples but in only 5.61% of lung cancer samples (S1 Fig in S1 File). On the other hand, the T165P mutation in PRAMEF20 with the largest βi = +5.95 is predictive of cancer, controlling for all other mutations. However, counter-intuitively, the mutation occurred in 57.43% of non-cancer samples and in only 28.05% of lung cancer samples. These results highlight the complex combination of weights and mutations in the LR model contributing to the predicted probability of cancer.

Discussion

Differences in cohort selection and sequencing protocols, between lung cancer and non-cancer samples, can result in different variant profiles between the two sample sets. As a result, the Training and Validation stages may show artificially high accuracy but fail to accurately predict the probability on other independent datasets where the protocols may differ. To mitigate this confounding or batch effect we used datasets from ten different studies for training and validation, and eighteen different studies for testing. In addition, we used the average batch accuracy for the training and validation metric to mitigate the bias toward protocols for batches with larger number of samples. The comparable overall accuracy for the eight independent test cohorts alone (89.88%), excluding the datasets where 25% of the data was set aside for testing, and for the test data from other ten cohorts where 25% of the data was set aside for testing (91.42%), suggests that there was little if any confounding (batch) effect. Despite high test accuracy (Fig 5), these mitigation steps may not fully eliminate the batch-effect and the LR model developed here may not represent an optimal blood-based lung cancer screening panel. However, the LR model can serve as an effective starting point for developing an accurate lung cancer screening panel. To do so, a case-control study using a standardized protocol could be used to validate and optimize the LR model. The case-control study should also be more racially balanced than the studies used here, so that the screening panel is not racially biased.

The 98 potentially pathogenic mutations were selected because the associated gene sequences were highly conserved, the mutations were predicted to be damaging to gene function, and they could affect immune cell activity. Therefore, we expected most of the mutations to inhibit an effective anti-tumor immune response, thus promoting tumor growth. However, many (40) of the 98 mutations had LOR < 0, suggesting that these mutations may be predictive of non-cancer cases. Chronic inflammation, triggered by tobacco smoke and other carcinogens, has been shown to promote tumorigenesis [72, 73]. We speculate that these mutations may inhibit such chronic inflammatory responses, possibly inhibiting tumorigenesis.

It is important to note that the presence of some CH mutations in peripheral blood may be due to the presence of cancer. Cancer is known to alter the peripheral blood immune cell composition, by triggering the expansion of specific immune cell subtypes [74]. These immune cell subtypes may harbor different mutations, which would then be more abundant in the altered peripheral blood due to clonal hematopoiesis (CH). The CH mutations identified by our algorithm may indeed represent the result of CH expansion of specific mutations in specific immune cell subtypes, rather than the general presence of these mutations in hematopoietic stem and progenitor cells. Further investigation will be required to clarify the relationship.

In addition to providing a starting point for developing a cancer screening panel, this study identified a set of potential treatment targets. The mutations selected as predictors for the LR model were based on their potentially pathogenic role in TII cells. The selection criteria are detailed in the Results section under “Potentially pathogenic CH mutations in TII cells”. The values for the selection criteria for each mutation are listed in S2 Table in S1 File. Although the selection criteria suggest that these mutations may be pathogenic, we do not claim that they necessarily are. While these mutations may not directly cause tumor growth, they may modulate it. In particular, the set of mutations with a p-value < 0.05 for the LOR (Fig 4) should be considered for further investigation. For example, the P453Q variant in MRC1 had a LOR of -7.47 (95% CI: -10.44 to -2.52), suggesting an inhibitory effect against lung cancer. MRC1 codes for a membrane receptor protein that mediates macrophage endocytosis of glycoproteins [75]. MRC1 expression has been associated with an inflammatory macrophage phenotype [76]. We speculate that the mutation inhibits the expression of MRC1, possibly reducing chronic inflammation and inhibiting tumorigenesis. In contrast, the T170P variant in RXRA had a LOR of +4.60 (95% CI: 2.69–6.52), suggesting a facilitative effect. RXRA encodes a nuclear receptor that acts as a transcription factor promoting target gene expression. RXRA is known to inhibit non-small cell lung cancer cell growth [77] and is a therapeutic target for lung cancer [78]. The 26 mutations with significant LOR (p-value < 0.05, Fig 4, S4 Table in S1 File) may represent potential immunotherapy targets for lung cancer.

The samples used in this study did not include other cancer types, so the LR model was not trained to distinguish lung cancer from them. A lung cancer screening panel based on the LR model may therefore detect other cancer types as well. In fact, 36 of the 98 potentially pathogenic CH mutations in TII cells for lung cancer overlap with those identified for breast cancer in a previous study [40]. These mutations could result in a positive prediction in the presence of breast cancer. The approach presented here should be extended to develop a pan-cancer screening panel that can differentiate between cancer types.

Conclusions

Early detection can significantly reduce the mortality rate due to lung cancer, yet the uptake of the currently approved low-dose computed tomography scan for lung cancer is limited. Here we present an approach for developing a blood-based screening panel that utilizes a set of potentially pathogenic clonal hematopoietic (CH) mutations detected in tumor infiltrating immune (TII) cells. While the effect of CH mutations in TII cells on solid tumor progression remains relatively unexplored, recent animal model studies suggest a potential role. We developed a logistic regression (LR) machine learning model for predicting the probability of lung cancer based on a set of 98 CH mutations in blood samples. The LR model achieved an accuracy of 90.25% (95% CI: 88.50–92.01%) on a separate Test set. To mitigate batch effects arising from differences in cohort selection and sequencing protocols, we used sequencing data from 18 different studies, although this may not fully eliminate such effects. A case-control study with standardized protocols could be used to validate and refine the LR model and develop an accurate blood-based lung cancer screening panel.
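Confidence intervals of the kind quoted for the classification metrics can be approximated with a normal-approximation (Wald) interval for a binomial proportion. The Python sketch below is illustrative only. The interval method is an assumption, as is treating the 578 lung cancer samples mentioned in the abstract as the denominator for the reported sensitivity of 94.12%. The helper name `wald_proportion_ci` is hypothetical.

```python
import math


def wald_proportion_ci(p_hat, n, z=1.959964):
    """Normal-approximation (Wald) 95% CI for a proportion p_hat observed on n samples."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because the Wald interval can overshoot near the boundaries
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Sensitivity from the abstract, with the assumed denominator of 578 cancer samples
low, high = wald_proportion_ci(0.9412, 578)
print(f"Sensitivity 95% CI: {100 * low:.2f}% to {100 * high:.2f}%")
```

Under these assumptions the sketch gives an interval of about 92.20% to 96.04%, in line with the interval stated in the abstract. For proportions close to 0 or 1, or for small samples, a Wilson or exact interval would generally be a safer choice.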

Supporting information

S1 File. The following tables and figures are included in the supporting information file.

S1 Table. Cohort characteristics. S2 Table. Potentially pathogenic CH mutations in TII. S3 Table. Best ML model hyperparameters and batch accuracy. S4 Table. LR model coefficients. S1 Fig. Average contribution of individual mutations to the prediction of cancer probability for cancer and control samples.

https://doi.org/10.1371/journal.pone.0307232.s001

(ZIP)

Acknowledgments

We thank Drs. Robin T. Varghese and Harold Garner, VCOM, for their comments and suggestions.

References

  1. Siegel RL, Miller KD, Wagle NS, Jemal A (2023) Cancer statistics, 2023. CA Cancer J Clin 73:17–48 pmid:36633525
  2. de Koning HJ, van der Aalst CM, de Jong PA, et al (2020) Reduced lung-cancer mortality with volume CT screening in a randomized trial. N Engl J Med 382:503–513 pmid:31995683
  3. Mazzone PJ, Silvestri GA, Patel S, Kanne JP, Kinsinger LS, Wiener RS, et al (2018) Screening for Lung Cancer: CHEST Guideline and Expert Panel Report. Chest 153:954–985 pmid:29374513
  4. Zgodic A, Zahnd WE, Advani S, Eberth JM (2022) Low-dose CT lung cancer screening uptake: A rural–urban comparison. J Rural Health 38:40–53 pmid:33734492
  5. Seijo LM, Peled N, Ajona D, et al (2019) Biomarkers in Lung Cancer Screening: Achievements, Promises, and Challenges. Journal of Thoracic Oncology 14:343–357 pmid:30529598
  6. Lin AA, Nimgaonkar V, Issadore D, Carpenter EL (2022) Extracellular Vesicle–Based Multianalyte Liquid Biopsy as a Diagnostic for Cancer. Annu Rev Biomed Data Sci 5:269–292 pmid:35562850
  7. Boyle P, Chapman CJ, Holdenrieder S, et al (2011) Clinical validation of an autoantibody test for lung cancer. Ann Oncol 22:383–389 pmid:20675559
  8. Montani F, Marzi MJ, Dezi F, et al (2015) miR-Test: A Blood Test for Lung Cancer Early Detection. J Natl Cancer I 107:63 pmid:25794889
  9. Mathios D, Johansen JS, Cristiano S, et al (2021) Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 12:1–14
  10. Klein EA, Richards D, Cohn A, et al (2021) Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Annals of Oncology 32:1167–1177 pmid:34176681
  11. Finn OJ (2012) Immuno-oncology: understanding the function and dysfunction of the immune system in cancer. Annals of Oncology 23:viii6–viii9 pmid:22918931
  12. Allen BM, Hiam KJ, Burnett CE, Venida A, DeBarge R, Tenvooren I, et al (2020) Systemic dysfunction and plasticity of the immune macroenvironment in cancer models. Nat Med 26:1125–1134 pmid:32451499
  13. Asada S, Kitamura T (2021) Clonal hematopoiesis and associated diseases: A review of recent findings. Cancer Sci 112:3962–3971 pmid:34328684
  14. Nguyen YTM, Fujisawa M, Nguyen TB, et al (2021) Tet2 deficiency in immune cells exacerbates tumor progression by increasing angiogenesis in a lung cancer model. Cancer Sci 112:4931–4943 pmid:34657351
  15. Liu X, Sato N, Shimosato Y, et al (2022) CHIP‐associated mutant ASXL1 in blood cells promotes solid tumor progression. Cancer Sci 113:1182 pmid:35133065
  16. Lee M, Li J, Li J, et al (2021) Tet2 inactivation enhances the antitumor activity of tumor-infiltrating lymphocytes. Cancer Res 81:1965–1976 pmid:33589517
  17. Kleppe M, Comen E, Wen HY, et al (2015) Somatic mutations in leukocytes infiltrating primary breast cancers. NPJ Breast Cancer 1:1–6 pmid:28721364
  18. Jiang A, Qin Y, Springer TA (2022) Loss of LRRC33-dependent TGFβ1 activation enhances anti-tumor immunity and checkpoint blockade therapy. Cancer Immunol Res 10:453–467
  19. Han S, Liu ZQ, Chung DC, St Paul M, Garcia-Batres CR, Sayad A, et al (2022) Overproduction of IFNγ by Cbl-b-Deficient CD8+ T Cells Provides Resistance against Regulatory T Cells and Induces Potent Antitumor Immunity. Cancer Immunol Res 10:437–452
  20. Zink F, Stacey SN, Norddahl GL, et al (2017) Clonal hematopoiesis, with and without candidate driver mutations, is common in the elderly. Blood 130:742–752 pmid:28483762
  21. Fidler TP, Xue C, Yalcinkaya M, et al (2021) The AIM2 inflammasome exacerbates atherosclerosis in clonal haematopoiesis. Nature 592:296–301 pmid:33731931
  22. Bowman RL, Busque L, Levine RL (2018) Clonal Hematopoiesis and Evolution to Hematopoietic Malignancies. Cell Stem Cell 22:157–170 pmid:29395053
  23. Lin AE, Rauch PJ, Jaiswal S, Ebert BL (2022) Clonal Hematopoiesis: Confluence of Malignant and Nonmalignant Diseases. Annual Reviews of Cancer Biology 6:187–200
  24. Jan M, Ebert BL, Jaiswal S (2017) Clonal hematopoiesis. Semin Hematol 54:43–50 pmid:28088988
  25. Nam AS, Dusaj N, Izzo F, et al (2022) Single-cell multi-omics of human clonal hematopoiesis reveals that DNMT3A R882 mutations perturb early progenitor states through selective hypomethylation. Nat Genet 54:1514–1526 pmid:36138229
  26. Challen GA, Goodell MA (2020) Clonal hematopoiesis: mechanisms driving dominance of stem cell clones. Blood 136:1590–1598 pmid:32746453
  27. Sanmiguel JM, Eudy E, Loberg MA, Young KA, Mistry JJ, Mujica KD, et al (2022) Distinct Tumor Necrosis Factor Alpha Receptors Dictate Stem Cell Fitness versus Lineage Output in Dnmt3a-Mutant Clonal Hematopoiesis. Cancer Discov 12:2763–2773 pmid:36169447
  28. Steensma DP, Bejar R, Jaiswal S, Lindsley RC, Sekeres MA, Hasserjian RP, et al (2015) Clonal hematopoiesis of indeterminate potential and its distinction from myelodysplastic syndromes. Blood 126:9–16 pmid:25931582
  29. Steensma DP (2018) Clinical Implications of Clonal Hematopoiesis. Mayo Clin Proc 93:1122–1130 pmid:30078412
  30. Leinonen R, Sugawara H, Shumway M, on behalf of the International Nucleotide Sequence Database Collaboration (2011) The Sequence Read Archive. Nucleic Acids Res 39:D19–D21
  31. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760 pmid:19451168
  32. Van der Auwera GA, Carneiro MO, Hartl C, et al (2013) From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinformatics 43:11.10.1–11.10.33 pmid:25431634
  33. Franke KR, Crowgey EL (2020) Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms. Genomics Inform 18:e10 pmid:32224843
  34. Carpi G, Gorenstein L, Harkins TT, Samadi M, Vats P (2022) A GPU-accelerated compute framework for pathogen genomic variant identification to aid genomic epidemiology of infectious disease: a malaria case study. Brief Bioinform 23:1–11 pmid:35945154
  35. 10x Chromium (2019) Chromium Single Cell V(D)J Reagent Kits with Feature Barcoding technology for Cell Surface Protein. https://www.10xgenomics.com/support/single-cell-immune-profiling. Accessed 17 Mar 2023
  36. Cao Y, Wang X, Peng G (2020) SCSA: A cell type annotation tool for single-cell RNA-seq data. Front Genet 11:490 pmid:32477414
  37. Pedregosa F, Michel V, Grisel O, et al (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830
  38. Seabold S, Perktold J (2010) Statsmodels: Econometric and Statistical Modeling with Python. In: Proc. of the 9th Python in Science Conf. pp 92–96
  39. One-Off Coder (2023) Data Science Topics. https://datascience.oneoffcoder.com. Accessed 17 Mar 2023
  40. Anandakrishnan R, Zyvoloski IJ, Zyvoloski LR, Opoku NK, Dai A, Antony V (2023) Potential immunosuppressive clonal hematopoietic mutations in tumor infiltrating immune cells in breast invasive carcinoma. Sci Rep 13:13131 pmid:37573441
  41. Weinstein JN, Collisson EA, Mills GB, et al (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45:1113–1120 pmid:24071849
  42. Elazezy M, Joosse SA (2018) Techniques of using circulating tumor DNA as a liquid biopsy component in cancer management. Comput Struct Biotechnol J 16:370–378 pmid:30364656
  43. Petrackova A, Vasinek M, Sedlarikova L, Dyskova T, Schneiderova P, Novosad T, et al (2019) Standardization of Sequencing Coverage Depth in NGS: Recommendation for Detection of Clonal and Subclonal Mutations in Cancer Diagnostics. Front Oncol 9:851 pmid:31552176
  44. Lambrechts D, Wauters E, Boeckx B, et al (2018) Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med 24:1277–1289 pmid:29988129
  45. Laughney AM, Hu J, Campbell NR, et al (2020) Regenerative lineages and immune-mediated pruning in lung cancer metastasis. Nat Med 26:259–269 pmid:32042191
  46. Sinjab A, Han G, Treekitkarnmongkol W, et al (2021) Resolving the spatial and cellular architecture of lung adenocarcinoma by multiregion single-cell sequencing. Cancer Discov 11:2506–2523 pmid:33972311
  47. Sim NL, Kumar P, Hu J, Henikoff S, Schneider G, Ng PC (2012) SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 40:W452–W457 pmid:22689647
  48. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al (2010) A method and server for predicting damaging missense mutations. Nat Methods 7:248–249 pmid:20354512
  49. Altshuler DM, Durbin RM, Abecasis GR, et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65 pmid:23128226
  50. Whalen S, Schreiber J, Noble WS, Pollard KS (2021) Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 23:169–181 pmid:34837041
  51. Joshi A, Butle A, Hait S, Mishra R, Trivedi V, Thorat R, et al (2022) Osimertinib for lung cancer cells harboring low-frequency EGFR T790M mutation. Transl Oncol 22:101461 pmid:35653897
  52. Formenti SC, Rudqvist NP, Golden E, et al (2018) Radiotherapy induces responses of lung cancer to CTLA-4 blockade. Nature Medicine 24:1845–1851 pmid:30397353
  53. Jia Q, Chu Q, Zhang A, et al (2021) Mutational burden and chromosomal aneuploidy synergistically predict survival from radiotherapy in non-small cell lung cancer. Commun Biol 4:1–7
  54. Mao W, Chen R, Lu R, Wang S, Song H, You D, et al (2021) Germline mutation analyses of malignant ground glass opacity nodules in non-smoking lung adenocarcinoma patients. PeerJ 9:e12048 pmid:34540367
  55. Collisson EA, Campbell JD, Brooks AN, et al (2014) Comprehensive molecular profiling of lung adenocarcinoma. Nature 511:543–550 pmid:25079552
  56. Gillette MA, Satpathy S, Cao S, et al (2020) Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell 182:200–225.e35 pmid:32649874
  57. Satpathy S, Krug K, Jean Beltran PM, et al (2021) A proteogenomic portrait of lung squamous cell carcinoma. Cell 184:4348–4371.e40 pmid:34358469
  58. Schenk MW, Humphrey S, Hossain ASMM, et al (2021) Soluble guanylate cyclase signalling mediates etoposide resistance in progressing small cell lung cancer. Nat Commun 12:1–15
  59. Hammerman PS, Voet D, Lawrence MS, et al (2012) Comprehensive genomic characterization of squamous cell lung cancers. Nature 489:519–525 pmid:22960745
  60. Ribeiro-dos-Santos AM, Vidal AF, Vinasco-Sandoval T, Guerreiro J, Santos S, Ribeiro-dos-Santos Â, et al (2020) Exome Sequencing of Native Populations From the Amazon Reveals Patterns on the Peopling of South America. Front Genet 11:1359 pmid:33193622
  61. Fischbach GD, Lord C (2010) The simons simplex collection: A resource for identification of autism genetic risk factors. Neuron 68:192–195 pmid:20955926
  62. Auton A, Abecasis GR, Altshuler DM, et al (2015) A global reference for human genetic variation. Nature 526:68–74 pmid:26432245
  63. Lyons JJ, Yu X, Hughes JD, et al (2016) Elevated basal serum tryptase identifies a multisystem disorder associated with increased TPSAB1 copy number. Nat Genet 48:1564–1569 pmid:27749843
  64. Xu Q, Wu C, Zhu Q, et al (2022) Metagenomic and metabolomic remodeling in nonagenarians and centenarians and its association with genetic and socioeconomic factors. Nat Aging 2:438–452 pmid:37118062
  65. Chambers JC, Abbott J, Zhang W, et al (2014) The South Asian Genome. PLoS One 9:e102645 pmid:25115870
  66. Rodriguez-Flores J, O’Beirne S, Salit J, Kaner R, Downey R, Mezey J, et al (2018) Identification of Large Clones of Potentially Deleterious Somatic Mutations in the Small Airway Epithelium of Smokers Without Cancer. Am J Respir Crit Care Med 197:A1952
  67. Park JS, Lee J, Jung ES, et al (2019) Brain somatic mutations observed in Alzheimer’s disease associated with aging and dysregulation of tau phosphorylation. Nat Commun 10:1–12
  68. Kim SY (2009) Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinformatics 10:1–10
  69. Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DSA, Nobel AB, et al (2006) Concordance among Gene-Expression–Based Predictors for Breast Cancer. New England Journal of Medicine 355:560–569 pmid:16899776
  70. Dobbin KK, Zhao Y, Simon RM (2008) How Large a Training Set is Needed to Develop a Classifier for Microarray Data? Clinical Cancer Research 14:108–114 pmid:18172259
  71. Kim J, Kim D, Lim JS, Maeng JH, Son H, Kang HC, et al (2019) The use of technical replication for detection of low-level somatic mutations in next-generation sequencing. Nat Commun 10:1–11
  72. Takahashi H, Ogata H, Nishigaki R, Broide DH, Karin M (2010) Tobacco Smoke Promotes Lung Tumorigenesis by Triggering IKKβ- and JNK1-Dependent Inflammation. Cancer Cell 17:89–97
  73. Swann JB, Vesely MD, Silva A, Sharkey J, Akira S, Schreiber RD, et al (2008) Demonstration of inflammation-induced cancer and cancer immunoediting during primary tumorigenesis. Proc Natl Acad Sci U S A 105:652–656 pmid:18178624
  74. Wu Y, Ye S, Goswami S, Pei X, Xiang L, Zhang X, et al (2020) Clinical significance of peripheral blood and tumor tissue lymphocyte subsets in cervical cancer patients. BMC Cancer 20:1–12 pmid:32131750
  75. Butler M, Morel AS, Jordan WJ, Eren E, Hue S, Shrimpton RE, et al (2007) Altered expression and endocytic function of CD205 in human dendritic cells, and detection of a CD205–DCL-1 fusion protein upon dendritic cell maturation. Immunology 120:362–371 pmid:17163964
  76. Stengel S, Quickert S, Lutz P, et al (2020) Peritoneal Level of CD206 Associates With Mortality and an Inflammatory Macrophage Phenotype in Patients With Decompensated Cirrhosis and Spontaneous Bacterial Peritonitis. Gastroenterology 158:1745–1761 pmid:31982413
  77. Brabender J, Danenberg K, Metzger R, et al (2002) The role of retinoid X receptor messenger RNA expression in curatively resected non-small cell lung cancer. Clin Cancer Res 8:438–443 pmid:11839661
  78. Rigas JR, Dragnev KH (2005) Emerging Role of Rexinoids in Non-Small Cell Lung Cancer: Focus on Bexarotene. Oncologist 10:22–33 pmid:15632250