Multigroup prediction in lung cancer patients and comparative controls using signature of volatile organic compounds in breath samples

  • Shesh N. Rai ,

    Contributed equally to this work with: Shesh N. Rai, Samarendra Das

    Roles Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing

    shesh.rai@louisville.edu (SNR); samarendra.das@louisville.edu (SD)

    Affiliations Biostatistics and Bioinformatics Facility, Brown Cancer Center, University of Louisville, Louisville, KY, United States of America, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY, United States of America, Hepatobiology and Toxicology Center, University of Louisville, Louisville, KY, United States of America, Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, United States of America, Biostatistics and Informatics Facility, Center for Integrative Environmental Research Sciences, University of Louisville, Louisville, KY, United States of America, Christina Lee Brown Envirome Institute, University of Louisville, Louisville, KY, United States of America

  • Samarendra Das ,

    Contributed equally to this work with: Shesh N. Rai, Samarendra Das

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    shesh.rai@louisville.edu (SNR); samarendra.das@louisville.edu (SD)

    Affiliations Biostatistics and Bioinformatics Facility, Brown Cancer Center, University of Louisville, Louisville, KY, United States of America, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY, United States of America, ICAR-Directorate of Foot and Mouth Disease, Arugul, Bhubaneswar, Odisha, India, International Centre for Foot and Mouth Disease, Arugul, Bhubaneswar, Odisha, India, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India

  • Jianmin Pan,

    Roles Data curation, Supervision, Writing – review & editing

    Affiliation Biostatistics and Bioinformatics Facility, Brown Cancer Center, University of Louisville, Louisville, KY, United States of America

  • Dwijesh C. Mishra,

    Roles Supervision, Writing – review & editing

    Affiliation ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India

  • Xiao-An Fu

    Roles Data curation, Funding acquisition, Investigation, Project administration, Supervision, Validation, Writing – review & editing

    Affiliation Department of Chemical Engineering, University of Louisville, Louisville, KY, United States of America

Abstract

Early detection of lung cancer is a crucial factor for increasing its survival rates among the detected patients. The presence of carbonyl volatile organic compounds (VOCs) in exhaled breath can play a vital role in early detection of lung cancer. Identifying these VOC markers in breath samples through innovative statistical and machine learning techniques is an important task in lung cancer research. Therefore, we proposed an experimental approach for generation of VOC molecular concentration data using unique silicon microreactor technology and further identification and characterization of key relevant VOCs important for lung cancer detection through statistical and machine learning algorithms. We reported several informative VOCs and tested their effectiveness in multi-group classification of patients. Our analytical results indicated that seven key VOCs, including C4H8O2, C13H22O, C11H22O, C2H4O2, C7H14O, C6H12O, and C5H8O, are sufficient to detect the lung cancer patients with higher mean classification accuracy (92%) and lower standard error (0.03) compared to other combinations. In other words, the molecular concentrations of these VOCs in exhaled breath samples were able to discriminate the patients with lung cancer (n = 156) from the healthy smoker and nonsmoker controls (n = 193) and patients with benign pulmonary nodules (n = 65). The quantification of carbonyl VOC profiles from breath samples and identification of crucial VOCs through our experimental approach paves the way forward for non-invasive lung cancer detection. Further, our experimental and analytical approach of VOC quantitative analysis in breath samples may be extended to other diseases, including COVID-19 detection.

Introduction

Lung cancer is the most common kind of cancer globally due to poor lifestyle and environmental pollution. In the USA, lung cancer accounted for 235,760 new cases and 131,880 deaths in 2020 and is by far the leading cause of cancer deaths among both men and women, making up almost 25% of all cancer fatalities [1]. Every year, more people die of lung cancer than of colon, breast, and prostate cancers combined [2]. The 5-year survival rate for lung cancer patients is relatively low compared to other diseases. Early detection of lung cancer is crucial to improve the survival of the patients [2]. For this purpose, many screening methods, including chest radiography, sputum cytology, low-dose spiral computed tomography, fluorescence bronchoscopy, and positron emission tomography, have been used [3]. However, these procedures are complicated, expensive, and time-consuming, which makes them hard to access for low-income patient groups. Therefore, lung cancer detection through breath analysis using volatile organic compounds (VOCs) can be a vital tool for early detection and, in turn, for increasing the chance of survival.

VOCs are organic chemicals that have high vapor pressure at ambient temperature and are found in air, exhaled breath, and other media [4]. The analysis of VOCs in exhaled breath plays a vital role in the early detection of lung cancer [4]. Earlier studies showed that human breath contains several thousand VOCs, e.g., formaldehyde (CH2O), acetaldehyde (C2H4O), acetone (C3H6O), and 2-butanone (C4H8O). So far, many VOCs have been identified and reported in breath samples of healthy humans and patients with lung cancer [5]. For instance, in 1999, Phillips et al. reported over 3400 different VOCs in the exhaled breath of normal humans [6].

Combinations of some VOCs reported in the literature can discriminate lung cancer patients from healthy subjects [7]. For instance, a recent study showed higher sensitivity of lung cancer detection among patients using the carbonyl VOCs present in exhaled breath samples [8]. In that study, molecular concentration data on the VOCs present in exhaled breath samples were collected from the patients to classify them into several groups, i.e., Healthy Control, Cancer, and Benign Nodule. In this direction, Bousamra et al. (2014) developed a quantitative analytical approach to differentiate early lung cancer from benign pulmonary nodules based on the molecular concentration of carbonyl compounds in exhaled breath samples [9]. In the existing analytical approaches, statistical methods including the t-test and the Wilcoxon test were used to select the significant VOCs [7, 9]. These methods are univariate in nature and mostly ignore the inter-VOC relationships (i.e., relationships among the VOCs), so there is a chance of spurious association of a VOC with the patient classes [16]. Further, Li et al. (2015) developed a technique utilizing quaternary amino-oxy coated silicon micro-reactors [4] for selective capture and quantification of ketones and aldehydes in air [10–12] and exhaled breath [7–9]. They used five classification models, namely generalized partial least squares, support vector machines, random forests, and linear and quadratic discriminant analyses, to classify the patients into lung cancer and control groups based on exhaled breath data [4]. Several studies in the literature have indicated that the carbonyl compounds in exhaled breath play a significant role in the non-invasive detection of lung cancer [7–12]. However, the proper identification of key carbonyl compounds through statistical and machine learning techniques requires further advances.

In a typical breath sample study, molecular concentration data on several hundred endogenous and exogenous VOCs are usually observed across patients. Sometimes, it may not be experimentally possible to monitor data on all the VOCs and use them in lung cancer detection. Moreover, not all of these VOCs may be required for patient classification or for the predictive model building process (i.e., training the machine learning models and later using them for class label prediction). Therefore, it is pertinent to identify a few metabolic VOCs related to lung cancer as key features for cancer detection. The selection of important features (here, metabolic VOCs) out of many is called feature selection in machine learning [13]. Further, it is essential to determine the number of significant VOCs (i.e., the feature size or dimension of the VOC data) that can be used in training the classification model to predict the class of a patient. Selecting the significant VOCs saves the time and cost of generating data for all VOCs present in the breath samples, because researchers can focus on a few VOCs instead.

Therefore, in this study, we endeavor to classify the lung cancer patients based on the VOC molecular concentration data. We present an experimental approach for lung cancer prediction using the carbonyl VOCs present in breath samples. The VOC molecular concentration data are generated from the breath samples of the 414 subjects (156: lung cancer; 65: benign and 193: healthy control) through the unique silicon microreactor technology [8]. The breath samples are collected following the experimental protocol approved by the Institutional Review Board (IRB) at the University of Louisville, USA. We also present an analytical approach involving relevant VOC selection and further used them in lung cancer patient classification model training. This approach of VOC selection is statistically sound, robust, and does not require any probability distributional assumptions about the VOC data (for VOCs testing and selection). Further, we identified several informative VOCs present in exhaled breath samples to detect lung cancer patients. For instance, the developed models provided sufficient classification accuracy for lung cancer detection with a minimum of three VOCs. Also, we studied the effect of the various VOC combinations on the classification of lung cancer patients. Moreover, our developed experimental approach can be applied to detect COVID-19 patients using VOC data from the exhaled breath samples.

The remainder of the paper is organized as follows: (i) the materials and methods section gives the detailed protocols for data generation and a description of the methodology; (ii) the results and discussion section presents the obtained results along with their discussion; and (iii) the conclusion section summarizes the manuscript along with its future scope.

Materials and methods

Breath sampling and data generation

This study recruited 156 patients with untreated lung cancer, 65 patients with benign pulmonary nodules, and 193 healthy control subjects to provide exhaled breath samples. The detailed subject demographic characteristics, disease information, and breath analysis data have been published [8]. In brief, there were 103 patients with early stages (Stages 0, I, and II) of lung cancer and 53 patients with late stages of lung cancer. Most lung cancer patients were current or former smokers (149). The healthy controls included 113 current or former smokers and 80 never smokers. The average ages of the lung cancer patients, patients with benign pulmonary nodules, and healthy subjects were 65.1, 54.2, and 49.4 years, respectively. The male percentages of these three subgroups were 51.9%, 49.2%, and 55.7%, respectively.

The detailed research protocol for the collection of exhaled breath samples was approved by the IRB, University of Louisville, USA (IRB #15.0711). The healthy control subjects were recruited from patient family members who were free of lung cancer or other chronic pulmonary disease. All patients with pulmonary nodules were recruited in the James Graham Brown Cancer Center and Jewish Hospital at the University of Louisville, USA. The diagnostic predictions from these breath analyses were confirmed by clinical diagnoses using follow-up CT scans, positron emission tomography scans, or pathology of biopsy or surgically resected specimens.

Exhaled breath samples were collected in one-liter Tedlar bags (Sigma-Aldrich, USA) through normal exhalation, allowing for the collection of a mixture of alveolar and tidal breath in one exhaled breath. Ambient clinic exam room air samples (1 L) were also collected to serve as a control for background carbonyl compounds in the collection room. Our previous studies have described in detail the method of breath sample collection, evacuation of breath samples through the micro-reactors, and analysis of the samples [4, 7–9]. In brief, the subjects exhaled directly into Tedlar bags through a Teflon tube to provide one-liter exhaled breath samples, a non-invasive collection technique that the patients readily accepted. After the collection of exhaled breath, the Tedlar bags were directly connected to the silicon micro-reactors through silica capillary tubes and septa. A vacuum was applied to draw the collected VOCs from the Tedlar bags through the fabricated microreactor [10–12] at a flow rate of 5 mL/min. The microreactor has thousands of micropillars coated with 2-(aminooxy)-N,N,N-trimethyl-ethanammonium (ATM) iodide. After complete deflation of the sample bags, ATM and ATM adducts were eluted by flowing methanol (~100 μL) from a pressurized vial through the microreactor and into a collection vial [10]. The eluent methanol solutions were directly analyzed using a hybrid linear ion trap-Fourier transform-ion cyclotron resonance mass spectrometry (FT-ICR-MS) instrument (Finnigan LTQ-FT, Thermo Electron, Bremen, Germany) equipped with a TriVersa NanoMate ion source (Advion BioSciences, Ithaca, NY) and a nano-electrospray chip (inner nozzle diameter 5.5 μm). A 5 μL methanol solution containing a known amount (5 nmol) of ATM–acetone-d6 adduct was added to each eluted methanol sample as an internal reference for FT-ICR-MS. The amount of captured carbonyl compounds was then determined by comparing the FT-ICR-MS signal abundance of ATM–acetone-d6 with those of the other ATM–carbonyl adducts. The concentration of each compound in exhaled breath detected by FT-ICR-MS was then calculated from the amount of the captured carbonyl compound, in nanomoles per liter (nmole/L). The microreactor's carbonyl capture efficiencies and the validity of the analyses have been characterized in our previous studies [10–12]. A flow chart of the bioassay for generating data is displayed in Fig 1. Further, the workflow of the proposed experimental approach, including data generation, feature selection, and the classification model development, is also illustrated in Fig 1.

Fig 1. Outline of the experimental approach used in this study.

Various steps undertaken in this study are shown in flow chart form. The various steps include (i) capturing of exhaled breath; (ii) capturing VOCs present in exhaled breath samples; (iii) generation of molecular concentration data (nmole/L) of VOCs through bioassay mass-spectrometry technique; (iv) feature (VOC) selection and classification model training and its application for lung cancer detection; and (v) validation of the selected VOCs through classification accuracy, literature search, and expert opinion.

https://doi.org/10.1371/journal.pone.0277431.g001

This study was approved by the IRB at the University of Louisville (IRB #15.0711). An informed written consent form was reviewed by the subjects for participation. A signed consent form was obtained from the subjects before they participated in the study. There was no minor recruited for the study.

Notation

Let $X_{N \times M} = [x_{i,m}]$ be the VOC data matrix, where $x_{i,m}$ represents the observed measurement of the $i$th ($i = 1, 2, \ldots, N$) VOC for the $m$th ($m = 1, 2, \ldots, M$) patient; $\mathbf{x}_m$ be the $N$-dimensional vector of observed values of the $N$ VOCs for the $m$th patient; $y_m$ be the outcome variable for the target class label of the $m$th patient, taking the value $+1$ for lung cancer and $-1$ for control; and $M_1$ and $M_2$ be the numbers of patients in the lung cancer and control classes, respectively ($M_1 + M_2 = M$).

Support Vector Machine-Recursive Feature Elimination

Support Vector Machine-Recursive Feature Elimination (SVM-RFE) can be used to select relevant VOCs (in a two-group case) from the lung cancer VOC data [14, 15]. Let $\{\mathbf{x}_m, y_m\} \in \mathbb{R}^N \times \{-1, 1\}$ be the input given to the Support Vector Machine (SVM) model. We wish to find a hyperplane that separates the lung cancer patients ($y_m = 1$) from the controls ($y_m = -1$) in such a way that the distance between the hyperplane and the closest point is maximal (Supplementary Document S1 in S1 File). The hyperplane can be written as:

$$\sum_{i=1}^{N} k_i x_i + b = 0 \tag{1}$$

where $k_i$ and $b$ are the weight of the $i$th VOC and the bias, respectively. Here, we assume that the patients of the two classes are linearly separable. In other words, we can select two parallel support hyperplanes that separate the lung cancer and control classes in such a way that the distance between these two planes is maximized (Supplementary Document S1 in S1 File).

For the lung cancer class ($y_m = 1$), the support hyperplane can be written as:

$$\sum_{i=1}^{N} k_i x_i + b = 1 \tag{2}$$

Here, Eq 2 holds only for those $\mathbf{x}$ that are support vectors (i.e., the points closest to the separating hyperplane, Eq 1).

For the control class ($y_m = -1$), the support hyperplane can be written as:

$$\sum_{i=1}^{N} k_i x_i + b = -1 \tag{3}$$

Now, we require that every point lies on the outer side of its respective support hyperplane (Eq 2 and Eq 3), i.e., its distance to the separating hyperplane is at least the distance between the support vectors and the separating hyperplane. This can be expressed as:

$$y_m \left( \sum_{i=1}^{N} k_i x_{i,m} + b \right) \geq 1, \quad m = 1, 2, \ldots, M \tag{4}$$

We wish to maximize the distance between the lung cancer and control support hyperplanes given in Eqs 2 and 3, respectively, while preventing any data point from falling between these two support hyperplanes.

To maximize the distance between the support hyperplanes (Supplementary Document S1 in S1 File), we need to minimize $\frac{1}{2} \sum_{i=1}^{N} k_i^2$ under the constraint of Eq 4. Mathematically, the objective function can be written as:

$$L(\mathbf{k}, b, \boldsymbol{\varphi}) = \frac{1}{2} \sum_{i=1}^{N} k_i^2 - \sum_{m=1}^{M} \varphi_m \left[ y_m \left( \sum_{i=1}^{N} k_i x_{i,m} + b \right) - 1 \right] \tag{5}$$

where $\varphi_m$ ($\geq 0$) are Lagrange multipliers. The non-negativity of $\varphi_m$ follows from the fact that the constraint is an inequality. Here, the $k_i$'s are obtained by minimizing the objective function in Eq 5. Optimizing Eq 5 with respect to $k_i$ and $b$ gives the following expressions:

$$\frac{\partial L}{\partial k_i} = k_i - \sum_{m=1}^{M} \varphi_m y_m x_{i,m} = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{m=1}^{M} \varphi_m y_m = 0 \tag{6}$$

The value of $\mathbf{k}$ follows from the first expression in Eq 6 and is expressed as:

$$\mathbf{k} = \sum_{m=1}^{M} \varphi_m y_m \mathbf{x}_m \tag{7}$$

One also needs to solve for $\varphi_m$, but this cannot be done simply by setting the gradient in Eq 6 to zero, because the constraint $\varphi_m \geq 0$ must be taken into account. Therefore, SVMs cannot be solved via a linear system of equations. Rather, optimization algorithms (e.g., gradient descent) are commonly used in SVM-based tools. For this purpose, we used the svm function implemented in the e1071 R package.

The quantity $k_i^2$ ($\geq 0$) (i.e., the square of the $i$th element of $\mathbf{k}$ in Eq 7) is used as a metric for ranking the VOCs in the data [15] and evaluates the impact of the $i$th VOC on patients' classification [16]. In the SVM-RFE technique, the VOC with the smallest $k_i^2$ is eliminated iteratively in a backward elimination manner, and a ranked VOC list is prepared at the end [15, 17]. Here, backward elimination means training the SVM-based machine learning model repeatedly, removing the least significant VOC at each step. However, most feature selection methods, such as SVM-RFE, are sensitive to slight permutations of the class labels [18]. Ranking the VOCs may also lead to the selection of spuriously associated VOCs and make the selection process unreliable [13, 19, 20]. Therefore, it is essential to select VOCs based on statistical testing instead of their ranks. For this purpose, we used the Bootstrap SVM-RFE (Boot-SVM-RFE) technique developed by Das et al. (2017) to select the most relevant VOCs for patient classification [19]. The Boot-SVM-RFE method is briefly described in the following section.
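The backward elimination loop described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the study used the svm function of the e1071 R package, whereas the hypothetical `train_linear_svm` below is a simple full-batch subgradient solver for a regularized hinge loss, chosen only to keep the sketch self-contained. The names `svm_rfe`, `lam`, `epochs`, and `lr` are assumptions, and samples are stored as rows here (the paper's data matrix has VOCs as rows).

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear soft-margin SVM by full-batch subgradient descent (illustrative solver).

    X: list of samples, each a list of feature values; y: labels in {+1, -1}.
    Returns the weight vector k and the bias b.
    """
    n_feat, M = len(X[0]), len(X)
    k, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_k = [lam * kj for kj in k]  # gradient of the L2 penalty
        grad_b = 0.0
        for xm, ym in zip(X, y):
            margin = ym * (sum(kj * xj for kj, xj in zip(k, xm)) + b)
            if margin < 1:  # sample violates the margin: hinge loss is active
                for j in range(n_feat):
                    grad_k[j] -= ym * xm[j] / M
                grad_b -= ym / M
        k = [kj - lr * g for kj, g in zip(k, grad_k)]
        b -= lr * grad_b
    return k, b


def svm_rfe(X, y):
    """Rank feature indices from most to least relevant by recursive elimination."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[xm[j] for j in remaining] for xm in X]
        k, _ = train_linear_svm(X_sub, y)
        # Drop the feature with the smallest squared weight k_i^2.
        worst = min(range(len(remaining)), key=lambda j: k[j] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated = most relevant
```

Calling `svm_rfe(X, y)` on a samples-by-VOCs matrix returns VOC indices ordered by relevance; the position of a VOC in this list plays the role of its rank in the bootstrap procedure that follows.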

Bootstrap SVM-RFE

In the usual supervised setting, the M patients/samples (the columns of the data matrix X), each belonging either to the lung cancer or to the control class, can be considered as subjects/units in a population model, as shown in Eq 8.

$$\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_M, y_M)\} \tag{8}$$

where $\mathbf{x}_m$ is the $N$-dimensional vector of the VOC measurements of the $m$th patient/sample and $y_m$ is the class label (e.g., cancer vs. control) of the $m$th patient/sample. Here, it is assumed that the measurements of the different patients/samples (i.e., $\mathbf{x}_m$, $m = 1, 2, \ldots, M$) are independent and identically distributed (iid), but the VOCs within the same sample may be correlated. In the bootstrap procedure, M units are randomly drawn with replacement from the M population units in Eq 8 to constitute a bootstrap data matrix (the M drawn units serve as its M columns). This process is repeated B times to get B bootstrap data matrices. Here, B (i.e., the number of bootstrap samples) depends on several factors, such as the number of units in the population model in Eq 8 [15, 19]. We set B = 500, as the literature showed that the number of bootstrap samples required for obtaining the distribution of the test statistic(s) must be sufficiently large [18, 21]. Furthermore, the values of M for the different classification problems, namely Cancer vs. Control, Cancer vs. Benign, Benign vs. Control, (Benign + Cancer) vs. Control, and Cancer vs. (Benign + Control), are 349, 221, 258, 414, and 414, respectively.
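The resampling step can be illustrated with a short Python sketch. It is only an assumption about how the draw might be coded (the function name `bootstrap_matrices` and the `seed` argument are hypothetical); the data layout follows the text, with VOCs as rows and patients as columns, and each drawn patient keeps its class label.

```python
import random


def bootstrap_matrices(X, y, B, seed=0):
    """Draw B bootstrap replicates by resampling patients (columns) with replacement.

    X: list of N rows (one per VOC), each a list of M patient values.
    y: list of M class labels, aligned with the columns of X.
    Returns a list of B (X_b, y_b) pairs with the same shape as (X, y).
    """
    rng = random.Random(seed)
    M = len(y)
    replicates = []
    for _ in range(B):
        cols = [rng.randrange(M) for _ in range(M)]  # M draws with replacement
        X_b = [[row[c] for c in cols] for row in X]
        y_b = [y[c] for c in cols]                   # labels travel with patients
        replicates.append((X_b, y_b))
    return replicates
```

Because patients are drawn as whole columns, the correlation among VOCs within a sample is preserved in every replicate, which matches the iid-across-patients assumption stated above.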

The B bootstrap data matrices are given as input to the SVM-RFE technique to compute the $k_i^2$ scores (from Eq 7), and subsequently, VOC ranking was performed on each of the B bootstrap data matrices. Let $P_{ib}$ be a random variable (rv) that indicates the position of the $i$th VOC (i.e., its rank obtained from the SVM-RFE) in the $b$th bootstrap data matrix. Then, another rv can be defined based on $P_{ib}$ (without loss of generality), given as: (9) where, $R_{ib}$ in Eq 9 is the rank score of the $i$th ($i = 1, 2, \ldots, N$) VOC in the $b$th ($b = 1, 2, \ldots, B$) bootstrap data matrix. Here, it may be noted that the distribution of the rank scores of VOCs, computed from a bootstrap data matrix, is symmetric around the median value, 0.5 (as rank scores are functions of ranks).

To decide whether the $i$th VOC is relevant for the patient classification, the following null and alternative hypotheses are framed.

  1. H0: $R_i \leq 0.5$ (the $i$th VOC is not relevant to patient classification)
  2. H1: $R_i > 0.5$ (the $i$th VOC is relevant to patient classification)

where, Ri is the rank score for ith VOC over all possible bootstrap samples.

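To obtain the distribution of the test statistic under H0, we defined another rv $Z_{ib}$, as:

$$Z_{ib} = \begin{cases} 1, & R_{ib} > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

Let $r_{ib}$ be another rv representing the rank assigned to $(R_{ib} - 0.5)$ after arranging these differences in ascending order of their magnitudes. In other words, for each row (which denotes a VOC), $(R_{ib} - 0.5)$ was computed for every bootstrap sample and the corresponding rank, $r_{ib}$, was assigned to $(R_{ib} - 0.5)$.

To test H0 vs. H1, the test statistic for the $i$th VOC, $W_i$, is developed and is given as:

$$W_i = \sum_{b=1}^{B} U_{ib} \tag{11}$$

where $U_{ib} = Z_{ib}\, r_{ib}$.

In other words, $W_i$ (Eq 11) is the sum of the ranks of the positive $(R_{ib} - 0.5)$ for the $i$th VOC over the B bootstrap samples. Further, $U_{ib}$ in Eq 11 is a Bernoulli-type rv, and under H0 its probability mass function can be given as:

$$P(U_{ib} = u) = \begin{cases} \frac{1}{2}, & u = r_{ib} \\ \frac{1}{2}, & u = 0 \end{cases} \tag{12}$$

Here, the expected value of $W_i$ in Eq 11 under H0 can be obtained as:

$$E(W_i) = \sum_{b=1}^{B} \frac{r_{ib}}{2} = \frac{B(B+1)}{4} \tag{13}$$

The variance of $W_i$ becomes:

$$\mathrm{Var}(W_i) = \sum_{b=1}^{B} \frac{r_{ib}^2}{4} = \frac{B(B+1)(2B+1)}{24} \tag{14}$$

As B is sufficiently large, by the central limit theorem the distribution of $W_i$, given in Eq 11, becomes:

$$\frac{W_i - B(B+1)/4}{\sqrt{B(B+1)(2B+1)/24}} \sim N(0, 1) \tag{15}$$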

Through Eq 15, the p-value for ith (i = 1, 2, …, N) VOC was computed, and similarly, this testing procedure was repeated for the remaining N-1 VOCs. In other words, the above statistical test was repeated for N times to compute the statistical significance values for the VOCs.
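As an illustration of this testing step, the sketch below computes the statistic and a one-sided p-value for a single VOC from its B bootstrap rank scores. It is a hedged sketch rather than the authors' code: the function name `boot_rank_test` is hypothetical, the null mean B(B+1)/4 and variance B(B+1)(2B+1)/24 are the standard signed-rank moments and are assumed here, and ties in the magnitudes are broken by position instead of by mid-ranks.

```python
import math


def boot_rank_test(rank_scores):
    """Signed-rank-type test of H0: R_i <= 0.5 against H1: R_i > 0.5 for one VOC.

    rank_scores: the B rank scores R_ib of the VOC, one per bootstrap sample.
    Returns (W, z, p): the statistic, its normal score, and the upper-tail p-value.
    """
    B = len(rank_scores)
    diffs = [r - 0.5 for r in rank_scores]
    # Rank the differences by ascending magnitude (ranks 1..B).
    order = sorted(range(B), key=lambda b: abs(diffs[b]))
    ranks = [0] * B
    for position, b in enumerate(order, start=1):
        ranks[b] = position
    # W sums the ranks attached to positive differences (Z_ib = 1).
    W = sum(ranks[b] for b in range(B) if diffs[b] > 0)
    mean = B * (B + 1) / 4
    var = B * (B + 1) * (2 * B + 1) / 24
    z = (W - mean) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(N(0,1) >= z)
    return W, z, p
```

A VOC that is ranked near the top in most bootstrap samples has rank scores mostly above 0.5, hence a large W and a small p-value; repeating the call for each of the N VOCs gives the p-values that enter the multiple-testing correction.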

Let $p_1, p_2, \ldots, p_N$ be the p-values for the N VOCs, and α be the desired significance level. Because N hypotheses are tested simultaneously, we employed Hochberg's procedure [22] to correct for multiple testing and computed the adjusted (adj.) p-values for the VOCs. The algorithm for Hochberg's procedure is as follows:

First, the p-values of the VOCs were sorted in increasing order of magnitude as $p_{(1)}, p_{(2)}, \ldots, p_{(N)}$, where $p_{(i)}$ is the $i$th ordered p-value (i.e., $p_{(1)}$: smallest p-value, $p_{(2)}$: second smallest p-value, and so on).

  1. Step 1. If $p_{(N)} > \alpha$, then retain the corresponding null hypothesis and go to the next step. Else reject all hypotheses and stop.
  2. Step i = 2, 3, …, N−1. If $p_{(N-i+1)} > \alpha/i$, then retain the corresponding null hypothesis and go to the next step. Else reject all remaining hypotheses and stop.
  3. Step N. If $p_{(1)} > \alpha/N$, then retain the corresponding null hypothesis. Else reject it.

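Now, the adj. p-values, denoted $\tilde{p}_{(i)}$, are given recursively beginning with the largest p-value [22]:

$$\tilde{p}_{(N)} = p_{(N)}, \qquad \tilde{p}_{(i)} = \min\left\{ \tilde{p}_{(i+1)},\ (N - i + 1)\, p_{(i)} \right\}, \quad i = N-1, \ldots, 1 \tag{16}$$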
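The recursive adjustment can be sketched in a few lines of Python. This is an illustrative helper (the name `hochberg_adjust` is hypothetical), written to follow the usual step-up form of Hochberg's adjusted p-values, in which the largest p-value is left unchanged and each smaller one is multiplied by its step number and capped by the adjusted value above it.

```python
def hochberg_adjust(pvals):
    """Return Hochberg step-up adjusted p-values in the original order of pvals."""
    N = len(pvals)
    order = sorted(range(N), key=lambda i: pvals[i])  # indices, ascending p-value
    adjusted = [0.0] * N
    running_min = 1.0
    # Walk from the largest p-value (rank N) down to the smallest (rank 1).
    for rank in range(N, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, (N - rank + 1) * pvals[idx])
        adjusted[idx] = running_min
    return adjusted
```

For example, `hochberg_adjust([0.01, 0.04, 0.03, 0.005])` multiplies the sorted p-values 0.04, 0.03, 0.01, and 0.005 by 1, 2, 3, and 4, respectively, and then enforces monotonicity from the top; a VOC is declared significant when its adjusted value falls below α.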

Based on the computed adj. p-values, the relevant VOCs were selected from the data: a smaller adj. p-value indicates greater relevance of the VOC for the patient classification, and vice versa. The same procedure was applied to select the significant VOCs for the other patient classifications, such as benign vs. control, (lung cancer + benign) vs. control, and (control + benign) vs. lung cancer. The outline and key analytical steps of the VOC selection process are shown in Fig 2. The effects of the significant VOCs on the patient classification were studied through an SVM classifier with a linear kernel. Under this setting, the VOCs (and their molecular concentration data) were given as inputs to the SVM classifier to compute the classification-based performance metrics. Further, the impacts of the VOCs on the classification of patients under five different cases, namely Case I: Cancer vs. Control; Case II: Cancer vs. Benign; Case III: Benign vs. Control; Case IV: (Benign + Cancer) vs. Control; and Case V: (Control + Benign) vs. Cancer, were assessed through performance metrics, such as mean classification accuracy (CA) and standard error (SE) in CA, computed through a varying sliding window size technique [18, 19]. Here, we used this technique to study the importance of the rankings of VOCs (obtained from a feature selection method, e.g., Boot-SVM-RFE) through training a classification model. The sliding windows are intervals of the ranked VOC list that "slide" across the whole list, preferably by a constant distance, and CA is computed for each sliding window. Sliding windows can overlap or be mutually exclusive. A brief description of the varying sliding window size technique is given in Supplementary Document S2 in S1 File. Then, the mean CA and SE in CA were computed over the sliding windows. The VOC set of size n (n ≤ N) that provides maximum discrimination between the subjects/patients of the two groups, as measured by CA, is considered the optimal VOC set. The expressions for mean CA and SE in CA computed through the varying sliding window size technique are given in Eqs 17 and 18.

(17)(18)
Fig 2. Outline of the Boot-SVM-RFE technique for significant VOC selection.

Data matrix has N rows as VOCs with concentrations of nmole/L and M columns as patients/samples. B is the number of bootstrap samples drawn from the data matrix. Vb and Rb’s are the VOC ranked list and VOC rank-score for the bth bootstrap sample. Null hypothesis represents ith VOC is not important for patients’ classification. α is the desired level of significance.

https://doi.org/10.1371/journal.pone.0277431.g002

The total number of windows denoted as K in Eqs 17 and 18 can be defined in Eq 19. (19) where, n is the size of the VOC set, S is the size of the windows (i.e., size refers to the number of ranked VOCs), and L is the sliding length. The values of n, S, L, and K are given in Supplementary Document S2 in S1 File. The optimal size of the VOC set was computed through training the SVM-based classification model (i.e., calculating the indices given in Eqs 17 and 18) using the five-fold cross-validation technique. The outline and workflow of the Boot-SVM-RFE for the key VOC selection are shown in Fig 2.
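A small Python sketch may help to picture the sliding windows and the summary of CA over them. It rests on assumptions that are not taken from the paper: windows are taken to start every L positions and to lie fully inside the list of n ranked VOCs, which yields floor((n − S)/L) + 1 windows, and the SE is taken as the sample standard deviation of the window accuracies divided by the square root of their number. The function names are hypothetical, and the exact definitions are those of Eqs 17 to 19 and Supplementary Document S2.

```python
import math


def sliding_windows(n, S, L):
    """Return (start, end) index pairs (end exclusive) of windows of size S
    slid by L positions over a ranked list of n VOCs (assumed layout)."""
    return [(start, start + S) for start in range(0, n - S + 1, L)]


def mean_and_se(accuracies):
    """Mean and standard error of per-window classification accuracies
    (SE assumed to be sample SD / sqrt(K); requires K >= 2)."""
    K = len(accuracies)
    mean = sum(accuracies) / K
    sample_var = sum((a - mean) ** 2 for a in accuracies) / (K - 1)
    return mean, math.sqrt(sample_var / K)
```

With n = 10 ranked VOCs, window size S = 4, and sliding length L = 2, `sliding_windows(10, 4, 2)` gives four overlapping windows; training the classifier on the VOCs of each window and passing the resulting accuracies to `mean_and_se` produces the two performance metrics used to compare VOC sets.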

Results and discussion

We observed carbonyl VOCs with carbon atoms ranging from one (formaldehyde: CH2O) to thirteen (tridecanal: C13H26O) in exhaled breath samples of healthy controls, patients with benign nodules, and lung cancer patients. Through FT-ICR-MS technology, molecular concentration data on VOCs with chemical formulas CH2O, C2H4O, C3H6O, C4H8O, C5H10O, C6H12O, C7H14O, C8H16O, C9H18O, C10H20O, C11H22O, C12H24O, C13H26O, C4H8O2, C2H4O2, C3H4O, C6H10O2, C9H16O2_HNE, C3H4O2, C4H6O2, C4H6O, C4H4O2, C5H8O, C7H6O, C7H11O, C13H22O, and C15H10O were observed for these three groups. The common names of these VOCs are provided in Supplementary Document S3 in S1 File. The FT-ICR-MS spectra in Fig 1 show typical relative abundances of these VOCs in breath samples of lung cancer patients, patients with benign nodules, and healthy controls. Structurally, isomeric ketones and aldehydes are indistinguishable by direct infusion of one-dimensional FT-ICR-MS. However, the measured molecular weight at a resolving power of 200,000 provides their accurate chemical formulas. Separation and structure identification of some important isomeric ketones and aldehydes was done through FT-ICR-MS/MS and GC–MS technologies. Further, the summary statistics of the 27 considered VOCs for the three patient classes, i.e., control, benign, and lung cancer, are given in S1 Table. The mean values of some VOCs, including C3H6O and C2H4O, are higher than those of the others across all the patient classes, which indicates their higher average concentrations in breath samples (S1 Table).

Not all of the 27 VOCs may be considered as biomarkers for lung cancer detection; therefore, we used the Boot-SVM-RFE technique to identify the significant VOCs for various combinations of patient classes. Here, five different classification cases were used, i.e., Case I: Cancer (156) vs. Control (193); Case II: Cancer (156) vs. Benign (65); Case III: Benign (65) vs. Control (193); Case IV: (Benign + Cancer) (221) vs. Control (193); Case V: (Control + Benign) (258) vs. Cancer (156) (Fig 3). Through the Boot-SVM-RFE, the statistical significance values (p-values) for the VOCs were computed for the five classification problems and are shown in Table 1 and S2 Table. For instance, the VOCs C4H8O, C6H12O, C7H14O, C8H16O, etc., were found to be statistically significant (at the 1% level of significance) for all the cases of classification (Table 1). This finding indicates that these VOCs can be used as biomarkers for the patient classification. Broadly, we found 16 VOCs to be statistically significant in at least one case of patient classification (Table 1). The summary statistics for these key VOCs for the three patient classes are shown in Table 2. The coefficients of variation of the VOCs C3H4O, C6H10O2, C9H16O2, C5H8O, C7H11O, and C13H22O are quite high compared to the others (Table 2). This observation indicates that the molecular concentrations of these VOCs are more dispersed than those of the others across all the classes (Table 2). Similar interpretations can be made for other VOCs.

Fig 3. Identification and characterization of key common VOCs for the patients’ classification under all the five cases.

(A) Venn diagram for the significant VOCs identified through the Boot-SVM-RFE technique. Significant VOCs are identified by setting the threshold for p-values at 10^−15. Through this, 15, 15, 15, 13 and 13 significant VOCs are selected for the Case I, Case II, Case III, Case IV, and Case V, respectively. Case I: Lung cancer vs. Control; Case II: Cancer vs. Benign; Case III: Benign vs. Control; Case IV: (Benign + Cancer) vs. Control; Case V: (Control + Benign) vs. Cancer. (B) Characterization of the common significant VOCs. (C) Summary statistics of the seven significant VOCs with mean and median concentrations (nmole/L) over the whole sample (n = 414). SD: Standard deviation; SE: Standard error; CI: Confidence Interval.

https://doi.org/10.1371/journal.pone.0277431.g003

Table 1. Effects of the VOCs on lung cancer classifications.

https://doi.org/10.1371/journal.pone.0277431.t001

Table 2. Summary statistic(s) of the important VOCs for different patient populations.

https://doi.org/10.1371/journal.pone.0277431.t002

The impact of the VOCs on the different classifications, i.e., cancer vs. control, cancer vs. benign, benign vs. control, (benign + cancer) vs. control, and (control + benign) vs. cancer, was studied through the Boot-SVM-RFE technique. The false discovery rates (FDR) of all the VOCs were computed through Boot-SVM-RFE for the different classifications and are shown in Table 1. For cancer vs. control classification, 13 VOCs, including C4H8O and C4H8O2, were statistically significant when assessed through the FDR values (Fig 3, Table 2). Similarly, for cancer vs. benign, benign vs. control, and (benign + cancer) vs. control classifications, 15 VOCs were statistically significant at the 1% level of significance (Fig 3, Table 1). Further, for (control + benign) vs. cancer classification, 13 VOCs were statistically significant (Table 1). The rankings of the VOCs for the five classification problems are shown in Table 3. For instance, C4H8O and C4H8O2 ranked 1 and 2, respectively, for cancer vs. control classification (Table 3). These ranked VOCs can be used as biomarkers for distinguishing cancer patients from healthy controls. The rankings of the VOCs for the other classifications in Table 3 can be interpreted similarly. The empirical distributions of the significant VOCs are shown in Supplementary Document S4 in S1 File.
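To illustrate how raw p-values can be turned into FDR-adjusted values, the sketch below applies the standard Benjamini-Hochberg step-up adjustment. This is an assumption made for illustration: the exact correction used inside Boot-SVM-RFE is not spelled out in this section, and the function name and input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for four VOCs.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A VOC would then be declared significant when its adjusted value falls below the chosen threshold (e.g., the 1% level mentioned above).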

Table 3. Ranking of significant VOCs on different patients’ classification.

https://doi.org/10.1371/journal.pone.0277431.t003

We also studied how different combinations of the VOCs classify the patients into various classes, such as cancer vs. control, cancer vs. benign, benign vs. control, (benign + cancer) vs. control, and (benign + control) vs. cancer, through an SVM-based classification model. Here, the classification accuracy was computed through five-fold cross-validation for each classification problem. The cross-validation was repeated 500 times with different combinations of VOCs (based on their ranks) for each classification problem. Then, the mean and standard error of the classification accuracies were computed over the 500 runs through a varying sliding window size technique [13, 16, 19]. The results are shown in Table 4. For cancer vs. control classification, the top three VOCs, i.e., C4H8O, C4H8O2, and C13H22O, provided a reliable mean classification accuracy of 92.00% with a standard error of 0.03. Further, the highest mean classification accuracy was observed when the nine top-ranked VOCs (Table 3) were included (Table 4). Similar interpretations can be made for the other classifications, such as cancer vs. benign, benign vs. control, (benign + cancer) vs. control, and (benign + control) vs. cancer. However, we observed consistently better results when the nine top-ranked VOCs (as given in Table 3) were considered in all five classification problems.
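The evaluation loop described above can be sketched as repeated k-fold cross-validation over the top-ranked features, followed by the mean and standard error of the accuracies. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the SVM to keep the example dependency-free, the sliding window step is omitted, the standard error is taken as the standard deviation divided by the square root of the number of repeats, and all function names are hypothetical.

```python
import random
from statistics import mean, stdev


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT an SVM): assign x to the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def kfold_accuracy(X, y, k, rng):
    """Run one shuffled k-fold pass and return the overall accuracy."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(
            nearest_centroid_predict(train_X, train_y, X[i]) == y[i] for i in fold
        )
    return correct / len(X)


def repeated_cv(X, y, top_k, k=5, repeats=500, seed=0):
    """Mean and standard error of accuracy using the first top_k columns.

    Columns of X are assumed to be ordered by feature rank (best first).
    """
    rng = random.Random(seed)
    X_sub = [row[:top_k] for row in X]
    accs = [kfold_accuracy(X_sub, y, k, rng) for _ in range(repeats)]
    return mean(accs), stdev(accs) / len(accs) ** 0.5
```

Calling `repeated_cv` for increasing values of `top_k` gives one row of mean accuracy and standard error per feature subset, which is the shape of comparison summarized in Table 4.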

Table 4. Classification metrics for different combinations of VOCs.

https://doi.org/10.1371/journal.pone.0277431.t004

We also performed a similarity analysis among the key detected VOCs across all the patients, and the results are shown in Fig 4. The distance-based similarity analysis of the 16 VOCs over all the patients is shown in Fig 4A. The results indicate that two VOCs, C4H8O and C10H20O, cluster separately and are less similar to the others across all the patients (Fig 4A). The remaining VOCs cluster together, which indicates their similarity over the samples irrespective of the cancer classes (Fig 4A). The correlation among the VOCs over all the patients is shown in Fig 4B. The correlation analysis indicates that three VOCs, i.e., C5H8O, C13H26O, and C7H11O, are negatively correlated with the other VOCs across all the samples (Fig 4B), whereas the remaining VOCs are somewhat positively correlated with one another. This analysis indicates a similarity in the molecular concentrations of the VOCs across the samples/patients observed through FT-ICR-MS technology.
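As a rough illustration of the pairwise quantities behind such an analysis, the sketch below computes Pearson correlations between VOC concentration profiles and converts them to a dissimilarity of 1 − r. The choice of 1 − r as the distance is an assumption, since the distance measure is not stated in this section; the function names and example profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length concentration profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distances(profiles):
    """Map each pair of VOC names to the assumed dissimilarity 1 - r.

    profiles: dict of VOC name -> concentrations over the same samples.
    """
    names = list(profiles)
    return {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names
        for b in names
    }


if __name__ == "__main__":
    # Hypothetical profiles over four samples; not data from this study.
    demo = {"voc_a": [1.0, 2.0, 3.0, 4.0], "voc_b": [4.0, 3.0, 2.0, 1.0]}
    print(correlation_distances(demo))
```

A matrix of this kind could serve as the input to a hierarchical clustering routine that produces a dendrogram.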

Fig 4. Similarity and correlation among the VOCs.

(A) Dendrogram plot for the VOCs. A dendrogram is a diagram that shows the hierarchical relationship between the VOCs and is obtained through hierarchical clustering method. (B) Correlation plot for the VOCs. The correlation values with white spots represent non-significant correlation among the VOCs at 5% level of significance.

https://doi.org/10.1371/journal.pone.0277431.g004

This study demonstrates that a few markers detected in exhaled breath samples can be used as marker signatures for distinguishing patients with lung cancer from patients with benign pulmonary nodules and healthy subjects (Supplementary Document S5 in S1 File). Here, the VOC markers for cancer detection were identified through the statistically sound Boot-SVM-RFE technique. Through this technique, a statistically meaningful measure, i.e., the adjusted p-value or FDR, was computed after correcting for multiple hypothesis testing. The adjusted p-value or FDR was then assigned to each VOC, and the significant VOCs were selected based on these values. This measure is easily interpretable by experimental biologists and lab users, as its values are well defined in [0, 1]: a lower p-value indicates a more informative VOC, and a higher p-value a less informative one. The random resampling procedure, i.e., the bootstrap method, used in Boot-SVM-RFE can eliminate spurious and arbitrary associations among the VOCs while detecting marker signatures [13, 16, 19]. Further, Boot-SVM-RFE is robust and does not require any distributional assumptions about the VOC data to obtain the distribution of the test statistic(s). After selecting a few VOC markers, we trained machine learning (i.e., SVM) based classification models to establish their relevance in classifying lung cancer patients. Our study found that instead of using all the VOCs, one can focus on a few (e.g., 3, 5, or 7) marker VOCs to detect lung cancer patients more robustly and accurately. Narrowing the search to a few VOCs, rather than all the VOCs present in breath samples, will save time and resources when detecting lung cancer from exhaled breath samples using FT-ICR-MS technology.

The major advantages of our experimental approach are as follows. First, the microreactors are designed with thousands of micropillars to provide higher capture rates of carbonyl compounds (e.g., VOCs) in breath samples [10–12]. Second, chemo-selective capture of carbonyl compounds through amino-oxy reactions simplifies the spectrum of compounds to be quantitated. Both steps are well established in the literature and can be further used to detect other diseases, including COVID-19, from exhaled breath samples. Third, a statistically efficient technique, Boot-SVM-RFE [19], was used to detect the markers from the VOC molecular concentration data. Fourth, the VOC signatures were validated in silico by training machine learning-based classification models. In other words, our approach combines VOC molecular data generation through microreactor-based FT-ICR-MS technology with analysis of the data using efficient statistical and machine learning techniques.

Conclusion

Early diagnosis of lung cancer is a key factor in increasing patient survival rates. The analysis of carbonyl compounds in exhaled breath is a promising non-invasive tool for diagnosing lung cancer at an early stage. In other words, metabolic carbonyl compounds in exhaled breath can play a vital role in the early detection of lung cancer, which can enhance the survival of the patients. Hence, identifying and characterizing the key metabolic VOCs with a proper analytical approach, and then using them to develop classification models, will play an important role in the quick and non-invasive detection of lung cancer. Therefore, in this study, we proposed an experimental approach that identifies the key VOCs through the Boot-SVM-RFE technique and uses them to distinguish patients with lung cancer from the benign pulmonary disease and healthy control classes.

Our analytical findings indicate that a small number of VOCs can be used for lung cancer detection with sufficient classification accuracy. For instance, seven common key VOCs, namely C4H8O2, C13H22O, C11H22O, C2H4O2, C7H14O, C6H12O, and C5H8O, can be successfully used for classification under the five different settings. In this study, we used a linear basis function in the Boot-SVM-RFE technique; it will be interesting to study non-linear basis functions or tree-based models (e.g., random forests) to capture non-linear associations among the VOCs. Further, our experimental and analytical approach to quantitative VOC analysis in breath samples may be extended to other diseases, including COVID-19 detection. Besides, the analytical method used in this study can be applied to high-throughput gene expression studies, including RNA-sequencing and single-cell RNA-sequencing, to select genes/biomarkers for identifying cancer patients or cell types. Also, the reported experimental approach can be applied to urine, saliva, and blood bioassay-based genetic studies to predict phenotypes by identifying organic compound-based bio-signatures.

Supporting information

S1 File. Supplementary documents S1-S5.

This file contains supporting documents from S1 to S5. Supplementary Document S1: SVM training for two class classification; Supplementary Document S2: Sliding Windows Size technique; Supplementary Document S3: List of VOC and their common names; Supplementary Document S4: Distribution of Key VOCs; Supplementary Document S5: Principal Component plots for visualizing patient classes.

https://doi.org/10.1371/journal.pone.0277431.s001

(DOCX)

S1 Table. Summary statistics of all the VOCs considered in this study.

https://doi.org/10.1371/journal.pone.0277431.s002

(XLSX)

S2 Table. Statistical significance values computed through Boot-SVM-RFE for all the VOCs used in the study for all patient class combinations.

https://doi.org/10.1371/journal.pone.0277431.s003

(XLSX)

Acknowledgments

In collaboration with Dr. M Bousamra, Dr. X Fu’s lab performed the bioassays. The support to Dr. Samarendra Das from ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India, is duly acknowledged.

References

  1. American Cancer Society. Facts & Figures 2021. Atlanta, Ga.; 2021.
  2. Chang JE, Lee DS, Ban SW, Oh J, Jung MY, Kim SH, et al. Analysis of volatile organic compounds in exhaled breath for lung cancer diagnosis using a sensor system. Sensors Actuators B Chem. 2018;255: 800–807.
  3. Patz EF, Pinsky P, Gatsonis C, Sicks JD, Kramer BS, Tammemägi MC, et al. Overdiagnosis in Low-Dose Computed Tomography Screening for Lung Cancer. JAMA Intern Med. 2014;174: 269–274. pmid:24322569
  4. Li M, Yang D, Brock G, Knipp RJ, Bousamra M, Nantz MH, et al. Breath carbonyl compounds as biomarkers of lung cancer. Lung Cancer. 2015;90: 92–97. pmid:26233567
  5. Pauling L, Robinson AB, Teranishi R, Cary P. Quantitative analysis of urine vapor and breath by gas-liquid partition chromatography. Proc Natl Acad Sci U S A. 1971;68: 2374–2376. pmid:5289873
  6. Phillips M, Herrera J, Krishnan S, Zain M, Greenberg J, Cataneo RN. Variation in volatile organic compounds in the breath of normal humans. J Chromatogr B Biomed Sci Appl. 1999;729: 75–88. https://doi.org/10.1016/S0378-4347(99)00127-9 pmid:10410929
  7. Fu X-A, Li M, Knipp RJ, Nantz MH, Bousamra M. Noninvasive detection of lung cancer using exhaled breath. Cancer Med. 2014;3: 174–181. pmid:24402867
  8. Schumer EM, Trivedi JR, van Berkel V, Black MC, Li M, Fu X-A, et al. High sensitivity for lung cancer detection using analysis of exhaled carbonyl compounds. J Thorac Cardiovasc Surg. 2015;150: 1517–1524. pmid:26412316
  9. Bousamra M, Schumer E, Li M, Knipp RJ, Nantz MH, van Berkel V, et al. Quantitative analysis of exhaled carbonyl compounds distinguishes benign from malignant pulmonary disease. J Thorac Cardiovasc Surg. 2014;148: 1074–1081. pmid:25129599
  10. Fu X-A, Li M, Biswas S, Nantz MH, Higashi RM. A novel microreactor approach for analysis of ketones and aldehydes in breath. Analyst. 2011;136: 4662–4666. pmid:21897949
  11. Li M, Biswas S, Nantz MH, Higashi RM, Fu X-A. Preconcentration and Analysis of Trace Volatile Carbonyl Compounds. Anal Chem. 2012;84: 1288–1293. pmid:22145792
  12. Li M, Biswas S, Nantz MH, Higashi RM, Fu X-A. A microfabricated preconcentration device for breath analysis. Sensors Actuators B Chem. 2013;180: 130–136.
  13. Das S, Rai A, Mishra DC, Rai SN. Statistical approach for selection of biologically informative genes. Gene. 2018;655. pmid:29458166
  14. Duan KB, Rajapakse JC, Wang H, Azuaje F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobioscience. 2005. pmid:16220686
  15. Guyon I. Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn. 1998.
  16. Das S, Rai SN. Statistical approach for biologically relevant gene selection from high-throughput gene expression data. Entropy. 2020;22. pmid:33286973
  17. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002.
  18. Wang J, Chen L, Wang Y, Zhang J, Liang Y, Xu D. A Computational Systems Biology Study for Understanding Salt Tolerance Mechanism in Rice. Xu Y, editor. PLoS One. 2013;8: e64929. pmid:23762267
  19. Das S, Meher PK, Rai A, Bhar LM, Mandal BN. Statistical Approaches for Gene Selection, Hub Gene Identification and Module Interaction in Gene Co-Expression Network Analysis: An Application to Aluminum Stress in Soybean (Glycine max L.). Tian Z, editor. PLoS One. 2017;12: e0169605. pmid:28056073
  20. Das S, McClain CJ, Rai SN. Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges. Entropy. 2020;22: 427. pmid:33286201
  21. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boston, MA: Springer US; 1993. https://doi.org/10.1007/978-1-4899-4541-9
  22. Benjamini Y, Hochberg Y. Multiple Hypotheses Testing with Weights. Scand J Stat. 1997;24: 407–418.