Multigroup prediction in lung cancer patients and comparative controls using signature of volatile organic compounds in breath samples

Early detection of lung cancer is a crucial factor for increasing its survival rates among the detected patients. The presence of carbonyl volatile organic compounds (VOCs) in exhaled breath can play a vital role in early detection of lung cancer. Identifying these VOC markers in breath samples through innovative statistical and machine learning techniques is an important task in lung cancer research. Therefore, we proposed an experimental approach for generation of VOC molecular concentration data using unique silicon microreactor technology and further identification and characterization of key relevant VOCs important for lung cancer detection through statistical and machine learning algorithms. We reported several informative VOCs and tested their effectiveness in multi-group classification of patients. Our analytical results indicated that seven key VOCs, including C4H8O2, C13H22O, C11H22O, C2H4O2, C7H14O, C6H12O, and C5H8O, are sufficient to detect the lung cancer patients with higher mean classification accuracy (92%) and lower standard error (0.03) compared to other combinations. In other words, the molecular concentrations of these VOCs in exhaled breath samples were able to discriminate the patients with lung cancer (n = 156) from the healthy smoker and nonsmoker controls (n = 193) and patients with benign pulmonary nodules (n = 65). The quantification of carbonyl VOC profiles from breath samples and identification of crucial VOCs through our experimental approach paves the way forward for non-invasive lung cancer detection. Further, our experimental and analytical approach of VOC quantitative analysis in breath samples may be extended to other diseases, including COVID-19 detection.

Introduction
Lung cancer is the most common kind of cancer globally due to poor lifestyle and environmental pollution. In the USA, lung cancer had an incidence of 235,760 new cases with 131,880 deaths in 2020 and is by far the leading cause of cancer deaths among both men and women, making up almost 25% of all cancer fatalities [1]. Every year, more people die of lung cancer than of colon, breast, and prostate cancers combined [2]. The 5-year survival rate for lung cancer patients is relatively low compared to other diseases. Early detection of lung cancer is crucial to improve the survival of the patients [2]. For this purpose, many screening methods, including chest radiography, sputum cytology, low-dose spiral computed tomography, fluorescence bronchoscopy, and positron emission tomography, have been used [3]. However, these procedures are complicated, expensive, and time-consuming, which makes them difficult to access for poor or low-income patient groups. Therefore, lung cancer detection through breath analysis using volatile organic compounds (VOCs) can be a vital tool for early detection and, in turn, for increasing the chance of survival.
VOCs are organic chemicals that exist in air, exhaled breath, etc. and have high vapor pressure at ambient temperature [4]. The analysis of VOCs in exhaled breath plays a vital role in the early detection of lung cancer [4]. Earlier studies showed that there are several thousand VOCs in human breath, e.g., formaldehyde (CH2O), acetaldehyde (C2H4O), acetone (C3H6O), and 2-butanone (C4H8O), to name a few. So far, many VOCs have been identified and reported in breath samples of healthy humans and patients with lung cancer [5]. For instance, in 1999, Phillips et al. reported over 3400 different VOCs present in the exhaled breath of healthy humans [6].
The combinations of some VOCs reported in the literature can comfortably discriminate the lung cancer patients from the healthy ones [7]. For instance, a recent study showed higher sensitivity of lung cancer detection among patients using the carbonyl VOCs present in the exhaled breath samples [8]. In other words, the molecular concentration data on the VOCs present in exhaled breath samples were collected from the patients to classify them into several groups, i.e., Healthy Control, Cancer, and Benign Nodule. In this direction, Bousamra et al. (2014) developed one quantitative analytical approach to differentiate early lung cancer from benign pulmonary nodules based on the molecular concentration of carbonyl compounds in exhaled breath samples [9]. In the existing analytical approaches, statistical methods including the t-test and Wilcoxon test were used to select the significant VOCs [7,9]. These methods are univariate in nature and mostly ignore the inter-VOC relationships (i.e., relationship among the VOCs), and there is a chance of spurious association of VOC with the patient classes [16]. Further, Li et al. (2015) developed a technique utilizing quaternary amino-oxy coated silicon micro-reactors [4] for selective capture and quantification of the ketones and aldehydes in the air [10][11][12] and exhaled breath [7][8][9]. They used five classification models, including generalized partial least squares, support vector machines, random forests, linear and quadratic discriminant analyses to classify the patients into lung cancer and control groups based on exhaled breath data [4]. Sufficient studies in the literature indicated that the carbonyl compounds in exhaled breath play a significant role in the non-invasive detection of lung cancer [7][8][9][10][11][12]. Furthermore, the proper identification of key carbonyl compounds through statistical and machine learning techniques requires further advances.
In a typical breath sample study, the molecular concentration data on several hundred(s) of endogenous and exogenous VOCs are usually observed over patients. Sometimes, it may not be experimentally possible to monitor data on all the VOCs and further use them in lung cancer detection. In other words, among these hundred(s) of VOCs, all may not be required for the patient classification or the predictive model building process (i.e., training the machine learning models and later use them for class label predictions). Therefore, it is pertinent to select/identify a few metabolic VOCs related to lung cancer as key and significant features for cancer detection. The selection of important features (here metabolic VOCs) out of many VOCs is called feature selection in machine learning [13]. Further, it is essential to determine the number of significant VOCs (e.g., feature size or dimension of VOC data), which can be used in the training of the classification model to predict the class type of lung cancer patients. The selection of significant VOCs will save the precious time and cost of data generation for all VOCs present in the breath samples. In other words, the researchers can focus on few VOCs instead of generating data on all the VOCs present in breath samples of the patients.
Therefore, in this study, we endeavor to classify the lung cancer patients based on the VOC molecular concentration data. We present an experimental approach for lung cancer prediction using the carbonyl VOCs present in breath samples. The VOC molecular concentration data are generated from the breath samples of the 414 subjects (156: lung cancer; 65: benign and 193: healthy control) through the unique silicon microreactor technology [8]. The breath samples are collected following the experimental protocol approved by the Institutional Review Board (IRB) at the University of Louisville, USA. We also present an analytical approach involving relevant VOC selection and further used them in lung cancer patient classification model training. This approach of VOC selection is statistically sound, robust, and does not require any probability distributional assumptions about the VOC data (for VOCs testing and selection). Further, we identified several informative VOCs present in exhaled breath samples to detect lung cancer patients. For instance, the developed models provided sufficient classification accuracy for lung cancer detection with a minimum of three VOCs. Also, we studied the effect of the various VOC combinations on the classification of lung cancer patients. Moreover, our developed experimental approach can be applied to detect COVID-19 patients using VOC data from the exhaled breath samples.
The remainder of the paper is organized as follows: (i) the materials and methods section gives the detailed protocols for data generation and describes the methodology; (ii) the results and discussion section presents the obtained results along with their discussion; and (iii) the conclusion section summarizes the manuscript along with its future scope.

Breath sampling and data generation
This study recruited 156 patients with untreated lung cancer, 65 patients with benign pulmonary nodules, and 193 healthy control subjects to provide exhaled breath samples. The detailed subject demographic characteristics, disease information, and breath analysis data have been published [8]. In brief, there were 103 patients with early stages (Stages 0, I, and II) of lung cancer and 53 patients with late stages of lung cancer. Most lung cancer patients were current or former smokers (149). The healthy controls included 113 current or former smokers and 80 never smokers. The average ages of the lung cancer patients, patients with benign pulmonary nodules, and healthy subjects were 65.1, 54.2, and 49.4 years, respectively. The male percentages of these three subgroups were 51.9%, 49.2%, and 55.7%, respectively.
The detailed research protocol for the collection of exhaled breath samples was approved by the IRB, University of Louisville, USA (IRB #15.0711). The healthy control subjects were recruited from patient family members who were free of lung cancer or other chronic pulmonary disease. All patients with pulmonary nodules were recruited in the James Graham Brown Cancer Center and Jewish Hospital at the University of Louisville, USA. The diagnostic predictions from these breath analyses were confirmed by clinical diagnoses using follow-up CT scans, positron emission tomography scans, or pathology of biopsy or surgically resected specimens.
Exhaled breath samples were collected in one-liter Tedlar bags (Sigma-Aldrich, USA) through normal exhalation, allowing for the collection of a mixture of alveolar and tidal breath in one exhaled breath. Ambient clinic exam room air samples (1 L) were also collected to serve as a control for background carbonyl compounds in the collection room. Our previous studies have described in detail the method of breath sample collection, evacuation of breath samples through the micro-reactors, and analysis of the samples [4,7,8,9]. Briefly, the subjects exhaled directly into Tedlar bags through a Teflon tube from the mouth to provide one-liter exhaled breath samples, a non-invasive collection technique that the patients readily accepted. After the collection of exhaled breath, the Tedlar bags were directly connected to the silicon micro-reactors through silica capillary tubes and septa. A vacuum was applied to draw the collected VOCs from the Tedlar bags through the fabricated microreactor [10][11][12] at a flow rate of 5 mL/min. The microreactor has thousands of micropillars coated with 2-(aminooxy)-N,N,N-trimethyl-ethanammonium (ATM) iodide. After complete deflation of the sample bags, ATM and ATM adducts were eluted by flowing methanol (~100 μL) from a pressurized vial through the microreactor and into a collection vial [10]. The eluent methanol solutions were directly analyzed using a hybrid linear ion trap-Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR-MS) instrument (Finnigan LTQ-FT, Thermo Electron, Bremen, Germany) equipped with a TriVersa NanoMate ion source (Advion BioSciences, Ithaca, NY) and a nano-electrospray chip (inner nozzle diameter 5.5 μm). A 5 μL methanol solution of a known amount of 5 nmol of ATM-acetone-d6 adduct was added to each eluted methanol sample as an internal reference for FT-ICR-MS.
The amount of captured carbonyl compounds was then determined by comparing the FT-ICR-MS signal abundance of ATM-acetone-d6 with those of the other ATM-carbonyl adducts. The concentration of each compound in exhaled breath detected by FT-ICR-MS was then calculated from the amount of the captured carbonyl compound and expressed in nanomoles per liter (nmol/L). The microreactor's carbonyl capture efficiencies and the validity of the analyses have been characterized in our previous studies [10][11][12]. A flow chart of the bioassay for generating data is displayed in Fig 1. Further, the workflow of the proposed experimental approach, including data generation, feature selection, and classification model development, is also illustrated in Fig 1. This study was approved by the IRB at the University of Louisville (IRB #15.0711). An informed written consent form was reviewed by the subjects for participation. A signed consent form was obtained from the subjects before they participated in the study. No minors were recruited for the study.
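The quantification step described above, in which each ATM-carbonyl adduct signal is compared with the ATM-acetone-d6 internal reference, can be illustrated with a short calculation. This is a minimal sketch, assuming that every adduct has the same FT-ICR-MS response as the reference so that the captured amount scales linearly with the signal ratio. The function name, the example abundances, and the default arguments are illustrative assumptions and do not come from the study's software.

```python
def carbonyl_concentration(adduct_abundance, reference_abundance,
                           reference_nmol=5.0, breath_volume_l=1.0):
    """Estimate a carbonyl concentration in nmol/L from signal abundances.

    Assumption (not from the source): the adduct and the ATM-acetone-d6
    reference have equal response, so amount is proportional to abundance.
    """
    if reference_abundance <= 0 or breath_volume_l <= 0:
        raise ValueError("reference abundance and volume must be positive")
    # Amount captured on the microreactor, scaled from the known reference amount.
    captured_nmol = reference_nmol * adduct_abundance / reference_abundance
    # Normalize by the volume of breath drawn through the microreactor.
    return captured_nmol / breath_volume_l


# Hypothetical abundances: the adduct signal is one fifth of the reference signal.
print(carbonyl_concentration(2.0e5, 1.0e6))
```

In practice, compound-specific response factors and the background air sample would also need to be taken into account; they are left out here to keep the arithmetic visible.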

Support Vector Machine-Recursive Feature Elimination
Support Vector Machine-Recursive Feature Elimination (SVM-RFE) can be used for the selection of relevant VOCs (in a two-group case) from the lung cancer VOC data [14,15]. Let $\{x_m, y_m\} \in \mathbb{R}^N \times \{-1, 1\}$ be the input given to the Support Vector Machine (SVM) model. Here, we wish to find a hyperplane that separates the lung cancer patients ($y_m = 1$) from the control class ($y_m = -1$) in such a way that the distance between the hyperplane and the closest point is maximized (Supplementary Document S1 in S1 File). The hyperplane can be written as:

$$k^{T}x + b = \sum_{i=1}^{N} k_i x_i + b = 0 \tag{1}$$

where $k_i$ and $b$ are the weight of the $i$th VOC and the bias, respectively. Here, we assume that the patients of the two classes are linearly separable. In other words, we can select two parallel support hyperplanes that separate the lung cancer and control classes in such a way that the distance between these two planes is maximized (Supplementary Document S1 in S1 File). For the lung cancer class ($y_m = 1$), the support hyperplane can be written as:

$$k^{T}x + b = 1 \tag{2}$$

Here, Eq 2 holds only for those $x$ that are support vectors (i.e., the points closest to the separating hyperplane, Eq 1).
For the control class ($y_m = -1$), the support hyperplane can be written as:

$$k^{T}x + b = -1 \tag{3}$$

Now, we assume that every point lies on or beyond its respective support hyperplane (Eq 2 and Eq 3), i.e., its distance to the separating hyperplane is at least the distance between the support vectors and the separating hyperplane, which can be expressed as:

$$y_m\left(k^{T}x_m + b\right) \geq 1 \quad \text{for every patient } m \tag{4}$$

We wish to maximize the distance between the lung cancer and control support hyperplanes given in Eqs 2 and 3, respectively. Here, special care must be taken to prevent any data points from falling in between these two support hyperplanes.
To maximize the distance between the support hyperplanes, which equals $2/\|k\|_2$ (Supplementary Document S1 in S1 File), we need to minimize $\|k\|_2^2$ (equivalently, $\frac{1}{2}\|k\|_2^2$) under the constraint of Eq 4. Mathematically, the objective function can be written as:

$$L(k, b, \varphi) = \frac{1}{2}\|k\|_2^2 - \sum_{m} \varphi_m\left[y_m\left(k^{T}x_m + b\right) - 1\right] \tag{5}$$

where $\varphi_m\ (\geq 0)$ are Lagrange multipliers. The constraint on $\varphi_m$ follows from the fact that the constraint in Eq 4 is an inequality. Here, the $k_i$'s are obtained by minimizing the objective function in Eq 5. The objective function (Eq 5) was optimized with respect to $k_i$ and $b$, and the following expressions are obtained:

$$\frac{\partial L}{\partial k} = k - \sum_{m} \varphi_m y_m x_m = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{m} \varphi_m y_m = 0 \tag{6}$$

The value of $k$ can be obtained by solving the system of linear equations given in Eq 6 and is expressed as:

$$k = \sum_{m} \varphi_m y_m x_m \tag{7}$$

Here, one also needs to solve for $\varphi_m$, but this cannot be done simply by setting the gradient to zero as in Eq 6, because the constraint $\varphi_m \geq 0$ must be taken into account. Therefore, SVMs cannot be solved via a linear system of equations. Rather, iterative optimization algorithms (e.g., gradient descent or sequential minimal optimization) are commonly used in SVM-based tools. For this purpose, we used the svm function implemented in the e1071 R package.
The $k_i^2\ (\geq 0)$ (i.e., the square of the $i$th element of $k$ in Eq 7) is used as a metric for ranking the VOCs in the data [15] and for evaluating the impact of the $i$th VOC on patients' classification [16]. In the SVM-RFE technique, the VOC with the smallest $k_i^2$ is eliminated iteratively in a backward elimination manner, and a ranked VOC list is prepared at the end [15,17]. Here, backward elimination means training the SVM-based machine learning models iteratively after removing the least significant VOC at each step. Moreover, most feature selection methods, such as SVM-RFE, are sensitive to slight permutations of the class labels [18]. The ranking of VOCs may also lead to the selection of spuriously associated VOCs and make the selection process unreliable [13,19,20]. Therefore, it is essential to select VOCs based on statistical testing instead of their ranks. For this purpose, we used the Bootstrap SVM-RFE (Boot-SVM-RFE) technique developed by Das et al. (2017) to select the most relevant VOCs for patient classification [19]. The Boot-SVM-RFE method is briefly described in the following section.
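The backward elimination loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the e1071-based implementation used in the study: the linear SVM is fitted here by plain full-batch subgradient descent on a soft-margin hinge loss, and the function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, epochs=500, eta=0.1, lam=0.01):
    """Fit weights k and bias b of a linear soft-margin SVM.

    Stand-in solver (an assumption): full-batch subgradient descent on
    lam/2 * ||k||^2 + mean hinge loss, not the solver used in the paper.
    """
    n_samples, n_features = len(X), len(X[0])
    k = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_k = [lam * ki for ki in k]
        grad_b = 0.0
        for x_m, y_m in zip(X, y):
            margin = y_m * (sum(ki * xi for ki, xi in zip(k, x_m)) + b)
            if margin < 1.0:  # point violates the margin constraint
                for i, xi in enumerate(x_m):
                    grad_k[i] -= y_m * xi / n_samples
                grad_b -= y_m / n_samples
        k = [ki - eta * gi for ki, gi in zip(k, grad_k)]
        b -= eta * grad_b
    return k, b


def svm_rfe(X, y):
    """Return feature indices ranked from most to least relevant."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[i] for i in remaining] for row in X]
        k, _ = train_linear_svm(X_sub, y)
        # Drop the feature with the smallest squared weight k_i^2.
        worst = min(range(len(remaining)), key=lambda j: k[j] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated = top ranked


# Toy data (hypothetical): feature 0 separates the classes, features 1-2 are noise.
X = [[2.0, 0.3, -0.1], [1.5, -0.2, 0.2], [2.5, 0.1, -0.3], [1.8, -0.3, 0.1],
     [-2.0, 0.2, 0.1], [-1.5, -0.1, -0.2], [-2.5, 0.3, 0.3], [-1.8, -0.3, -0.1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
ranking = svm_rfe(X, y)
print(ranking)  # feature 0 comes first in the ranking
```

The model is retrained after every removal, which is what distinguishes recursive elimination from ranking the weights of a single fitted model.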

Bootstrap SVM-RFE
In the usual supervised setting, the M patients/samples (as columns) in the data matrix X, each belonging to either the lung cancer or the control class, can be considered subjects/units in a population model, as shown in Eq 8.
where, [...] [15,19]. So, we set B = 500, as the literature showed that the number of bootstrap samples required for obtaining the distribution of the test statistic(s) must be sufficiently large [18,21]. The B bootstrap data matrices are given as input to the SVM-RFE technique to compute the $k_i^2$ scores (given in Eq 7), and subsequently, VOC ranking was performed on each of the B bootstrap data matrices. Let $P_{ib}$ be a random variable (rv) that indicates the position of the $i$th VOC (i.e., the rank obtained from the SVM-RFE) in the $b$th bootstrap data matrix. Then, another rv can be defined based on $P_{ib}$ (without loss of generality), given as: [...] where $R_{ib}$ in Eq 9 is the rank score of the $i$th ($i = 1, 2, \ldots, N$) VOC in the $b$th ($b = 1, 2, \ldots, B$) bootstrap data matrix. Here, it may be noted that the distribution of the rank scores of VOCs, computed from a bootstrap data matrix, is symmetric around the median value, 0.5 (as rank scores are functions of ranks).
To decide whether the $i$th VOC is relevant or not for the patient classification, the following hypotheses are framed:

$$H_0: R_i \leq 0.5 \quad \text{vs.} \quad H_1: R_i > 0.5$$

where $R_i$ is the rank score for the $i$th VOC over all possible bootstrap samples. To obtain the distribution of the test statistic under $H_0$, we defined another rv $Z_{ib}$, as:

$$Z_{ib} = \begin{cases} 1, & \text{if } R_{ib} > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

Let $r_{ib}$ be another rv representing the rank assigned to $|R_{ib} - 0.5|$ (after arranging the values in ascending order of their magnitudes). In other words, for each row (which denotes a VOC), $(R_{ib} - 0.5)$ was computed and the corresponding rank, $r_{ib}$, was assigned to its magnitude.
To test $H_0$ vs. $H_1$, the test statistic for the $i$th VOC, $W_i$, is developed and is given as:

$$W_i = \sum_{b=1}^{B} U_{ib} \tag{11}$$

where $U_{ib} = Z_{ib} r_{ib}$. In other words, $W_i$ (Eq 11) is the sum of the ranks of the positive $(R_{ib} - 0.5)$ values for the $i$th VOC over the B bootstrap samples. Further, $U_{ib}$ in Eq 11 is a Bernoulli-type rv, and its probability mass function under $H_0$ can be given as:

$$P(U_{ib} = u) = \begin{cases} 1/2, & u = r_{ib} \\ 1/2, & u = 0 \end{cases} \tag{12}$$

Here, the expected value of $W_i$ in Eq 11 under $H_0$ can be obtained as:

$$E(W_i) = \sum_{b=1}^{B} E(U_{ib}) = \frac{1}{2}\sum_{b=1}^{B} b = \frac{B(B+1)}{4} \tag{13}$$

The variance of $W_i$ becomes:

$$V(W_i) = \sum_{b=1}^{B} V(U_{ib}) = \frac{1}{4}\sum_{b=1}^{B} b^2 = \frac{B(B+1)(2B+1)}{24} \tag{14}$$

As B is sufficiently large, under the central limit theorem, the distribution of $W_i$, given in Eq 11, becomes:

$$\frac{W_i - B(B+1)/4}{\sqrt{B(B+1)(2B+1)/24}} \sim N(0, 1) \tag{15}$$

Through Eq 15, the p-value for the $i$th ($i = 1, 2, \ldots, N$) VOC was computed, and this testing procedure was repeated for the remaining $N - 1$ VOCs. In other words, the above statistical test was repeated N times to compute the statistical significance values for all the VOCs.
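The test described above has the form of a one-sample signed-rank test with a normal approximation, which can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the rank-score formula (N + 1 - P)/(N + 1) is an assumed choice that keeps the scores symmetric around 0.5, ties are broken by bootstrap order instead of by average ranks, and the p-value is taken as the upper tail of the standard normal distribution. All function names are hypothetical.

```python
import math


def rank_scores(positions, n_vocs):
    """Map SVM-RFE positions (1 = top ranked) to scores in (0, 1).

    Assumed formula: (N + 1 - P) / (N + 1), symmetric around 0.5.
    """
    return [(n_vocs + 1 - p) / (n_vocs + 1) for p in positions]


def boot_test_p_value(scores):
    """Return (W, p) for H0: R <= 0.5 vs. H1: R > 0.5 from B rank scores."""
    B = len(scores)
    deviations = [r - 0.5 for r in scores]
    # Rank |R_b - 0.5| in ascending order; ties keep bootstrap order (simplification).
    order = sorted(range(B), key=lambda b: abs(deviations[b]))
    ranks = [0] * B
    for rank, b in enumerate(order, start=1):
        ranks[b] = rank
    # W is the sum of ranks attached to positive deviations (Z_b = 1).
    W = sum(ranks[b] for b in range(B) if deviations[b] > 0)
    mean = B * (B + 1) / 4
    variance = B * (B + 1) * (2 * B + 1) / 24
    z = (W - mean) / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail P(Z >= z)
    return W, p_value


# Hypothetical example: a VOC ranked first among 27 VOCs in all 500 bootstrap runs.
W, p = boot_test_p_value(rank_scores([1] * 500, 27))
print(W, p)
```

A VOC that is consistently ranked near the top yields a large W and a small p-value, whereas a VOC ranked near the bottom yields W close to zero and a p-value close to one.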
Let $p_1, p_2, \ldots, p_N$ be the corresponding p-values for all the VOCs, and $\alpha$ be the desired significance level. Because N hypotheses are tested simultaneously, we employed Hochberg's procedure [22] to correct for multiple testing and computed the adjusted (adj.) p-values for the VOCs. The algorithm for Hochberg's procedure [22] is as follows: First, the p-values of the VOCs were sorted in increasing order of their magnitude, shown as $p_{(1)}, p_{(2)}, \ldots, p_{(N)}$, where $p_{(i)}$ is the $i$th ordered p-value (i.e., $p_{(1)}$: smallest p-value, $p_{(2)}$: second smallest p-value, and so on).
Step 1. If $p_{(N)} > \alpha$, then retain the corresponding null hypothesis and go to the next step. Else reject all hypotheses and stop.
Step i = 2, 3, ..., N−1. If $p_{(N-i+1)} > \alpha/i$, then retain the corresponding null hypothesis and go to the next step. Else reject all remaining hypotheses and stop.
Step N. If $p_{(1)} > \alpha/N$, then retain the corresponding null hypothesis. Else reject it.
Now, the adj. p-values, denoted $\tilde{p}_{(i)}$, are given recursively beginning with the largest p-value [22]:

$$\tilde{p}_{(N)} = p_{(N)}, \qquad \tilde{p}_{(i)} = \min\left\{\tilde{p}_{(i+1)},\ (N - i + 1)\,p_{(i)}\right\}, \quad i = N-1, \ldots, 1 \tag{16}$$

Based on the computed adj. p-values, the relevant VOCs were selected from the data. In other words, a smaller adj. p-value indicates greater relevance of the VOC for the patient classification and vice-versa. Similarly, this procedure was applied to select the significant VOCs for the patient classification into other classes, such as benign vs. control, (lung cancer + benign) vs. control, and (control + benign) vs. lung cancer. The outline and key analytical steps of the VOC selection process are shown in Fig 2. The effects of the significant VOCs on the patient classification were studied through an SVM classifier (with a linear basis function). Under this setting, the VOCs (and their molecular concentration data) were given as inputs to the SVM classifier to compute the classification-based performance metrics. Further, the impacts of the VOCs on the classification of patients under five different cases, including Case I: Cancer vs. Control; Case II: Cancer vs. Benign; Case III: Benign vs. Control; Case IV: (Benign + Cancer) vs. Control; and Case V: (Control + Benign) vs. Cancer, were assessed through performance metrics, such as mean classification accuracy (CA) and standard error (SE) in CA, computed through a varying sliding window size technique [18,19]. Here, we used this technique to study the importance of the rankings of VOCs (obtained from a feature selection method, e.g., Boot-SVM-RFE) through training a classification model.

PLOS ONE
In other words, the sliding windows are VOC intervals that literally "slide" across the whole VOC list, preferably by some constant distance, and CA is computed for each sliding window. Sliding windows can overlap or be mutually exclusive. A brief description of the varying sliding window size technique is given in Supplementary Document S2 in S1 File. Then, the mean CA and SE in CA were computed over the sliding windows. In other words, the size $n$ ($n \leq N$) of the VOC set that provides maximum discrimination between the subjects/patients of the two groups, as measured by CA, is considered the optimal size of the VOC set. The expressions for mean CA and SE in CA computed through the varying sliding window size technique are given in Eqs 17 and 18.
$$\overline{CA} = \frac{1}{K}\sum_{k=1}^{K} CA_k \tag{17}$$

$$SE_{CA} = \sqrt{\frac{1}{K(K-1)}\sum_{k=1}^{K}\left(CA_k - \overline{CA}\right)^2} \tag{18}$$

where $CA_k$ is the classification accuracy obtained for the $k$th window. The total number of windows, denoted as K in Eqs 17 and 18, is defined in Eq 19.

$$K = \frac{n - S}{L} + 1 \tag{19}$$

where n is the size of the VOC set, S is the size of the windows (i.e., the number of ranked VOCs in a window), and L is the sliding length. The values of n, S, L, and K are given in Supplementary Document S2 in S1 File. The optimal size of the VOC set was computed through training the SVM-based classification model.
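The multiple-testing correction and the window-based summary described above can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the Hochberg adjustment follows the standard step-up recursion, the window count assumes K = (n - S)/L + 1 with integer division, and the standard error assumes the usual sample standard deviation divided by the square root of K. The function names and example values are hypothetical.

```python
import math


def hochberg_adjust(p_values):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value (rank N) down to the smallest (rank 1).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, (n - rank + 1) * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


def window_count(n, size, step):
    """Number of sliding windows of `size` VOCs moved by `step` over n VOCs."""
    return (n - size) // step + 1


def mean_and_se(accuracies):
    """Mean classification accuracy and its standard error over K windows."""
    K = len(accuracies)
    mean = sum(accuracies) / K
    se = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (K * (K - 1)))
    return mean, se


# Hypothetical p-values and per-window accuracies.
print(hochberg_adjust([0.01, 0.04, 0.03]))
print(window_count(16, 4, 2))
print(mean_and_se([0.90, 0.92, 0.94]))
```

Because the adjustment is monotone in the sorted p-values, selecting the VOCs whose adjusted p-value falls below the significance level gives the same decisions as the stepwise rule.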

Results and discussion
We observed the carbonyl VOCs having carbon atoms ranging from one (formaldehyde: [...] Fig 1 shows typical relative abundances of these VOCs captured in breath samples of lung cancer patients, patients with benign nodules, and healthy controls. Structurally, isomeric ketones and aldehydes are indistinguishable by direct infusion of one-dimensional FT-ICR-MS. However, the molecular weight measured at a resolving power of 200,000 provides their accurate chemical formulas. Separation and structure identification of some important isomeric ketones and aldehydes were done through FT-ICR-MS/MS and GC-MS technologies. Further, the summary statistics of the 27 considered VOCs for the three patient classes, i.e., control, benign, and lung cancer, are given in S1 Table. The mean values of VOCs such as C3H6O and C2H4O are higher than those of the others across all the patient classes, which indicates their higher average concentrations in breath samples (S1 Table). Not all 27 VOCs may be considered biomarkers for lung cancer detection; therefore, we used the Boot-SVM-RFE technique to identify the significant VOCs for various combinations of patient classes. Here, five different classification cases were used, i.e., Case I: Cancer (156) vs. [...] (Fig 3). Through the Boot-SVM-RFE, the statistical significance values (p-values) for the VOCs were computed for the five classification problems and are shown in Table 1 [...], were found to be statistically significant (at the 1% level of significance) for all the cases of classification (Table 1). This finding indicated that these VOCs can be used as biomarkers for the patient classification. Broadly, we found 16 VOCs to be statistically significant in at least one case of patients' classification (Table 1). The summary statistics for these key VOCs for the three patient classes are shown in [...] (Fig 3, Table 2). Similarly, for cancer vs. benign, benign vs. control, and (benign + cancer) vs. control classifications, 15 VOCs were found to be statistically significant at the 1% level of significance (Fig 3, Table 1). Further, for (control + benign) vs. cancer classification, 13 VOCs were found to be statistically significant (Table 1). The rankings of the VOCs for five different patient [...] Table 3. For instance, the VOCs C4H8O and C4H8O2 ranked 1 and 2, respectively, for cancer vs. control classification (Table 3). These ranked VOCs can be used as biomarkers for cancer detection with respect to healthy controls. Similar interpretations about the ranking of the VOCs in the other classifications can be made, as shown in Table 3. The empirical distributions of the significant VOCs are shown in Supplementary Document S4 in S1 File.
We also studied the effect of different combinations of the VOCs on classifying the patients into various classes, such as cancer vs. control, cancer vs. benign, benign vs. control, (benign + cancer) vs. control, and (benign + control) vs. cancer, through the SVM-based classification model. Here, the classification accuracy was computed through a five-fold cross-validation technique for each classification problem. The cross-validations of the data were repeated 500 times by taking different combinations of VOCs (based on their ranks) for each classification problem. Then, the mean and standard error of the classification accuracies were computed over the 500 runs through a varying sliding window size technique [13,16,19]. The results are shown in [...] (Table 3) were included in the data (Table 4). Similar interpretations can be made for the other patient classifications, such as cancer vs. benign, benign vs. control, (benign + cancer) vs. control, and (benign + control) vs. cancer. However, we observed consistently better results when the top-ranked nine VOCs (as given in Table 3) were considered in all five classification problems. We also performed a similarity analysis among the key detected VOCs across all the patients, and the results are shown in Fig 4. For instance, the distance-based similarity analysis of the 16 VOCs over all the patients is shown in Fig 4A. The results indicated that the VOCs, such as [...] (Fig 4A). These two VOCs have less similarity with the others across all the patients (Fig 4A). The remaining VOCs are clustered together, which indicates their similarity over the samples irrespective of the cancer classes (Fig 4A). The correlation among the VOCs over all the patients is shown in Fig 4B. [...] (Fig 4B). Further, the remaining VOCs are somewhat positively correlated with the others. This analysis indicated a similarity in the molecular concentrations of the VOCs across the samples/patients observed through FT-ICR-MS technology.
This study demonstrates that few detected markers in exhaled breath samples can be used as marker signatures for distinguishing patients with lung cancer from patients with benign pulmonary nodules and healthy subjects (Supplementary Document S5 in S1 File). Here, the VOC markers for cancer detection are identified through the statistically sound Boot-SVM-RFE technique. Through this technique, a statistically meaningful measure, i.e., adjusted p-value or FDR, was computed after correcting the multiple hypothesis testing problems. Then the adjusted p-value or FDR was assigned to each VOC, and the significant VOCs were selected based on these computed values. This measure is easily interpretable by experimental biologists and lab users, as the values are well defined in [0, 1]. In other words, a lower p-value indicates a more informative VOC and vice-versa. The random resampling procedure, i.e., bootstrap method, used in the Boot-SVM-RFE can eliminate the spurious and arbitrary association among the VOCs while detecting marker signatures [13,16,19]. Further, the Boot-SVM-RFE is more robust and does not require any distributional assumptions of the VOC data to obtain the distribution of test statistic(s). After selecting a few VOC markers, we trained the machine learning (i.e., SVM) based classification models to establish their relevance in lung cancer patients' classification. Our study found that instead of using all the VOCs, one can focus on few (e.g., 3, 5, or 7) marker VOCs to detect lung cancer patients more robustly and accurately. This approach will be less time-consuming and require lesser resources to detect lung cancer based on exhaled breath samples using FT-ICR-MS technology. Here, we have narrowed down our search to a few VOCs instead of focusing on all the VOCs present in breath samples, which in turn will save the time and cost of the experiments. 
The major advantages of our experimental approach are as follows. First, the microreactors are designed with thousands of micropillars to provide higher capture rates of carbonyl compounds (e.g., VOCs) in breath samples [10][11][12]. Second, chemo-selective capture of carbonyl compounds through amino-oxy reactions simplifies the spectrum of compounds to be quantitated. Both steps are well established in the literature and can be further used to detect other diseases, including COVID-19, using exhaled breath samples. Third, a statistically efficient technique, Boot-SVM-RFE [19], was used to detect the markers using the VOC molecular concentration data. Fourth, the VOC signatures were validated in silico by training machine learning-based classification models. In other words, our experimental approach includes VOC molecular data generation through microreactor-based FT-ICR-MS technology and analysis of the data using efficient statistical and machine learning techniques.

Conclusion
Early diagnosis of lung cancer is a key factor for increasing its survival rates among the patients. The analysis of carbonyl compounds present in exhaled breath of the patients is a promising non-invasive tool for the diagnosis of lung cancer at the early stage. In other words, the presence of metabolic carbonyl organic compounds in exhaled breath can play a vital role in the early detection of lung cancer patients, which will surely enhance the survival of the patients. Hence, the identification and characterization of the key metabolic VOCs using proper analytical approach and further using them in developing the classification models will play an important role in the quick and non-invasive detection of lung cancer. Therefore, in this study, we proposed an experimental approach to identify the key VOCs through the Boot-SVM-RFE technique and used these key VOCs to distinguish the patients with lung cancer from the benign pulmonary disease and healthy control classes.
Our analytical findings indicated that fewer VOCs can be used for lung cancer detection with sufficient classification accuracy. For instance, seven common key VOCs, including C4H8O2, C13H22O, C11H22O, C2H4O2, C7H14O, C6H12O, and C5H8O, were sufficient to detect the lung cancer patients with a mean classification accuracy of 92% and a standard error of 0.03. In this study, we used a linear basis function in the Boot-SVM-RFE technique; it will be interesting to study non-linear basis functions or tree-based models (e.g., random forests) to capture non-linear associations among the VOCs. Further, our experimental and analytical approach of VOC quantitative analysis in breath samples may be extended to other diseases, including COVID-19 detection. Besides, the analytical method used in this study can be applied to high-throughput gene expression studies, including RNA-sequencing and single-cell RNA-sequencing, to select gene/bio-markers for the identification of cancer patients or cell types. Also, the reported experimental approach can be applied to other genetic studies based on urine, saliva, and blood bio-assays to predict phenotypes by identifying organic compound-based bio-signatures.
Supporting information S1 File. Supplementary documents S1-S5. This file contains supporting documents from S1 to S5. Supplementary Document S1: SVM training for two class classification; Supplementary