Fast and automated biomarker detection in breath samples with machine learning

Volatile organic compounds (VOCs) in human breath can reveal a large spectrum of health conditions and can be used for fast, accurate and non-invasive diagnostics. Gas chromatography-mass spectrometry (GC-MS) is used to measure VOCs, but its application is limited by expert-driven data analysis that is time-consuming, subjective and may introduce errors. We propose a machine learning-based system to perform GC-MS data analysis that exploits the pattern recognition ability of deep learning to learn and automatically detect VOCs directly from raw data, thus bypassing expert-led processing. We evaluate this new approach on clinical samples and with four types of convolutional neural networks (CNNs): VGG16, VGG-like, densely connected and residual CNNs. The proposed machine learning methods were shown to outperform the expert-led analysis by detecting a significantly higher number of VOCs in a fraction of the time while maintaining high specificity. These results suggest that the proposed novel approach can help the large-scale deployment of breath-based diagnosis by reducing time and cost, and increasing accuracy and consistency.


Introduction
A typical human breath sample is thought to contain thousands of volatile organic compounds (VOCs), which are the products of metabolic, catabolic and exogenous exposure processes occurring continuously in the human body [1]. This makes breath a particularly interesting medium for metabolomics, which describes an individual's specific phenotype and health status by measuring the metabolites present in a biological sample and changes in their expressions [1,2]. Breathomics [3] (i.e., breath metabolomics) can bring insight into all the metabolic processes in the body and thus provide comprehensive information about the organism's condition, additionally enabling non-invasive and rapid sample acquisition.

Fig 1. (b) The total ion current (TIC) chromatogram plots, along the RT dimension, the cumulative abundance of ions (intensity). Each peak generally represents one specific VOC, although superposition of peaks also occurs [1]. (c) GC-MS abundance matrix presented as a heat map, with the x-axis being the mass-to-charge ratio (m/z) and the y-axis the retention time. https://doi.org/10.1371/journal.pone.0265399.g001

PLOS ONE
to strike the right balance between the inclusion of all the relevant VOCs and the omission of spectral noise and signal artefacts. The operator-subjective nature of GC-MS data processing, combined with the data complexity, has the potential to introduce errors and variability into results on top of the observed biological variations from one subject to another, and thus the results may not be reproducible [17]. Additionally, expert-operated deconvolution is often a labour-intensive, time-consuming procedure [18]; the processing time of a single breath sample is estimated by experts as 60 to 120 minutes. The limitations outlined above call for better algorithms for GC-MS data processing, now possible by exploiting recent advances in machine learning and deep learning. A number of machine learning and deep learning applications in biomedical studies have been reported in the literature [19,20]. Several studies, such as [4,7,21], successfully applied machine learning in the area of breathomics. These reported applications, however, do not process GC-MS breath data directly but analyse a list of selected VOCs, provided by expert-led preprocessing, to classify patients and controls. Consequently, they allow for high-level GC-MS data classification by black-box-like decisions without justifications, rather than detection of particular VOCs of interest. Thus, these methods provide limited information about an individual's metabolomic condition. Moreover, such data processing (e.g., [4,7,21]) involves treating an
entire GC-MS sample as a single data point; in the case of clinical samples, obtaining large-scale datasets desired for machine learning applications may be a challenging and demanding task. On the other hand, to the best of the authors' knowledge, there is a lack of studies applying machine learning directly to raw GC-MS breath data to detect VOCs of interest.
This study introduces a new approach to VOC detection in GC-MS samples: we propose the application of convolutional neural networks (CNNs) [22] to learn to detect VOCs in breath samples automatically and directly from raw GC-MS data, thereby bypassing the current labour-intensive, time-consuming and operator-subjective data preprocessing steps. CNNs [23] are a popular type of deep learning algorithm that is particularly effective in image analysis (e.g., [24][25][26]). CNNs can autonomously learn useful features directly from low-level data, e.g., pixels [27], and construct high-level features without human intervention. CNNs can also exploit geometrical properties of the data and thus adapt well to image-based tasks [28]. Presenting GC-MS data as an abundance matrix enables one to view a GC-MS sample as an image (Fig 1c). Therefore, we propose a new approach, exploiting recent advancements in CNN applications for image analysis and pattern recognition, to learn to recognise ion patterns directly from the raw abundance matrix. Ion patterns derived from specific compounds, although noisy, present unique features that distinguish them. A recognised ion pattern is effectively a recognised VOC, which in turn could be a biomarker of a given physical condition [29].
The promise of such an approach was first demonstrated in [30]. In that study, CNNs were shown to have considerably better performance than support vector machines [31] and shallow neural networks [32]. However, the study reported a high number of false positives and targeted only 8 VOCs, thus leaving open the questions of detection reliability and scalability. Nevertheless, the findings from [30] justify the application of CNNs over shallow methods in the considered problem.
The CNN-based approach proposed here has two main stages (Fig 2, bottom): stage 1, data and CNN model preparation; and stage 2, analysis of raw GC-MS samples. In stage 1, expert knowledge is exploited to create a dataset of target VOCs and their corresponding ion patterns in the raw GC-MS data. A CNN architecture is then trained on this dataset to learn to recognise ion patterns specific to the target VOCs. In stage 2, new raw breath samples are analysed to detect the targeted VOCs quickly and automatically. Firstly (in phase A, Fig 2), an entire GC-MS sample is scanned by the trained CNN network to provide a list of recognised patterns. Secondly (in phase B), domain-specific constraints are used to derive a final list of detected VOCs.
The robustness and scalability of the proposed CNN-based method were investigated on a dataset of 120 GC-MS samples and a set of 30 target VOCs. A wide range of target compounds generates various challenges: several target VOCs produce similar ion fragmentation patterns, while others have close RT positions and thus overlap, which may hinder their accurate discrimination. In this study we tested different types of CNNs: VGG16 and VGG-like networks [33], residual neural networks [34] and densely connected convolutional networks [35] with different configurations, to compare their performance and select the most efficient strategy for GC-MS data processing.
The novel CNN-based approach proposed here provides a comprehensive system for fast and automated detection of any set of VOCs in raw GC-MS data, outperforming current expert-led deconvolution-based methods. The analysis of raw GC-MS breath data may reduce human-related errors and has the potential to detect compounds of very low abundance. Consequently, the proposed novel approach showed the ability to correctly detect VOCs missed by the current methods, while improving specificity and significantly reducing processing time. The system may support experts in formulating much more accurate hypotheses about the VOCs related to specific health conditions. Moreover, by significantly accelerating the VOC detection process, the CNN-based method allows for much quicker validation of hypotheses on new GC-MS samples. To the best of the authors' knowledge, this is the first study that proposes a comprehensive system to reveal VOC ion patterns directly from raw GC-MS samples, with high sensitivity and specificity.

GC-MS sample dataset
Breath samples were obtained in a clinical study from 25 participants with different types of cancer receiving radiotherapy treatment. Four breath samples (prior and 1, 3, and 6 hours post radiation) were collected from each participant along with one environmental sample.
Each clinical sample was processed with GC-MS and stored in a data file containing both metadata and the abundance matrix A ∈ R^(R×411), where 411 is the number of measured m/z channels and R is the number of mass spectrum measurements performed over retention time. As the instrumental scanning rate was approximately 6.25 Hz, about one hour of processing gave approximately 22500 RT points of measurement. The first dimension of the abundance matrix A is called here the RT dimension, whereas the second one is called the m/z dimension. Subsequently, all GC-MS data were processed with the current expert-led methods to identify the VOCs contained in the sample and their RT positions. This process was completed for 120 clinical samples (see Methods).
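As an illustrative sketch (not the authors' processing code), the relationship between the scan rate, the run time and the shape of the abundance matrix can be expressed as follows:

```python
import numpy as np

SCAN_RATE_HZ = 6.25    # approximate instrumental scanning rate
RUN_TIME_S = 3600      # roughly one hour of GC-MS processing
N_MZ_CHANNELS = 411    # m/z channels measured (40-450 m/z, unit resolution)

# Number of mass-spectrum measurements (rows of A) over retention time
R = int(SCAN_RATE_HZ * RUN_TIME_S)   # -> 22500 RT points

# Abundance matrix A: rows form the RT dimension, columns the m/z dimension
A = np.zeros((R, N_MZ_CHANNELS))
print(A.shape)  # (22500, 411)
```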
The GC-MS sample dataset was divided into training and testing sets in the proportion 82/38: the training set contains 65 breath samples and 17 environmental samples associated with 17 randomly chosen participants; the testing set contains 30 breath samples and 8 environmental samples associated with the remaining 8 participants. Both raw and expert-processed GC-MS samples from the training set were used to generate a VOC dataset (described later). The VOC dataset was applied in stage 1 (Fig 2) for training the CNN models, selected through cross-validation (CV) [36]. The testing GC-MS sample set was used for the assessment of the proposed CNN-based system's performance. Precisely, the 38 raw samples from the testing set were used as inputs for automated VOC detection in stage 2, whereas their expert-processed equivalents formed the ground truth for the system evaluation.

Target VOCs
A set of 30 target VOCs (Table 1) was designed to contain compounds commonly found in breath, including alkanes, aldehydes, ketones, furans, siloxanes and sulfur-containing compounds [15]. The confidence in the expert-led identification process for these VOCs varies according to potential confounding factors. In particular, confounding factors include low concentration of a VOC in a sample, which results in a low signal-to-noise ratio in the GC-MS data. Propionic acid is an example of a target VOC with a relatively low concentration in the clinical samples from the dataset. Other confounding factors are overlapping mass spectra caused by the co-elution of VOCs from the GC column, and similar mass spectra among VOCs. For example, Octane and Hexanal frequently overlap, additionally sharing three of their five top ions. What is more, octane produces a mass spectrum profile highly similar to that of another target VOC, 2,4-dimethylheptane. These VOCs also elute at relatively close RT locations and thus may be difficult to differentiate with the current expert-led processing [15]. As a consequence of the abovementioned challenges, the process of expert-led VOC identification cannot be guaranteed to be error-free, resulting in a possibly noisy-labelled VOC dataset (stage 1) and ground truth (stage 2). The S1 Table in S1 File provides, for each target VOC, the details of its ion pattern, mean RT location and mean concentration in the GC-MS sample dataset, and the number of VOC instances reported in breath and environmental samples, respectively.

VOC dataset
The proposed CNN-based approach relies on the construction of a dataset of target VOC patterns, derived from raw GC-MS samples for the network training. Each VOC appears on the TIC chromatogram (Fig 1b) as one peak (sometimes overlapped) over a small segment of RT
axis, corresponding to a specific range of retention times when the VOC was eluting from the GC column. A sub-matrix of abundance matrix A, encompassing such an RT range, contains the ion pattern for that specific VOC. Thus, we defined a VOC data point as a matrix s ∈ R^(δ×411), where the retention time window size was selected as δ = 80 (see Methods).
Data points corresponding to all target VOC instances, previously identified by expert-led processing in the training GC-MS samples, were extracted to form the labelled VOC dataset for network training. In addition to data points representing target VOCs, a negative class of data points was created from randomly chosen sub-matrices ŝ of A that did not contain target VOCs. The size of the negative class was set equal to the total number of data points representing the target VOCs, i.e., half of the generated dataset. This was motivated by a compromise between the significant imbalance the VOC dataset would suffer if all the negative examples appearing in the training GC-MS samples were included, and the coverage of a wide variety of ion patterns that the negative class represents. In total, from the 82 training GC-MS samples, 3,736 VOC data points were extracted, with a minimum of 22 and a maximum of 82 data points in each class of target compounds (S1 Table in S1 File).
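The data-point extraction and negative-class sampling described above can be sketched as follows (a minimal illustration: the window size δ = 80 is from the text, while the RT positions and the exclusion margin are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((22500, 411))           # stand-in abundance matrix
DELTA = 80                             # RT window size from the paper

def extract_window(A, rt_index, delta=DELTA):
    """Cut the sub-matrix s in R^(delta x 411) centred on an RT index."""
    start = max(0, min(rt_index - delta // 2, A.shape[0] - delta))
    return A[start:start + delta, :]

# Positive data points: windows at expert-reported VOC RT positions
voc_rt_positions = [5120, 8433]        # hypothetical expert annotations
positives = [extract_window(A, i) for i in voc_rt_positions]

# Negative class: random windows that avoid all target VOC positions
def sample_negative(A, voc_positions, delta=DELTA):
    while True:
        i = int(rng.integers(delta, A.shape[0] - delta))
        if all(abs(i - p) > delta for p in voc_positions):
            return extract_window(A, i)

negatives = [sample_negative(A, voc_rt_positions) for _ in range(len(positives))]
```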

Data augmentation and normalisation
Data augmentation, i.e., methods for enlargement of the dataset by insertion of unobserved data examples [37], is known to benefit machine learning models where data points are scarce [37]. Often these new examples are constructed from the observed ones, by the introduction of some variations to data points, which do not change their underlying distribution (here, VOC ion patterns) and classes.
Data augmentation was applied to the VOC dataset to increase the robustness of the training. We introduced two methods for VOC dataset augmentation: translation along RT and intensity variation (see Methods). Combined collectively, these two methods generated 100 augmented variants of each data point, enlarging the VOC dataset to 373,600 data points.
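A minimal sketch of the two augmentation operations (the shift range and the intensity-scaling range here are illustrative assumptions, not the paper's exact settings, which are given in the Methods):

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(s, max_shift=8, intensity_range=(0.8, 1.2)):
    """Translate a VOC data point along RT by a small random shift and
    rescale its intensity; the underlying ion pattern is preserved."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(s, shift, axis=0)   # axis 0 is the RT dimension
    scale = rng.uniform(*intensity_range)
    return shifted * scale

s = rng.random((80, 411))                 # one VOC data point
augmented = [augment(s) for _ in range(100)]  # 100 variants per data point
```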

Deep learning models
We assessed the learning and inference capabilities of the following recently developed deep CNN architectures, which achieved state-of-the-art performance in several image classification challenges: VGG16 [33], residual CNNs (ResNets) [34] and densely connected CNNs (DenseNets) [35]. We also tested smaller VGG-like networks with 4, 6 and 8 layers.
Along the m/z axis the values of a GC-MS abundance matrix are spatially only weakly correlated, as opposed to the data typically used in image-based tasks. Therefore, we tested the effectiveness of 1D filters (see Methods) and compared them with typically used 2D filters to investigate the suitability of such a network variation to capture the specific nature of correlation in GC-MS data. The details on the tested CNN architectures, their implementations as well as parameter settings are given in S3.1-S3.11 Tables in S1 File.
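The difference between the two filter types can be illustrated with a naive convolution (an assumption-laden sketch: a (k, 1) kernel slides only along the RT axis and leaves the 411 m/z channels uncoupled, whereas a (k, k) kernel mixes neighbouring m/z channels):

```python
import numpy as np

def conv_valid(x, kernel):
    """Naive 'valid' cross-correlation of a 2D input with a 2D kernel."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

x = np.random.default_rng(2).random((80, 411))  # one VOC data point
out_1d = conv_valid(x, np.ones((3, 1)) / 3)     # 1D filter along RT only
out_2d = conv_valid(x, np.ones((3, 3)) / 9)     # standard 2D filter
print(out_1d.shape, out_2d.shape)               # (78, 411) (78, 409)
```

The 1D filter preserves all 411 m/z channels independently, matching the weak spatial correlation along m/z; the 2D filter also aggregates across neighbouring m/z channels.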

CNN model training
To evaluate the CNN models' ability to learn ion patterns from the VOC dataset, and to select the best-performing configurations of the networks, five-fold cross-validation (CV) was applied. CV is a common technique used for model assessment and hyperparameter selection [36]. Each split of the VOC dataset for CV was performed at the participant level. Table 2 presents the classification performance of the models in the CV stage. All tested models achieved high accuracy (i.e., the proportion of data points for which the network label matched the ground truth), which confirms that ion patterns can be learned by CNNs from VOC data points derived directly from raw GC-MS data. The results also validate the hypothesis that 1D filters are more effective than standard 2D filters. Among the tested augmentation strategies, the best model performances were obtained with the fully-augmented dataset. Shallower architectures, i.e., VGG-like, performed no worse than the deeper networks. The VGG-like network with 8 layers and 1D filters (VGG-8-1D) reported the highest accuracy.
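A participant-level split, keeping all data points of one participant in the same fold, can be sketched as follows (illustrative, not the study's exact code):

```python
import random

def participant_folds(participants, k=5, seed=0):
    """Assign participant IDs to k folds, so that all data points of a
    participant land in the same fold (no participant-level leakage)."""
    ids = sorted(set(participants))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

# 17 training participants, each contributing 4 breath samples
participants = [f"P{i:02d}" for i in range(17) for _ in range(4)]
folds = participant_folds(participants, k=5)

# Sanity check: no participant appears in two folds
all_ids = [p for f in folds for p in f]
assert len(all_ids) == len(set(all_ids)) == 17
```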

The best performing hyperparameter configuration for each investigated CNN architecture type, i.e., VGG-8-1D, DenseNet-40-1D and ResNet-34-1D, was used to create the system for automated VOC detection in raw GC-MS samples (Fig 2, stage 2). Before deployment to stage 2, the models were retrained on the entire, fully-augmented VOC dataset. S8 Fig in S1 File shows a learning curve for the VGG-8-1D model.

Automated samples analysis and VOC detection
The trained networks were employed in stage 2 (Fig 2) to analyse the 38 raw GC-MS samples from the testing set. The proposed stage 2 analysis was composed of two phases: (A) scanning of each clinical sample with the network; (B) identification of VOC detections from the scan results.
Phase A-Scanning of a breath sample. The abundance matrix A of a clinical sample was scanned along the RT dimension. Precisely, the network was fed consecutive normalised sub-matrices s_i ∈ R^(δ×411) of A, starting at retention point i (for each i in the RT dimension for that sample). Phase A classifies each s_i into one of the 31 possible classes. Thus, scanning a single GC-MS sample required approximately 22500 queries of the network. The process outputs two sequences: a sequence L_A that contains approximately 22500 class labels, and a sequence T_A that contains the classification confidence given by the network for each respective label (see Methods).
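Phase A can be sketched as follows (the trained CNN is replaced by a stand-in classifier, and a shortened abundance matrix is used for brevity; δ = 80 and the 31 classes follow the text):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((2000, 411))   # shortened stand-in abundance matrix
DELTA = 80                    # RT window size delta

def mock_cnn(window):
    """Stand-in for the trained CNN: returns (label, confidence) over
    31 classes (30 target VOCs + 1 negative class)."""
    probs = rng.dirichlet(np.ones(31))
    return int(np.argmax(probs)), float(np.max(probs))

L_A, T_A = [], []
for i in range(A.shape[0] - DELTA + 1):   # slide along the RT dimension
    s_i = A[i:i + DELTA, :]               # sub-matrix starting at RT point i
    label, conf = mock_cnn(s_i)
    L_A.append(label)                     # sequence of class labels
    T_A.append(conf)                      # corresponding confidences
```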
Phase B-VOC detection. To obtain a list of target VOCs detected in each clinical GC-MS sample, the pair of phase A output vectors for that sample was analysed. The following general properties of the GC-MS samples are considered in the phase B analysis:

(i). Along the RT dimension, a VOC is measured multiple times consecutively for the duration of the VOC elution (typically about 6 seconds).
(ii). The order of elution of VOCs from the GC column is usually constant across different samples, as indicated in Table 1.
(iii). Each VOC elutes from the GC column only once in a GC-MS process.
According to (i), a detection d_j ∈ D_A was defined as a consecutive sub-sequence of L_A of length at least γ with constant label value j (duration rule, see Methods). Subsequently, according to (ii), the system ignored detections that were far from the expected order (order rule), giving a set D̃_A ⊆ D_A. Finally, according to (iii), if multiple detections of one VOC occurred in any sample, the system selected the one with the highest detection confidence, derived from T_A, expressing the confidence of the model in such a detection (uniqueness rule); this gives a set D̄_A ⊆ D̃_A.
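The three phase B rules can be sketched as follows (γ, the class labels and the expected elution order are illustrative values, not the paper's exact settings):

```python
GAMMA = 5      # duration rule: minimum run length (illustrative)
NEGATIVE = 0   # label of the "no target VOC" class (assumed)

def duration_rule(L_A, T_A, gamma=GAMMA):
    """Runs of >= gamma identical non-negative labels become detections."""
    detections, i = [], 0
    while i < len(L_A):
        j = i
        while j < len(L_A) and L_A[j] == L_A[i]:
            j += 1
        if L_A[i] != NEGATIVE and j - i >= gamma:
            detections.append({"voc": L_A[i], "rt": i, "conf": max(T_A[i:j])})
        i = j
    return detections

def order_rule(detections, expected_order):
    """Discard detections far from the expected elution order (here:
    keep only detections forming a non-decreasing order subsequence)."""
    rank = {v: r for r, v in enumerate(expected_order)}
    kept, last = [], -1
    for d in sorted(detections, key=lambda d: d["rt"]):
        if rank[d["voc"]] >= last:
            kept.append(d)
            last = rank[d["voc"]]
    return kept

def uniqueness_rule(detections):
    """Keep only the highest-confidence detection of each VOC."""
    best = {}
    for d in detections:
        if d["voc"] not in best or d["conf"] > best[d["voc"]]["conf"]:
            best[d["voc"]] = d
    return list(best.values())

# Toy label/confidence sequences: VOC 3 detected twice, VOC 7 once
L_A = [0]*10 + [3]*7 + [0]*5 + [3]*6 + [0]*8 + [7]*9 + [0]*4
T_A = [0.5]*len(L_A); T_A[12] = 0.9
dets = uniqueness_rule(order_rule(duration_rule(L_A, T_A), [3, 7]))
print([d["voc"] for d in dets])  # [3, 7]
```

The uniqueness rule keeps the first occurrence of VOC 3 (confidence 0.9) and discards the weaker duplicate.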
As a result, for each abundance matrix A, stage 2 delivered a list of detected VOCs along with their location along the RT axis and the detection confidence. A graphical representation of the output is provided in

Evaluation of the automated sample analysis and VOC detection
The performance of the system was assessed with two approaches: (1) including both the VOC localisation (along RT dimension) and its classification; and additionally, to report the system specificity, (2) including only the VOC presence in a sample.
(1)-VOC localisation and classification. VOC detection involves both localisation of the compound in the GC-MS sample and its classification. Therefore, each VOC detection in a sample was identified as a true positive (TP) if both its label and RT position matched the ground truth for that sample, and as a false positive otherwise. Similarly, if a specific target VOC was not detected at the RT position reported by the ground truth, it was identified as a false negative. Note that a retention time window of any size in which neither the system nor the expert identified any target VOC could be considered a true negative: therefore true negatives were not measured here.
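The matching of detections against the ground truth can be sketched as follows (the RT tolerance is an illustrative assumption):

```python
RT_TOL = 40   # maximum RT offset for a detection to match ground truth

def match_detections(detections, ground_truth, rt_tol=RT_TOL):
    """detections / ground_truth: lists of (voc_label, rt_position).
    Returns (TP, FP, FN) counts for one sample."""
    tp, fp = 0, 0
    unmatched = list(ground_truth)
    for voc, rt in detections:
        hit = next((g for g in unmatched
                    if g[0] == voc and abs(g[1] - rt) <= rt_tol), None)
        if hit is not None:
            tp += 1
            unmatched.remove(hit)   # each ground-truth VOC matched once
        else:
            fp += 1                 # wrong label or mismatching RT
    fn = len(unmatched)             # ground-truth VOCs the system missed
    return tp, fp, fn

gt = [(3, 5120), (7, 8400), (12, 15000)]   # hypothetical expert annotations
dets = [(3, 5130), (7, 9000), (9, 2000)]   # hypothetical system output
print(match_detections(dets, gt))          # (1, 2, 2)
```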
System-derived ground truth corrections. To gain further insight into the results, we examined all the detections d_j ∈ D_A (i.e., before applying the order and uniqueness rules) that were not reported in the ground truth or had mismatching RT positions. The range of retention times in the ground truth was calculated for each VOC (see Methods). We observed that in most cases, the detection d_j was reported by the system in the expected (compatible) narrow RT range, characteristic of that VOC j. The upper bound probability of a random false positive detection occurring within such a precise and restricted RT range was calculated as 4% (see Methods). Accordingly, it is highly likely that such a false positive is actually a true positive and reveals an error in the (noisy-labelled) ground truth, i.e., a VOC occurrence missed in the expert-led analysis. False positives with a compatible RT value (i.e., characteristic for that specific VOC detected) were named tentative true positives (TTP), whereas false positives that are not TTP are called certain false positives (FP); see Fig 6. In some cases, the VOCs detected by the system and identified as TTP were, in fact, reported by the ground truth, but at different RT positions (Fig 6). Accepting the high likelihood that tentative true positives are true positives, and the constraint that a VOC appears only once in a sample, the corresponding false negatives at the RT positions reported by the expert were identified as tentative true negatives (TTN). Note that only FNs with a respective TTP were verified; in fact, more examples identified as FN may be correctly not detected by the system. False negatives that were not TTN were called semi-certain false negatives (FN). Further details on the concepts of tentative true and false positives, and certain and semi-certain false positives are provided in the Methods. Table 3 reports the performance achieved on the testing set by the system with different CNN models.
Additionally, the result intersection, i.e., detections consistent (in terms of label and RT) among all the models, was evaluated. Results are presented in two forms: according to the expert-derived ground truth and according to the system-derived corrections. The total quantities of TP, TTP, FP (certain), FN (semi-certain) and TTN across all samples are given. We report the system sensitivity and mean average precision score (mAP), which is an evaluation metric commonly used in the object detection domain [38] (see Methods). Fig 6 shows a graphical example, in which TP, TTP, FN and TTN were identified.
(2)-VOC presence. To compute the system's specificity, we considered only the question of the presence of each VOC in each sample and ignored its RT position. Since each VOC may appear in a sample at most once, such an approach leads to a binary classification problem. True negatives (TN) were thus well defined here as VOCs not detected in a sample by both the system and the expert-led processing. False positives were defined as VOCs detected by the system and not reported by the ground truth. In particular, this definition differs from the one used in the system-derived ground truth corrections. Similarly as above, false positives reported in their respective RT ranges were identified as tentative true positives (TTP*), and false positives that were not TTP* were called certain false positives (FP*). Table 4 reports the system specificity in the analysis of the 38 clinical samples from the testing set. Results are presented according to the expert-derived ground truth and according to the system-derived corrections. The total quantities of TN, FP* (certain) and TTP* across all samples are given along with the system specificity. The full tables of detections for each CNN model and their intersection, presented per each target VOC and per each testing sample, are reported in S4.1-7.2 Tables in S1 File.
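Because the presence question is binary per VOC and per sample, the specificity computation reduces to a few set operations, sketched here with hypothetical detection sets:

```python
def specificity(system_present, expert_present, all_vocs):
    """Per-sample specificity over the presence/absence of each target
    VOC: TN / (TN + FP), where TN are VOCs absent for both the system
    and the expert, and FP are VOCs only the system reported."""
    tn = sum(1 for v in all_vocs
             if v not in system_present and v not in expert_present)
    fp = sum(1 for v in all_vocs
             if v in system_present and v not in expert_present)
    return tn / (tn + fp) if tn + fp else 1.0

all_vocs = set(range(1, 31))   # the 30 target VOCs
system = {3, 7, 9}             # hypothetical system detections
expert = {3, 7}                # hypothetical expert-derived ground truth
print(specificity(system, expert, all_vocs))  # 27/28, about 0.964
```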

Discussion
The results show that all tested system configurations, with various CNN models employed, achieved high performance in the detection of target VOCs in the clinical samples, reporting high sensitivity and specificity (when admitting the noisy-labelled ground truth). The results demonstrate that ion patterns can be effectively learnt directly from the raw GC-MS data. The
problem of VOC detection includes various challenges such as detecting VOCs of low intensities and distinguishing overlapping VOCs and those with similar ion patterns (Table 1). Nevertheless, the system performed comparably well for all 30 VOCs from the dataset (S4.1, S5.1, S6.1 Tables in S1 File). Consequently, we claim that the CNN-based system proposed here allows bypassing time-consuming and labour-intensive expert-led data processing in real-world targeted GC-MS data analysis. We found strong evidence that the proposed CNN-based system may outperform human experts. In our tests, the system detected 17% to 23% more occurrences of the VOCs than the expert-led deconvolution-based method (TTP, Table 3). As many as 138 TTPs were reported by each of the tested models. Additionally, the system revealed possible errors, i.e., incorrectly labelled VOCs (TTN). There are two possible reasons for the CNN-based system's outperformance of the current expert-led processing. The current method involves complex preprocessing steps, based on operator-subjective spectral deconvolution [17]. The proposed system instead analyses raw GC-MS data and thus bypasses multiple steps and possibly suboptimal choices of parameters. Another reason why the proposed workflow delivers more VOC detections may be its potential to detect compounds of low intensities. Additional analysis of the results (S1, S3.1, S4.1, S5.1 Tables in S1 File) showed indeed that over 50% of TTPs (for VGG-8-1D; over 40% for the rest of the models) were reported for the 25% of target VOCs with the lowest average concentration (Table 1). Most TTPs, 27, were reported for Propanoic acid (VOC 6), a compound of low concentration. Interestingly, before augmentation, the VOC dataset had only 22 training examples of VOC 6. Despite that, the CNN-based system was able to detect this compound correctly: no FPs or FNs were reported, and all 27 TTPs were reported by all the models.
In our tests, the proposed system improves on the state-of-the-art performance of the current deconvolution-based process.
The proposed CNN-based system requires expert knowledge to be trained (stage 1), but consequently, it can detect VOCs autonomously and significantly faster than a human-driven procedure. The training stage in the proposed approach, requiring about 23 to 80 hours (depending on the CNN model), has to be performed only once to be able to scan new samples at a rate of as low as just around 2 minutes per sample (see Table 5 in Methods). Interestingly, the fastest among the tested architectures, i.e., VGG-8-1D, is also the one with the highest sensitivity. Consequently, the proposed system may support experts in much faster validation of hypotheses regarding compounds related to specific health conditions on new GC-MS samples. What is more, those hypotheses may be more accurate due to the system's ability to provide a more comprehensive list of VOCs than the deconvolution-based methods.
Very deep networks are often thought to have more discrimination capabilities than shallower ones, but in this application, the VGG-like networks with reduced depth performed no worse than other very deep models ( Table 2). This suggests that the detection and classification of VOC ion patterns can be effectively performed with modest-depth networks, which are also less resource-intensive. All tested networks achieved higher performance when implemented with 1D filters, adapted to the nature of GC-MS data.
The proposed CNN-based approach provides specific information on individual VOCs in breath for further analysis and diagnosis, rather than a high-level sample classification (e.g., [4,7,21]). As opposed to black-box and end-to-end machine learning diagnostic systems, the proposed system quickly produces accurate lists of VOCs, thus enabling a transparent and explainable pipeline for breathomics-based diagnosis. This approach avoids the common problem of having limited datasets of clinical samples: each of many target VOC occurrences in a sample makes a data point; in this study, 82 clinical samples and 30 target VOCs resulted in a dataset with 3,736 unique data points and 373,600 augmented data points.
The proposed system does not use RT values, which are a major source of variation among measurements with different instruments (or even the same instrument on different days). Instead, the proposed system analyses the patterns of the VOCs, which remain substantially the same when processed with similar instruments, in terms of type and properties such as resolution [29]. Therefore, the CNN-based system may be transferable to data obtained from a GC-MS instrument other than the one from which the training dataset originates. Further tests are required to assess such transferability precisely.
The proposed approach exploits GC-MS general properties and thus potentially extends beyond breath data. Further studies may extend tests to GC-MS data from a large variety of domains, e.g.: detection of CBRN (chemical, biological, radiological, nuclear) biomarkers in breath, saliva and skin for casualty triage [39]; tracking organic pollutants in water for environment monitoring [40]; detection of accelerants in fire debris for criminal forensics [41]; detecting drug ingredients in urine samples for law enforcement or sports anti-doping analysis [42,43]; analysis of chemical composition of the planets' atmosphere in astrochemistry [44]; as well as in chemical engineering [45], food, beverage and perfume analysis [46][47][48] and medicine [8].
The proposed new approach to GC-MS data analysis, exploiting the application of deep learning, has the potential for extensive development in the future. The growing adoption of breath analysis as a diagnostic technology will also increase the number of available GC-MS datasets: a larger number of VOC patterns, reflecting more of the possible variations in the data points, may benefit the accuracy of deep neural network training. The use of GPU computing and dedicated hardware can help process the large amount of data collected through GC-MS; additionally, its rapid development, seen in recent years, may reduce the processing time even further.
Future studies can extend the proposed CNN-based approach to also measure VOC intensities. The proposed system is currently limited to detecting the presence of VOCs of interest, but not their abundances. However, in real-life scenarios, additional analysis by experts to determine peak intensities and then concentrations of particular VOCs, e.g., biomarkers related to a specific disease, may become necessary only if the system reveals their presence in a sample. In fact, the proposed workflow delivers, along with a list of detected VOCs, their RT positions in the sample, which can significantly facilitate and accelerate the quantification of the compounds.
In summary, the proposed CNN-based system delivers a faster, more accurate and scalable method for automated targeted analysis of raw GC-MS data than the current state-of-the-art expert-led processing. The proposed approach has a significant potential to contribute to the development of breath analysis as a diagnostic platform to detect various diseases quickly, efficiently, and reliably.

Ethical approval
The study was approved by the South East Scotland Research Ethics Committee 01 (16/SS/0059), and all clinical staff were trained and proficiency-tested for breath analysis prior to the start of patient recruitment. Informed and written consent was given by all participants.

Breath sample collection
Breath samples were collected from 25 participants before radiotherapy and at 1, 3, and 6 hr after. A Respiration Collector for In Vitro Analysis ReCIVA™ device (Owlstone Medical, Cambridge, UK) was used for breath sample collection. Clean air was provided from room air filtered with an activated-carbon scrubber and HEPA filter, at a flow of 35 dm³ min⁻¹. (The air-supply unit was built and tested by the Centre for Analytical Science, Loughborough, UK). The total sample volume of breath was set to 1000 cm³ with a sampling duration cap of 900 s. Tenax®/Carbotrap 1TD hydrophobic adsorbent tubes (Markes International Ltd, Llantrisant, UK) were used. All materials were conditioned and sterilised before use to reduce exogenous VOC artefacts. Clinical staff were trained and proficiency-tested prior to clinical breath sampling. Environmental and air-supply samples were collected with each set of breath samples. Samples were sealed and stored at ca. 4 °C immediately after collection and transported to the Loughborough Centre for Analytical Science within 48 hr [49]. Samples were dry-purged on receipt with 120 cm³ of purified nitrogen at a flow rate of 60 cm³ min⁻¹. Toluene-D8 (0.069 ng) and trichloromethane-d (0.28 ng) internal standards were spiked into the sample during the dry-purge process using a six-port valve. All samples were then sealed and stored at 4 °C prior to analysis.

GC-MS processing of clinical samples
Thermal desorption (Unity-2, Markes International) interfaced to a GC (Agilent, 7890A) coupled to a quadrupole mass spectrometer (Agilent, MS 5977A) was used for the analysis of all clinical samples; see S9 Table in S1 File for operating details. The ion channels from 40 m/z to 450 m/z were measured with unit resolution. The instrumental scanning rate was approximately 6.25 Hz, which over an approximately one-hour run gave approximately 22500 RT points. The size of the derived abundance matrix of each sample is R × 411, where R ≈ 22500.
The samples were analysed over a one-year period (Sep 2016 to Sep 2017) as part of a wider multi-centre clinical study campaign. Instruments were serviced whenever statistical process controls indicated a z-score >3 or more than 3 consecutive z-scores >2. The frequency of the service interventions was determined by the quality and contamination levels of the samples returned from the clinics. The column was replaced once during this phase of the campaign, in March 2017.

Abundance matrix
For each sample, GC-MS processing produces an abundance matrix $A \in \mathbb{R}^{R \times 411}$ (see S2 Fig in S1 File for an example). Let $z_i \in \mathbb{R}^{411}$, $i = 1, \ldots, R$, be the mass spectrum (i.e., the intensities across the consecutive ion channels derived by MS) at a particular retention time point $r_i$, so that $A = [z_1, \ldots, z_R]^{T}$. For each mass spectrum $z_i$, its corresponding RT point $r_i$ is specified by the function $RT: i \mapsto r_i$.
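The abundance matrix is naturally represented as a two-dimensional array. A minimal numpy sketch, assuming the raw data has already been parsed into an array; the helper names `mass_spectrum` and `retention_time` are illustrative, not from the paper:

```python
import numpy as np

# Dimensions matching the paper: R retention-time points and
# 411 ion channels (m/z 40..450 at unit resolution).
R = 22500
rng = np.random.default_rng(0)
A = rng.random((R, 411))          # stand-in abundance matrix A in R^{R x 411}

def mass_spectrum(A, i):
    """z_i: intensities across all 411 ion channels at RT index i."""
    return A[i]

def retention_time(i, scan_rate_hz=6.25):
    """RT(i): map a row index to a retention time in seconds,
    given the ~6.25 Hz scanning rate reported in the text."""
    return i / scan_rate_hz

z_100 = mass_spectrum(A, 100)
```

Row index 375 then maps to 60 s of retention time, consistent with roughly 22500 rows over an hour-long run.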

Expert-led GC-MS data processing
GC-MS data denoising, baseline correction and feature deconvolution were carried out (AnalyzerPro, SpectralWorks, UK) by a highly qualified and experienced chemical analyst. As a result, 350 to 500 VOC features per sample were recorded. Features were aligned to correct for retention time variation, and the VOCCluster algorithm [15] was used to cluster all features into groups. Each feature was assigned an identifier in the format BRI-$m/z_1 m/z_2 \ldots m/z_n$, where BRI is the retention index of the VOC breath feature and $m/z_1, \ldots, m/z_n$ are the nominal masses of the ion fragments, in decreasing order of abundance, needed to uniquely define the feature. The processing time of a single breath sample is estimated at 60 to 120 minutes.
To produce the ground truth for this study, this process delivered a list of VOCs with Level 2 chemical identification [50] and their corresponding positions in the matrix of raw data: startRT, peakRT (also denoted RT) and endRT, the indices along the retention time axis at which the compound was measured to start, peak and end its release from the GC column.
The process could not be completed for 5 breath samples due to technical issues, resulting in 120 expert-processed GC-MS samples.

VOC dataset structure
The VOC dataset was extracted from the abundance matrices of the 82 samples in the training set, using the ground truth for the 30 target VOCs. A VOC data point $\hat{s}$ has size 411 along the m/z dimension, which ensures inclusion of the entire mass spectrum. The size along the RT dimension, $\delta$, was computed with the aim of capturing the entire elution process for each target VOC instance. Precisely, the maximum duration of compound elution, $\max(endRT - startRT)$, was measured across the 1868 occurrences of target VOCs reported by expert-led processing in the training samples. This was computed as ≈9.8 seconds, corresponding to ≈61 time steps on the RT axis. To allow for translation-based data augmentation, the size $\delta$ was increased by 19, resulting in a final $\delta = 80$. Each data point $\hat{s} \in \mathbb{R}^{\delta \times 411}$ contains the mass spectra of a VOC peak centred, over the RT dimension, on $\mu = \frac{startRT + endRT}{2}$, the middle point of the VOC peak shape. Note that $\mu$ does not necessarily coincide with the peakRT value, as VOC peaks can be skewed rather than symmetric; the value of $\mu$ depends on the VOC instance.
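The windowing described above can be sketched in numpy as follows. The helper name `extract_voc_datapoint` is hypothetical, and integer floor division is one reasonable reading of the midpoint $\mu = (startRT + endRT)/2$:

```python
import numpy as np

DELTA = 80  # window size along RT, as derived in the text (61 + 19)

def extract_voc_datapoint(A, start_rt, end_rt, delta=DELTA):
    """Cut a delta x 411 window centred on the middle of the VOC peak.

    mu = (start_rt + end_rt) // 2 is the middle of the peak shape;
    it need not coincide with peakRT for skewed peaks.
    """
    mu = (start_rt + end_rt) // 2
    lo = mu - delta // 2
    s_hat = A[lo:lo + delta]
    return s_hat

rng = np.random.default_rng(1)
A = rng.random((2000, 411))                      # stand-in abundance matrix
s_hat = extract_voc_datapoint(A, start_rt=480, end_rt=520)
```

Here the peak midpoint is row 500, so the extracted window spans rows 460 to 539.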
The VOC dataset for CNN training was unbalanced: the groups representing target VOCs were unequal in size, since each of the considered VOCs did not necessarily occur in every GC-MS sample from the training set. Moreover, because negative examples appear much more often than target VOCs when scanning entire raw breath samples and cover a broad variety of ion patterns, the negative class was created to constitute half of the VOC dataset. This aligns with common practice in the object detection domain, where the negative class usually dominates, with a ratio of up to 3:1 [51, 52]. The size of each class is reported in S1 Table in S1 File.

Data augmentation
In the proposed approach, we devised and tested two augmentation methods that maintain the underlying structure of a VOC pattern.
Translation along RT. The VOC data point $\hat{s}$ is centred on the VOC peak, so that the middle point of the VOC peak shape, $\mu = \frac{startRT + endRT}{2}$, is located at $\delta/2$. Twenty VOC data points were created from $\hat{s}$ by shifting the extraction point in the abundance matrix $A$ from $-9$ to $+10$ steps along the RT axis (0 indicates $\hat{s}$ itself). Such data augmentation presents the VOC pattern at different positions in the data point (top, centre, bottom) and represents variations in the pattern's location in the subsequent sub-matrices of $A$ seen by the network while scanning entire raw GC-MS breath samples.
Intensity variation. In the VOC data point, the intensities along the RT interval corresponding to the VOC pattern were varied to simulate variations in VOC concentration (see Fig 7). The background (i.e., values along RT not containing the pattern) remains unchanged. In this process, it was important to maintain the VOC ion pattern, i.e., the relative ratios along the m/z dimension, while increasing peak intensity along the RT axis. This was achieved by multiplying each mass spectrum $z_{startRT}, \ldots, z_{endRT}$ (corresponding to the VOC location) by the value of a Gaussian-shaped function over the RT interval, with amplitude controlled by a random value $r$ in the range $(0, 0.1)$. This augmentation step was repeated to obtain 4 additional data points for each data point derived with translation along RT. Therefore, the VOC dataset for CNN training was augmented 100 times (i.e., $20 \times 5$). The intensity values in each data point from the augmented dataset were normalised to the range [0, 1]. The original VOC dataset had 3,736 data points. Following translation along RT, the dataset for CNN training consisted of 74,720 data points (partially-augmented dataset). When intensity variation was also applied, the dataset contained 373,600 data points (fully-augmented dataset). The results presented in Table 2 indicate that both augmentation methods benefit the system: the best performance was achieved on the fully-augmented dataset.
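Both augmentation steps can be sketched in numpy. The exact Gaussian function was lost from the extracted text, so the bump $1 + r\,e^{-(i-c)^2/(2\sigma^2)}$ below is an assumed form; what it preserves is the paper's key requirement that each row (mass spectrum) is scaled by a single scalar, leaving the ion ratios along m/z intact:

```python
import numpy as np

DELTA = 80  # window size along RT

def translations(A, mu, shifts=range(-9, 11), delta=DELTA):
    """Translation along RT: re-extract the delta-row window at shifted
    positions so the peak appears at different offsets (20 views)."""
    lo = mu - delta // 2
    return [A[lo + k: lo + k + delta] for k in shifts]

def intensity_variation(s_hat, start, end, r):
    """Scale the rows covering the VOC peak (indices start..end within the
    window) by an assumed Gaussian-shaped bump of amplitude r; background
    rows are left unchanged."""
    out = s_hat.copy()
    idx = np.arange(start, end + 1)
    c = (start + end) / 2.0
    sigma = max((end - start) / 4.0, 1.0)
    bump = 1.0 + r * np.exp(-((idx - c) ** 2) / (2.0 * sigma ** 2))
    out[idx] = out[idx] * bump[:, None]   # one scalar per row: m/z ratios kept
    return out

rng = np.random.default_rng(2)
A = rng.random((500, 411))                           # stand-in abundance matrix
views = translations(A, mu=250)                      # 20 translated views
aug = intensity_variation(views[9], start=30, end=50, r=0.05)
```

`views[9]` is the unshifted window (shift 0); applying four intensity variants to each of the 20 translations yields the 100-fold augmentation described above.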

CNN filter adaptation
To adjust the deep learning models specifically to VOC detection, we adapted the CNN filters to the characteristics of GC-MS data. In CNNs, filters specify the local receptive fields, i.e., the regions of the image (or feature maps) visible to the convolutional and pooling layers of the network at a time; this is an essential concept that enables CNNs to capture local geometric spatial correlations in the data [22]. With GC-MS data, such local correlation occurs only in the retention time dimension. Along this dimension, the abundance of different m/z values increases and decreases, depicting peaks as the VOCs exit the GC column. On the other hand, the abundance values across different m/z channels also correlate, as particular ions make up the VOC pattern (Fig 1a and 1b). However, the values along this dimension represent independent ion channels and are only weakly correlated locally, so their correlation cannot be captured by small local filters.
One of the hypotheses in this study is that the convolutional and pooling layers in the network do not need to be two-dimensional, as is usual for image classification. Thus, two types of filters were tested: a traditional two-dimensional filter and a specific one-dimensional filter along the RT axis, covering only this dimension. The filter sizes in the particular network layers are given, along with the detailed architecture of each network, in S3.1-S3.11 Tables in S1 File.

Phase A output
The last layer in all tested networks is a fully connected layer with softmax activation, where the softmax function $\sigma: \mathbb{R}^{K} \to \mathbb{R}^{K}$ is defined as $\sigma(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}$. As a result, for each analysed raw GC-MS sample with abundance matrix $A \in \mathbb{R}^{R \times 411}$, the CNN scanning in phase A produces two sequences, $L_A$ and $T_A$. The $L_A$ values indicate the classification labels of the subsequent sub-matrices $s$ of $A$: $L_A = (c(s_1), \ldots, c(s_N))$. The $T_A$ values indicate the classification confidence given by the network for each respective label: $T_A = (t(s_1), \ldots, t(s_N))$, where $N = R - \delta + 1 \approx 22500$ is the number of data points in $A$.
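The scanning step can be sketched as follows, with a stand-in callable in place of a trained CNN (the `model` interface is an assumption; the paper's networks output class scores for each $\delta \times 411$ window):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scan(A, model, delta=80):
    """Slide a delta-row window over A; collect labels L_A and confidences T_A."""
    N = A.shape[0] - delta + 1            # N = R - delta + 1
    L, T = np.empty(N, dtype=int), np.empty(N)
    for i in range(N):
        p = softmax(model(A[i:i + delta]))
        L[i] = int(np.argmax(p))          # c(s_i): predicted label
        T[i] = float(np.max(p))           # t(s_i): confidence of that label
    return L, T

rng = np.random.default_rng(3)
A = rng.random((200, 411))                          # stand-in abundance matrix
dummy = lambda s: np.array([s.sum(), -s.sum()])     # stand-in for a CNN
L_A, T_A = scan(A, dummy)
```

With R = 200 rows and δ = 80 this produces N = 121 label/confidence pairs, one per window position.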

Duration rule: VOC detection
Each maximal sub-sequence of $L_A$ with constant label value $j$ and length equal to or greater than a constant $\gamma$ indicates one detection $d_j$ of the VOC $j$, i.e., $d_j = (s_l, \ldots, s_{l+n-1})$, where $c(s_i) = j$ for $i \in [l, l+n-1]$, $c(s_{l-1}) \neq j$, $c(s_{l+n}) \neq j$, $n \geq \gamma$ and $j > 0$. The value of $\gamma$ was derived from the width setting of the data point $s$: for the augmentation, the width of $s$ was enlarged by 19 pixels with respect to the maximum peak width (see Data augmentation). Hence, during the scanning of the abundance matrix $A$, the entire shape of a target VOC peak is seen by the model at least $\gamma = 20$ times. All the detections of any class from the abundance matrix $A$ (i.e., sample) constitute the set $D_A$.
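The duration rule reduces to finding maximal constant runs in $L_A$; a plain-Python sketch (label 0 is the negative class, so runs of 0 never become detections):

```python
def detections_from_labels(L, gamma=20):
    """Duration rule: maximal constant runs of a positive label j with
    length >= gamma become detections (j, first_index, last_index)."""
    dets, i, N = [], 0, len(L)
    while i < N:
        j, l = L[i], i
        while i < N and L[i] == j:        # extend the maximal run of label j
            i += 1
        if j > 0 and i - l >= gamma:      # positive class and long enough
            dets.append((j, l, i - 1))
    return dets

# 25 consecutive windows labelled 3 -> one detection;
# the 5-window run of label 7 is too short and is dropped.
L_example = [0] * 5 + [3] * 25 + [0] * 10 + [7] * 5 + [0] * 3
dets = detections_from_labels(L_example)
```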

RT of detection
Let the retention time value corresponding to a specific data point $s = [z_k, \ldots, z_{k+\delta-1}]$ be defined as the retention time value of its middle ion channel, i.e., $RT(s) = r_{k+\delta/2}$. The retention time values corresponding to the first and last data points $s_i$ of a detection $d_j$ are called, respectively, the start detection, $sRT(d_j)$, and the end detection, $eRT(d_j)$. We denote the resulting detection interval by $DI(d_j) = [sRT(d_j), eRT(d_j)]$.

Order rule
The elution of particular VOCs from the GC column can be expected in a certain order (small variations may occur), related to the compounds' chemical properties. Therefore, to derive a reliable list of detected VOCs, detections that fall outside the expected order are removed.
Let $\preceq$ be a linear order on the set $D_A$ of detections, such that $d_j \preceq d_{j'}$ whenever VOC $j$ is expected to elute from the GC column no later than VOC $j'$; detections inconsistent with this order are discarded.

Detection confidence
For each detection $d_j$, the detection confidence $\mathcal{T}(d_j)$ was calculated as the maximum value of the moving average, with window size $\gamma = 20$, of the classification confidence values $t(s)$ over the consecutive data points $s$ of $d_j$, i.e., $\mathcal{T}(d_j) = \max_{i} \frac{1}{\gamma} \sum_{m=i}^{i+\gamma-1} t(s_m)$. The value $\gamma = 20$ was derived as explained for the duration rule above.
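The confidence computation is a windowed moving average; a numpy sketch (the fallback for runs shorter than $\gamma$ is an added safeguard, not from the paper, since the duration rule already guarantees $n \geq \gamma$):

```python
import numpy as np

def detection_confidence(t, gamma=20):
    """Max of the moving average (window gamma) of the confidences t(s)
    over the consecutive data points of a detection."""
    t = np.asarray(t, dtype=float)
    if len(t) < gamma:
        return float(t.mean())            # safeguard: plain mean for short runs
    kernel = np.ones(gamma) / gamma
    # 'valid' mode: one average per full gamma-sized window
    return float(np.convolve(t, kernel, mode="valid").max())
```

For example, a detection whose last 20 windows all have confidence 1.0 receives $\mathcal{T}(d_j) = 1.0$ regardless of earlier, weaker windows.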

Uniqueness rule
Since each VOC may be present in a GC-MS sample at most once, if multiple detections $d_j^1, \ldots, d_j^{K_j}$ of one VOC $j$ occur, only the one with the highest detection confidence value, $\max\{\mathcal{T}(d_j^k) \mid k = 1, \ldots, K_j\}$, is kept in the set of detections, for $j = 1, \ldots, 30$.
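The uniqueness rule keeps, per label, only the most confident detection; a plain-Python sketch using (label, start, end) tuples and a parallel list of confidences:

```python
def apply_uniqueness(dets, conf):
    """Keep, per VOC label, only the detection with the highest confidence.

    dets: list of (label, start, end) tuples; conf: parallel list of T(d_j).
    """
    best = {}
    for d, t in zip(dets, conf):
        j = d[0]
        if j not in best or t > best[j][1]:
            best[j] = (d, t)
    return [d for d, _ in best.values()]

# Two detections of VOC 3: the more confident one (0.9) survives.
dets = [(3, 5, 29), (3, 100, 130), (7, 40, 70)]
kept = apply_uniqueness(dets, conf=[0.6, 0.9, 0.8])
```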

Phase B output
As its output, phase B produces a set $L$ of detections $d_j \in D_A$ represented as 4-tuples of label, start detection, end detection and detection confidence: $L = \{(j, sRT(d_j), eRT(d_j), \mathcal{T}(d_j))\}$.

RT range of a VOC
Each target VOC in the GC-MS dataset is narrowly distributed over a specific range of the RT dimension. From the processed data files, the RT range of occurrences was extracted for each target VOC $j$ across all its instances in the GC-MS dataset reported by expert-led processing, i.e., $RTrange_j = [\min\{peakRT_j\}, \max\{peakRT_j\}]$. Importantly, such RT ranges depend on the GC column. During the course of our study, the GC column was changed, and we thus extracted two RT ranges for each VOC, $RTrange^1_j$ and $RTrange^2_j$, valid respectively before and after the column change. S1 Table in S1 File presents the RT ranges for each target VOC.

Tentative true positives
Let $\tilde{d}_j$ be the detections in the set $D_A$ that were not reported in the ground truth at the RT position reported by the system for that sample $A$. Note that we consider the set $D_A$, i.e., before reducing the number of detections with the order and uniqueness rules, as their application changes the detection distribution. We observed that most (60% to 74%, depending on the network) detections $\tilde{d}_j \in \bigcup_A D_A$ were reported by the system in the RT range of the VOC $j$ (accordingly before or after the column change, $i = 1$ or $2$), i.e., $DI(\tilde{d}_j) \cap RTrange^i_j \neq \emptyset$. Let $I$ denote an interval whose position is chosen uniformly at random within the RT dimension (of length $R$). Then the probability of the intersection of the interval $I$ with a given $RTrange^i_j$ can be bounded by $P_{max} = \frac{|RTrange^i_j| + \max_{\tilde{d}_j \in \bigcup_A D_A} |DI(\tilde{d}_j)|}{R}$, where the maximum detection-interval length depends on the network used. $P_{max}$ is the upper-bound probability of a random (false) detection $\tilde{d}_j \in D_A$ of a target VOC $j$ at the specific RT range in the sample $A$. The computed probabilities $P_{max}$ are all very low in comparison to the actual proportion of such detections (i.e., 60% to 74%, depending on the network). Therefore, it is highly likely that the detections $\tilde{d}_j$ within a compatible RT range are not random false detections but correctly detected VOCs. Such detections are named tentative true positives (TTP).
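The displayed bound was lost in extraction; the standard interval-overlap argument gives $P = (|RTrange| + \ell)/(R - \ell)$ for an interval of length $\ell$ placed uniformly at random, which a quick Monte Carlo check confirms (all numbers below are illustrative, not the paper's):

```python
import numpy as np

def intersect_prob(R, ell, a, b):
    """P that a length-ell interval, with left endpoint uniform in [0, R-ell],
    overlaps the fixed range [a, b]: the start must land in [a-ell, b]."""
    return min((b - a + ell) / (R - ell), 1.0)

rng = np.random.default_rng(4)
R, ell, a, b = 22500, 60, 4000, 4100        # illustrative values only
starts = rng.uniform(0, R - ell, size=200_000)
hits = np.mean((starts <= b) & (starts + ell >= a))
analytic = intersect_prob(R, ell, a, b)
```

With these illustrative values the probability is well under 1%, matching the paper's observation that a random detection is very unlikely to land inside a narrow RT range by chance.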

Tentative true negatives
Several VOCs that were not detected at the specific RT points reported by the ground truth (preliminarily marked as false negatives) were detected as tentative true positives at different RT positions within their compatible RT range. Because tentative true positives are highly likely to be true positives, given the noisy-labelled character of the ground truth and the constraint that a VOC appears at most once in a sample, we conclude that such examples may indicate errors in the expert-led processing. Such examples are called tentative true negatives.

Average precision score
The system performance of each network was assessed by the average precision score (AP) for each target VOC and the mean average precision (mAP). For each target VOC $j$, a list of all system detections $d_j \in \bigcup_A D_A$ of the VOC in all clinical samples from the testing set was derived from the stage 2 output and sorted in descending order of the detection confidence scores $\mathcal{T}(d_j)$ (for the Intersection of the models, it was sorted by the mean value of the detection confidence scores of the tested models). For the first $n$ elements of this list, the precision function Precision($n$) was defined as the proportion of TP. The recall function Recall($n$) was defined as the proportion of all detections in the ground truth that appear in the first $n$ elements of the system detection list. As usual, the AP score for each target VOC was calculated as the integral under the graph of precision against recall (S4.1-S7.30 Figs in S1 File). mAP was calculated as the average of the AP values across all target VOCs. The AP values for each target VOC for each tested model, along with the respective graphs of the precision and recall functions, are given in S1 File.
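The AP computation described above (precision and recall over a confidence-sorted detection list) can be sketched as follows, using the standard non-interpolated sum form of the precision-recall integral:

```python
import numpy as np

def average_precision(confidences, is_tp, n_ground_truth):
    """AP: area under precision(n) vs recall(n), with detections sorted by
    descending confidence and recall normalised by all ground-truth instances."""
    order = np.argsort(confidences)[::-1]          # sort by confidence, descending
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)  # Precision(n)
    # Each true positive contributes precision-at-that-rank times the
    # recall step 1/n_ground_truth (non-interpolated AP).
    return float(np.sum(precision * tp) / n_ground_truth)

# Three detections, two of which are TP, against two ground-truth instances.
ap = average_precision(confidences=[0.9, 0.8, 0.7], is_tp=[1, 0, 1],
                       n_ground_truth=2)
```

For this toy list the precisions at the two TP ranks are 1 and 2/3, giving AP = (1 + 2/3)/2 = 5/6.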

Sensitivity and specificity
The system's sensitivity was computed for the expert benchmark (i.e., TTP are considered FP, TTN are considered FN) as $\text{sensitivity} = \frac{TP}{TP + FN + TTN}$, and for the system-derived correction benchmark (i.e., TTP are considered TP, TTN are considered TN) as $\text{sensitivity} = \frac{TP + TTP}{TP + TTP + FN}$. The system's specificity was computed for the expert benchmark (i.e., TTP are considered FP) as $\text{specificity} = \frac{TN}{TN + FP^* + TTP}$, and for the system-derived correction benchmark (i.e., TTP are considered TP) as $\text{specificity} = \frac{TN}{TN + FP^*}$, where $FP^*$ denotes the false positives other than the tentative true positives.

Resources
All experiments were run on a server running Linux Ubuntu with 20 cores, 128 GB RAM and NVIDIA Tesla K80 GPU cards (Table 5).