An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes

doi:10.1371/journal.pone.0204425

Table 1.

Demographics of the patients from this study.

Fig 1.

The general workflow of classifying FAIMS data into diseased or non-diseased classes.

The steps that were explored are shown as dark blue boxes. Variations or specifications of some steps are displayed at the sides. The order in which the steps and approaches were investigated differs from the order shown in the diagram; see the main text for details and for the order in which the pipeline was explored. Briefly, the pipeline was compared when using the data of different sample “runs” either individually or in ensembles. Different forms of discrete wavelet transform (DWT) were considered, as well as a feature exclusion step based on the feature variance. Within the cross-validation cycle, we evaluated three different feature selection methods (filter, wrapper and embedded), as well as a principal component analysis (PCA) step applied after filter selection and the inclusion of the demographic data as features. Finally, we also explored ensemble steps at the level of the classifier model probabilities.
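The variance-based feature exclusion step mentioned in this caption can be illustrated with a minimal sketch. The function name `exclude_low_variance`, the threshold value and the toy matrix are assumptions made for illustration; the caption does not state the threshold or implementation used in the study.

```python
from statistics import pvariance


def exclude_low_variance(X, threshold):
    """Return the column indices whose population variance exceeds `threshold`.

    X is a list of samples, each a list of feature values.
    The threshold is a hypothetical tuning parameter, not a value from the study.
    """
    n_features = len(X[0])
    return [
        j for j in range(n_features)
        if pvariance([row[j] for row in X]) > threshold
    ]


# Toy data: column 0 is constant, column 1 barely varies, column 2 varies clearly.
X = [
    [1.0, 0.50, 3.0],
    [1.0, 0.52, 7.0],
    [1.0, 0.49, 5.0],
]

kept = exclude_low_variance(X, threshold=0.01)
X_reduced = [[row[j] for j in kept] for row in X]
print(kept, X_reduced)
```

Features that are nearly constant across samples carry little information for separating the classes, so dropping them early reduces the work left for the later feature selection step.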

Fig 2.

The recommended pipeline for classifying FAIMS data into diseased or non-diseased classes resulting from this study.

We found that using the “run” 2 data with a 2D wavelet transform performed best among the steps prior to feature selection. The filter method with an nKeep parameter value of 2 performed best, with minimal algorithm run time. Adding the demographic data as features to the wavelet-transformed FAIMS data resulted in a higher AUC score, although the improvement was not statistically significant. However, these data might prove informative in a larger-scale pilot analysis. Overall, no classifier model was found to out-compete the others, and we therefore suggest using all five until further research determines a “clear winner”. See the main text for details and a discussion of our findings.
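A filter-type selection that keeps the top nKeep = 2 features can be sketched as follows. The scoring rule (absolute difference of class means divided by the summed class spreads) is an assumption chosen for simplicity and is not necessarily the statistic used in the study; `filter_select` and the toy data are likewise hypothetical.

```python
from statistics import mean, pstdev


def filter_select(X, y, n_keep=2):
    """Return the indices of the n_keep features with the largest separation score.

    The score is |mean(class 1) - mean(class 0)| / (sd(class 1) + sd(class 0)),
    an illustrative stand-in for whatever univariate test a real pipeline uses.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg)
        score = abs(mean(pos) - mean(neg)) / (spread + 1e-12)  # avoid division by zero
        scores.append((score, j))
    scores.sort(key=lambda item: item[0], reverse=True)
    return sorted(j for _, j in scores[:n_keep])


# Toy training fold: features 1 and 3 separate the classes, 0 and 2 do not.
X_train = [
    [0.1, 5.0, 0.3, 9.0],
    [0.2, 5.2, 0.1, 9.1],
    [0.1, 1.0, 0.2, 2.0],
    [0.2, 1.1, 0.3, 2.2],
]
y_train = [1, 1, 0, 0]

selected = filter_select(X_train, y_train, n_keep=2)

# Apply the indices chosen on the training fold to an unseen sample.
x_test = [0.15, 4.8, 0.2, 8.7]
x_test_reduced = [x_test[j] for j in selected]
print(selected, x_test_reduced)
```

Note that the indices are computed from the training fold only and then applied to the held-out sample, which matches the caption of Fig 1 placing feature selection inside the cross-validation cycle.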

Fig 3.

Data visualisation.

(a) Heat map of FAIMS data for a diabetic patient. (b) Linearised data without wavelet transform. (c) Data with one-dimensional (1D) discrete wavelet transform (DWT). (d-f) Equivalent plots for a member of the control group (volunteer).
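The relationship between panels (a), (b) and (c) can be illustrated with a minimal sketch: a 2D intensity matrix is flattened into a single signal, and one level of a 1D DWT is applied. The Haar wavelet, the single decomposition level and the toy matrix are assumptions made here for illustration; the caption does not specify which wavelet or level was used.

```python
import math


def linearise(matrix):
    """Flatten a 2D intensity matrix row by row into one 1D signal."""
    return [value for row in matrix for value in row]


def haar_dwt_1d(signal):
    """Apply one level of the Haar DWT.

    Returns (approximation, detail) coefficient lists, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even for this sketch")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


# Toy stand-in for a FAIMS intensity matrix (real matrices are far larger).
toy_matrix = [
    [1.0, 1.0, 2.0, 4.0],
    [3.0, 3.0, 0.0, 0.0],
]

signal = linearise(toy_matrix)
approx, detail = haar_dwt_1d(signal)
print(len(signal), len(approx), len(detail))
```

The approximation coefficients summarise local averages and the detail coefficients capture local differences, so flat regions of the signal produce detail coefficients of zero.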

Table 2.

Model performance comparison with the use of different runs.

Table 3.

Model performance comparison between raw FAIMS data and wavelet-transformed FAIMS data.

Table 4.

Model performance comparison using different 2D wavelet transforms.

Fig 4.

Classification model performance for each model across a range of nKeep values.

Error bars show the 95% confidence intervals. Neural Network cannot be used with more than 11 features.

Table 5.

Model performance comparison of PCA implementation.

Table 6.

Feature selection method comparisons.

Table 7.

Feature selection method comparison.

Table 8.

Model performance comparison with run subtraction.

Table 9.

Model performance comparison of noise reduction approaches.

Table 10.

Model performance comparison when using the demographic (demo) variables as features or when using these in addition to the two FAIMS features selected by the filter method.
