Evaluation of data processing pipelines on real-world electronic health records data for the purpose of measuring patient similarity

doi:10.1371/journal.pone.0287264

Fig 1.

Network architecture for the two types of AE used in the analysis.

(a) Shallow autoencoder with a single hidden layer and (b) deeper autoencoder with two hidden layers. In both cases, x_i represents elements of the input and output vectors, and z_i represents elements of the learned lower-dimensional representation.

Fig 2.

Flow diagram of the MCA data analysis pipeline.

MCA on all features after numerical feature categorisation.

Fig 3.

Flow diagram of the MCA/PCA data analysis pipeline.

MCA on categorical and PCA on numerical features.

Fig 4.

Flow diagram of the MCA/PCA/PCA data analysis pipeline.

MCA on categorical and PCA on numerical features followed by PCA on resulting components of both methods.

Fig 5.

Flow diagram of the AE data analysis pipeline.

Feature scaling followed by application of a multi-layer autoencoder neural network.

Table 1.

Characteristics of the COPD cohort (overall, training and test sets) used for measuring patient similarity.

Mean and standard deviation are presented for continuous features.

Table 2.

Features ranked by degree of influence on the resulting similarity according to each pipeline.

Bolded entries indicate numerical features.

Table 3.

Comparison of clustering results for the COPD cohort, with different k values presented above and below the diagonal, as indicated by the backslash “\”.

The values on the diagonal represent averaged results obtained from 10% bootstrapped sampling and re-clustering.

Fig 6.

Patient similarity rankings as assigned by two clinician raters for each pipeline.

Table 4.

Summary of evaluation results, including importance of features, cluster tendency and clinical expert evaluation for all four data processing pipelines.

Table 5.

Rater agreement metrics.

Raw (%) agreement and kappa coefficient calculated on the basis of 1–2 as well as binary (best/worst) rankings.
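The two agreement metrics named in this caption can be illustrated with a minimal sketch. It assumes the kappa coefficient is Cohen's kappa for two raters, which the caption does not state explicitly. The function names and the example rankings are hypothetical and are not taken from the study.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = raw_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical binary (best/worst) rankings from two raters, for illustration only.
rater_1 = ["best", "worst", "best", "worst", "best", "worst"]
rater_2 = ["best", "worst", "worst", "worst", "best", "best"]

print(f"raw agreement: {raw_agreement(rater_1, rater_2):.1%}")
print(f"kappa: {cohen_kappa(rater_1, rater_2):.3f}")
```

Raw agreement ignores agreement that would occur by chance, which is why a kappa coefficient is usually reported alongside it.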
