Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset

doi:10.1371/journal.pone.0283094

Table 1.

Descriptive statistics of the study population, grouped by age.

Only age, sex, height and weight were ultimately used in the machine learning models as predictive variables. All variables were, however, used for imputing missing data and constructing synthetic datasets.

Fig 1.

Histogram comparison for each variable comparing the aggregate demographic characteristics of the real training dataset (n = 2408) against synthetic dataset A (n = 2408).

Fig 2.

Histogram comparison for each variable comparing the aggregate demographic characteristics of the real training dataset (n = 2408) against synthetic dataset B (n = 4816).

Table 2.

Statistical analysis comparing synthetic data tables to the real training dataset (n = 2408).

Presented are the propensity score mean-squared error (pMSE) and the standardised ratio of the propensity score mean-squared error.
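
For readers unfamiliar with these two measures, the following is a minimal sketch of how they are commonly computed, not the authors' code. It assumes the real and synthetic tables have been stacked and that some classifier has already produced, for every row, a fitted probability that the row is synthetic. The function names are hypothetical, and the null expectation used for the ratio is the large-sample formula usually quoted for a logistic-regression propensity model, which is an assumption about the model rather than something stated in the source.

```python
from typing import Sequence


def pmse(propensity_scores: Sequence[float], n_synthetic: int) -> float:
    """Mean squared deviation of the propensity scores from the synthetic share.

    propensity_scores: one fitted probability per row of the stacked
    real + synthetic table that the row is synthetic (any classifier).
    n_synthetic: number of synthetic rows in the stacked table.
    """
    n_total = len(propensity_scores)
    c = n_synthetic / n_total  # proportion of synthetic rows
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


def expected_null_pmse(n_params: int, n_total: int, n_synthetic: int) -> float:
    """Large-sample expected pMSE when real and synthetic rows are
    indistinguishable. ASSUMPTION: the propensity model is a logistic
    regression with n_params coefficients, intercept included.
    """
    c = n_synthetic / n_total
    return (n_params - 1) * (1 - c) ** 2 * c / n_total


def pmse_ratio(propensity_scores: Sequence[float], n_synthetic: int,
               n_params: int) -> float:
    """Standardised ratio: observed pMSE divided by its null expectation."""
    n_total = len(propensity_scores)
    observed = pmse(propensity_scores, n_synthetic)
    return observed / expected_null_pmse(n_params, n_total, n_synthetic)
```

A pMSE of 0 means the classifier cannot tell synthetic rows from real ones. The ratio puts the observed value on a common scale, so values close to 1 indicate that the synthetic table is about as distinguishable from the real data as a correctly generated one would be.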

Table 3.

Results of the machine learning models, trained on real or synthetic datasets.

Each was tested on the same test dataset (real data). None of the p-values were <0.05.
