A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement | PLOS One

Table 1.

Pathway categories and their inclusion.

Fig 1.

The first 3 lines of the KEGG-SMILES dataset.
The KEGG-SMILES dataset, as created by Baranwal et al., was a tab-separated text file. The first column contained the SMILES representation of each metabolite; the second column contained the numeric identifiers (0 to 10 inclusive) of the pathway categories assigned to that metabolite, separated by commas.
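
As an illustration of the file layout described in this caption, the following is a minimal sketch of a reader for such a file. It is not the authors' code: the function name `parse_kegg_smiles` and the two sample rows are hypothetical, and the label values in the sample are invented for illustration rather than taken from the dataset.

```python
import io


def parse_kegg_smiles(handle):
    """Yield (smiles, labels) pairs from a tab-separated file.

    Assumed layout (from the caption): column 1 is a SMILES string,
    column 2 is a comma-separated list of category identifiers in 0..10.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(fields)}")
        smiles, raw_labels = fields
        labels = sorted({int(token) for token in raw_labels.split(",")})
        if any(label < 0 or label > 10 for label in labels):
            raise ValueError(f"line {line_number}: label outside 0..10")
        yield smiles, labels


# Hypothetical sample rows; the labels are invented, not taken from the dataset.
SAMPLE = "CCO\t0,3\nC(C(=O)O)N\t5\n"
records = list(parse_kegg_smiles(io.StringIO(SAMPLE)))
```

Reading the labels into a sorted list of integers makes it straightforward to compare the label sets of repeated SMILES strings, which is the kind of check the duplicate-instance tables below summarize.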

Table 2.

Availability of code and data for past publications.

Table 3.

Reported model performance of past publications.

Table 4.

Examples of duplicate instances.

Table 5.

Dataset statistics for the original dataset compared to the de-duplicated dataset.

Table 6.

Counts of unique entries according to number of occurrences and number of pathway labels.

Table 7.

Unique entry occurrence compared to label count.

Table 8.

Label counts of the non-duplicate entries and duplicate entries compared to the original dataset.

Table 9.

Results of statistical tests.

Table 10.

Model performance per dataset.