Designing machine learning workflows with an application to topological data analysis

doi:10.1371/journal.pone.0225577

Table 1.

Notation for sets, spaces, functions, etc., used throughout the paper.

Table 2.

Common data operations expressed as MLMs.

Fig 1.

Block diagram of Eq 29, showing how a workflow is created for each node.

The first step one-hot encodes the data to embed it into a real-valued vector space. The next step computes the Mapper graph of the data. A model is then trained on each node, and the node models are summed. Finally, a decision function outputs the class prediction.
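
The four stages in this caption can be sketched in plain Python. This is a minimal illustration under loud assumptions, not the paper's implementation: each interval of an overlapping cover is treated as one node (the clustering step and graph edges of a full Mapper construction are omitted), the per-node "model" is a simple class-frequency estimate standing in for a trained classifier, and all function names, parameters, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_encoder(rows):
    """Collect the categories seen in each string-valued column."""
    cats = defaultdict(set)
    for row in rows:
        for j, v in enumerate(row):
            if isinstance(v, str):
                cats[j].add(v)
    return {j: sorted(vs) for j, vs in cats.items()}


def encode(row, cats):
    """One-hot encode string columns; pass numeric columns through."""
    vec = []
    for j, v in enumerate(row):
        if j in cats:
            vec.extend(1.0 if v == c else 0.0 for c in cats[j])
        else:
            vec.append(float(v))
    return vec


def interval_cover(lo, hi, n_intervals, overlap):
    """Equal-width intervals over [lo, hi], each widened so neighbors overlap."""
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2
    return [(lo + i * width - pad, lo + (i + 1) * width + pad)
            for i in range(n_intervals)]


def fit(rows, labels, lens, n_intervals=3, overlap=0.5):
    """Encode, build the cover, and fit one class-frequency model per node."""
    cats = fit_encoder(rows)
    values = [lens(encode(r, cats)) for r in rows]
    cover = interval_cover(min(values), max(values), n_intervals, overlap)
    models = []
    for a, b in cover:
        # Assumption: a node is the set of points whose lens value lies in [a, b].
        counts = Counter(y for v, y in zip(values, labels) if a <= v <= b)
        total = sum(counts.values())
        models.append({c: n / total for c, n in counts.items()} if total else {})
    return cats, cover, models


def predict(row, lens, cats, cover, models):
    """Sum the scores of every node containing the point, then take the argmax."""
    v = lens(encode(row, cats))
    scores = Counter()
    for (a, b), model in zip(cover, models):
        if a <= v <= b:
            for c, p in model.items():
                scores[c] += p
    # Decision function: highest summed score, or None if no node covers the point.
    return max(scores, key=scores.get) if scores else None


def first_coordinate(x):
    """Hypothetical lens (filter) function: the first encoded coordinate."""
    return x[0]


# Toy data invented for illustration only: (age, housing) with a credit label.
rows = [(25, "rent"), (30, "rent"), (35, "own"),
        (60, "own"), (65, "own"), (70, "rent")]
labels = ["bad", "bad", "bad", "good", "good", "good"]

fitted = fit(rows, labels, first_coordinate)
print(predict((28, "rent"), first_coordinate, *fitted))
print(predict((66, "own"), first_coordinate, *fitted))
```

Because the cover intervals overlap, a point can fall in more than one node, which is why the node scores are summed before the decision step. Swapping the class-frequency estimate for a trained classifier changes only the body of the loop in `fit`.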

Table 3.

Descriptive table of patient data from Barnes-Jewish Hospital.

Table 4.

Results for different workflows of logistic regression on hospital readmissions data, with standard deviations over n = 10 runs.

Table 5.

Results for different workflows of SVMs for hospital readmissions data, with standard deviations over n = 10 runs.

Table 6.

Results for different workflows of random forests for hospital readmissions data, with standard deviations over n = 10 runs.

Table 7.

Results for different workflows of AdaBoost classifiers, with standard deviations over n = 10 runs.

Fig 2.

Typical Mapper graph generated from hospital readmissions data.

The nodes are colored by level of readmissions, and a larger node size indicates a higher number of patients in that node.

Fig 3.

Typical Mapper graph generated from the first principal component of the German Credit Data.

Nodes are colored to show the levels of bad credit, and sized by number of data points.

Table 8.

Descriptive table of the German Credit Dataset from the UCI Repository; monetary values are in Deutsche Marks.

Table 9.

Results for different workflows of logistic regression classifiers on German Credit Data, with standard deviations over n = 10 runs.

Table 10.

Results for different workflows of SVM classifiers on German Credit Data, with standard deviations over n = 10 runs.

Table 11.

Results for different workflows of random forest classifiers on German Credit Data, with standard deviations over n = 10 runs.
