Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine

doi:10.1371/journal.pcbi.1009014

Fig 1.

(A) The Galaxy-ML platform provides all the tools necessary to define a learner, train it, evaluate it, and visualize its performance. (B) Screenshot of the Galaxy tool for creating a gradient boosted classifier. (C) Galaxy workflow to create a learner using a pipeline, perform a hyperparameter search, and visualize the results.

Table 1.

Software libraries integrated into Galaxy-ML and their applications.

Fig 2.

Pairwise performance comparisons for use cases 1 and 2.

Use case 1 pairwise comparisons for classification tasks on 164 structured biomedical datasets [25] show that decision tree forests perform best (panel A) and that hyperparameter optimization can improve the performance of most models (panel B). Use case 2 results for prediction using regression (panel C) and classification (panel D) show that ensemble approaches that use stacking perform best, though linear-based gradient boosting also performs well. In panels A, C, and D, heatmaps show the percentage of datasets for which the model listed along the row outperforms the model listed along the column. For instance, in panel A, XGBoost outperforms Gradient Tree Boosting (GTB) from scikit-learn on 38% of datasets, GTB outperforms XGBoost on 11% of datasets, and they perform equivalently on 51% of datasets.
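As a rough illustration of how such a heatmap cell could be computed, the sketch below counts, for each ordered pair of models, the share of datasets on which the row model scores higher than the column model. It is not the authors' procedure: the caption does not state the criterion used to decide that one model "outperforms" another, so a plain score comparison with a tolerance `tol` is assumed here. The function name, model names, and scores are all hypothetical.

```python
def pairwise_win_percentages(scores, tol=0.0):
    """Return {(row_model, col_model): percent of datasets where row beats col}.

    `scores` maps each model name to a list of per-dataset scores, with all
    lists in the same dataset order. A difference of at most `tol` counts as
    a tie (assumed criterion, not taken from the paper).
    """
    models = list(scores)
    n_datasets = len(scores[models[0]])
    table = {}
    for row in models:
        for col in models:
            if row == col:
                continue
            wins = sum(
                1 for a, b in zip(scores[row], scores[col]) if a - b > tol
            )
            table[(row, col)] = 100.0 * wins / n_datasets
    return table


# Hypothetical scores for two models on four datasets.
toy_scores = {
    "model_a": [0.90, 0.80, 0.70, 0.60],
    "model_b": [0.85, 0.80, 0.75, 0.60],
}
table = pairwise_win_percentages(toy_scores)

# Datasets on which neither model wins are ties.
tie_pct = 100.0 - table[("model_a", "model_b")] - table[("model_b", "model_a")]
```

The two directed percentages and the tie percentage for a pair always sum to 100, which is the same relationship as in the caption's example (38% + 11% + 51%).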

Fig 3.

(A) Galaxy workflow to create and train a deep learning model, then use the model for visualization and prediction. (B) Precision-recall curve for a deep neural network trained to predict binding sites for a single transcription factor. (C) Precision-recall curves for a deep neural network that predicts 919 regulatory element binding profiles; each curve corresponds to one regulatory element.
