Designing machine learning workflows with an application to topological data analysis

In this paper we define the concept of the Machine Learning Morphism (MLM) as a fundamental building block to express operations performed in machine learning such as data preprocessing, feature extraction, and model training. Inspired by statistical learning, MLMs are morphisms whose parameters are minimized via a risk function. We explore operations such as composition of MLMs and when sets of MLMs form a vector space. These operations are used to build a machine learning workflow from data preprocessing to final task completion. We examine the Mapper Algorithm from Topological Data Analysis as an MLM, and build several workflows for binary classification incorporating Mapper on Hospital Readmissions and Credit Evaluation datasets. The advantage of this framework lies in the ability to easily build, organize, and compare multiple workflows, and in the ability to jointly optimize parameters across multiple steps in an application.


Introduction
In this paper, we leverage the concept of morphisms to construct a general structure that represents all types of data operations, including data preparation, feature extraction, and mapping to an outcome of interest. We call these fundamental building blocks Machine Learning Morphisms (MLMs). We describe an entire workflow, from data preparation to final task evaluation, in a single equation featuring a composition of MLMs. This allows data scientists to easily keep track of the workflow, and provides a modular framework for workflow construction and comparison. Our framework was inspired by the minimization framework in statistical learning theory [1] and the composition of functions used by artificial neural networks [2]. The concept of morphisms was referenced in [3] as a mapping between human behavior and machines. However, to the best of our knowledge, expressing a sequence of data operations as a single composition of morphisms is a novel construction. This construction allows us to easily build and compare workflows across a dataset (or collection of datasets), clearly present and interpret the end results, and opens interesting possibilities for joint optimization over the parameters of every step in a workflow. To illustrate this framework, we examine the specific case of binary classification, with imbalanced classes and hybrid continuous-categorical data. We examine the use of the Topological Data Analysis (TDA) algorithm Mapper [4] as an MLM, and build a workflow using Mapper to train an ensemble of classifiers. We compare workflows using Mapper to other workflows in numerical experiments on two real-world datasets. Further, Mapper is generally used for feature selection and qualitative visual analysis, so our approach represents an innovative use of the algorithm.
The rest of this paper is organized as follows. Section II introduces the MLM and defines several operations. In Section IV, we build and compare several workflows on two datasets, and in Section V we present some final thoughts and directions for future work. Table 1 gives the general notation used for spaces, functions, and other structures in the paper.
2 Mathematical nomenclature for machine learning workflows

Statistical learning overview
To provide motivation and set up the class of problems we are interested in, we present a basic overview of the supervised statistical learning problem. In this problem, measurements, observations, or data are gathered with the goal of constructing a mathematical relationship between the data and an outcome of interest. Miescke and Liese [5] define the sample space as a pair (S, U), where S = X × Y is a space where the data X and outcomes Y can be observed, and U is a σ-algebra containing all subsets of S relevant to the problem. Elements s ∈ S are the realizations of a random variable S for some underlying probability space (Ω, F, P_Ω).
Here P_Ω is a probability measure on a set Ω with σ-algebra F, and the mapping S : Ω → S is measurable [5]. Let the set of n realizations of S be denoted S ∈ S × S × ⋯ × S, where we call S × S × ⋯ × S the statistical space. The elements of S are data points or samples.
We will denote the realizations of the outcome of interest as Y ∈ Y × Y × ⋯ × Y = Y^n, and we will call the realizations of the features X ∈ X × X × ⋯ × X = X^n. In [1], Vapnik et al. give three criteria for a supervised learning problem. The first is the underlying distribution of the features, P(X = x). Next is the unknown conditional distribution P(Y = y|X = x). Finally, given a parameter space Θ with elements p, the function F(x; p) used to approximate P(Y = y|X = x) is called a learning machine. The parameters p are learned using a loss function L(y, F(x; p)) as a measure of discrepancy between the predictions F(x; p) and the observed values y. The risk function is

R(p; X, Y, F) = ∫ L(y, F(x; p)) dP(x, y), (3)

where P(X = x, Y = y) = P(Y = y|X = x)P(X = x) is the potentially unknown probability measure on X and Y. This represents the expected value of the loss function across all realizations, and the optimal parameters are defined as:

p* = argmin_{p ∈ Θ} R(p; X, Y, F). (4)

Statistical learning theory has a rich history, and the properties of learning machines are widely studied and researched [6]. For example, Linear Regression [7], Support Vector Machines (SVMs) [8], and Neural Networks [9] all use this optimization framework with different choices of learning machine, loss, and risk functions to train their parameters. In practice the parameters are learned by approximating Eq 3 with an empirical risk function [10] defined on the realizations:

R̂(p; X, Y, F) = (1/n) Σ_{i=1}^{n} L(y_i, F(x_i; p)), (5)

and the optimal parameters are given by:

p* = argmin_{p ∈ Θ} R̂(p; X, Y, F). (6)

Empirical risk is a valid approximation for the expected risk when Eq 5 converges to Eq 3 in probability as n goes to ∞. The conditions for this convergence are discussed thoroughly in [10], [7], and [1]. Learning workflows consist of a sequence of operations acting on the realizations in the statistical space. We distinguish two types of operations in these workflows.
The first set consists of processes such as standardization and sampling, which help guarantee the convergence of the empirical risk function to the continuous risk function defined in Eq 3. The second set of operations learn the parameters of the learning machine F, which can itself be defined as a composition of operations acting on the sample space. These operations are often separated into stages such as preprocessing, feature extraction/selection, model training, etc., and it can be difficult to understand what is actually happening to the original data or if one has built the best workflow for the task at hand.
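The empirical risk minimization in Eqs 5-6 can be sketched concretely. The following NumPy snippet is an illustration of ours, not part of the original formulation: the squared-error loss, the linear learning machine, and the noiseless synthetic data are all assumptions made for the example.

```python
import numpy as np

def empirical_risk(p, X, Y, F, loss):
    """Average loss over the n observed realizations (Eq 5)."""
    return np.mean([loss(y, F(x, p)) for x, y in zip(X, Y)])

# Squared-error loss and a linear learning machine F(x; p) = x . p
loss = lambda y, yhat: (y - yhat) ** 2
F = lambda x, p: x @ p

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # realizations of the features
p_true = np.array([2.0, -1.0])
Y = X @ p_true                  # noiseless outcomes, for illustration only

# Minimize the empirical risk; for squared loss this is least squares,
# so the closed-form solution applies
p_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Here `p_star` attains (approximately) zero empirical risk, recovering `p_true`; with noisy data the minimizer would instead balance the losses across realizations.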
The field of Automated Machine Learning seeks to address this challenge by algorithmically finding the best workflow. Packages such as TPOT [11] and auto-sklearn [12] create wrappers around the popular scikit-learn package in Python, while Auto-WEKA [13] performs hyperparameter optimization over the WEKA platform [14]. TPOT constructs a graphical model of a workflow, and then uses a genetic algorithm to search the space of possible workflows [11], while Auto-WEKA uses a Bayesian framework to iteratively optimize model hyperparameters. Auto-sklearn builds off of Auto-WEKA, but also incorporates model performance on past datasets and creates ensemble models out of 15 classifiers available in scikit-learn [12].
In our approach, we define a fundamental building block based on the ideas of learning machines and risk minimization in statistical learning theory, called the Machine Learning Morphism (MLM), which can be used to systematically build and analyze each step in a machine learning application, and keep track of the data at each step in the process. Since some data operations do not necessarily fit the mathematical definition of functions, for example splitting the data into multiple training and hold-out sets for cross-validation, we use morphisms as the building block rather than functions. This approach allows us to define a workflow as a composition of morphisms acting on the sample space, whose parameters are learned using a risk function acting on the statistical space.

The Machine Learning Morphism
Definition 1. Let:
• X be an input space, where X is part of a sample space (S, U) as defined in Section 2.1,
• Y be an output space,
• F : X → Y be a morphism from input to output spaces, with parameters p ∈ Θ,
• P(p) be a probability distribution on p representing prior knowledge of the parameters, and
• R̂ : Θ → ℝ be an empirical risk function.
Then the Machine Learning Morphism ML : X → Y is defined as ML(x) = F(x; p*), where p* = argmin_{p ∈ Θ} R̂(p; X, Y, F, P(p)), and X ∈ X^n and Y ∈ Y^n are n realizations from the input and output spaces used to learn the parameters. Note 1: The choice of a morphism with an associated empirical risk function and parameter prior represents a unique building block. We consider two MLMs with the same morphism F but different risk functions and/or priors to be different MLMs.
Note 2: For simplicity, we assume all of the risk functions used after this point are empirical risk functions acting on the training data rather than the expected risk function, and will refer to empirical risk functions as "risk functions".
Note 3: We present linear regression, standardization, and other data operations as MLMs in Table 2.
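Definition 1 can be rendered as a small program. The class below is an illustrative sketch of ours (not an implementation from the paper): the parameter prior is folded into the risk function, and the risk is minimized crudely over a finite candidate grid, which stands in for whatever optimizer a real MLM would use.

```python
import numpy as np

class MLM:
    """Sketch of Definition 1: a morphism F whose parameters are learned
    by minimizing an empirical risk over realizations (X, Y).

    `risk` has signature risk(p, X, Y, F); the parameter prior is assumed
    to be folded into the risk (e.g. as a regularization term).
    """
    def __init__(self, F, risk, candidates):
        self.F, self.risk, self.candidates = F, risk, candidates
        self.p = None

    def fit(self, X, Y):
        # crude risk minimization over a finite candidate set of parameters
        self.p = min(self.candidates, key=lambda p: self.risk(p, X, Y, self.F))
        return self

    def __call__(self, x):
        return self.F(x, self.p)

# Example: a one-parameter scaling morphism F(x; p) = p * x with squared risk
F = lambda x, p: p * x
risk = lambda p, X, Y, F: np.mean((Y - F(X, p)) ** 2)
ml = MLM(F, risk, candidates=np.linspace(0, 5, 501))
ml.fit(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
```

After fitting, `ml.p` is (approximately) 2, and applying the MLM to new realizations simply evaluates the learned morphism.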
The Machine Learning Morphism consists of the morphism F, whose parameters have been optimized over an empirical risk function R̂ on the set of realizations from the statistical space. The risk function also performs any necessary operations on the data in the statistical space to ensure convergence in probability to the expected risk function in Eq 3. One example of this is applying sampling techniques such as oversampling, undersampling, SMOTE [15], or ROSE [16] to the training data in a classification task. This helps "learn" a better representation of the training data conditioned on each class.
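As a minimal stand-in for the sampling techniques mentioned above, the sketch below implements naive random oversampling (duplicating minority-class rows until the classes are balanced). It is our own simplification; SMOTE and ROSE instead synthesize new points rather than duplicating existing ones.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Naive random oversampling: duplicate minority-class rows until all
    classes reach the majority count (a simple stand-in for SMOTE/ROSE)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Xs, ys = [X], [y]
    for c, n in zip(classes, counts):
        if n < n_max:
            # sample (with replacement) enough minority rows to balance
            idx = rng.choice(np.where(y == c)[0], size=n_max - n, replace=True)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.vstack(Xs), np.concatenate(ys)
```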
The morphism F represents an operation acting on data, such as data standardization, feature extraction, or regression. In supervised tasks, we assume that the realizations of Y are matched to corresponding realizations of X. In an unsupervised task, or if training outputs are unknown (for example, k-means clustering), the set of realizations Y = ∅ is empty.
Consider the example of least squares linear regression. The input space X is ℝ^m. Denote by X ∈ ℝ^{n×m} the matrix made from n realizations of X. The output space Y is ℝ, and denote the n-dimensional vector of all output realizations as Y. We assume no prior knowledge on the parameters of F, so we use the improper prior P(p) = 1 for all p ∈ ℝ^m. The morphism F is defined as F(x; p) = x · p, the dot product of x and p ∈ ℝ^m, and the empirical risk function R̂ is the sum of squared errors across all of the data points:

R̂(p; X, Y, F) = ‖Y − Xp‖²₂,

where ‖·‖₂ is the standard 2-norm for real-valued vectors. The Machine Learning Morphism is ML(x) = x · p*, where p* = argmin_p ‖Y − Xp‖²₂ = (XᵀX)⁻¹XᵀY is the optimal least squares solution.

Continuing the example of linear regression, many times the data matrix X consists of shifted and scaled columns. The process of shifting and scaling is itself an MLM. The input space is ℝ^m, and the output space is ℝ^m. The parameter p is a vector containing shifting constants c ∈ ℝ^m and scaling constants s ∈ ℝ^m, and again P is the non-informative prior. The morphism is

F(x; (c, s)) = (x − c) diag(s)⁻¹,

where diag(s) is the diagonal matrix whose main diagonal is s. This morphism shifts and scales each element of x. The risk function depends on the type of scaling implemented. For example, the goal of standardization is to transform the data into standard Gaussian random variables, ensuring that each element of a data point lies on the same scale for comparison, statistical tests, or model training later in the workflow. For standardization, one choice of empirical risk is the KL divergence [17] between the distribution of the shifted and scaled data and the standard multivariate normal distribution, whose covariance is built from identity matrices I_n via the Kronecker product. Let x̄ be the vector of column means of X, and let s̄ be the vector of standard deviations of the columns of X. If the data follow a multivariate normal distribution, then the optimal parameters in this case are c = x̄ and s = s̄.

Because we specify a prior distribution on parameters in Definition 1, we can express Bayesian operations as MLMs. Consider the Naive Bayes classifier [18], whose input space is a set X. Similarly, the output space is the set Y = {y₁, . . ., y_k} for k < ∞. Assuming conditional independence between the input features, the morphism chooses the class which maximizes the posterior distribution, argmax_{i ∈ {1,...,k}} P(Y = y_i | X = x; p*), and the empirical risk function is 0-1 loss [18]. The parameter prior P(p) is a distribution on the parameters of the likelihood P(X = x | Y = y; p) and the class prior P(y; p).

Now that we have shown examples of three MLMs, it is natural to investigate how they relate to one another. Property 1 establishes that, for certain risk functions, learning the parameters of one MLM can be decomposed into learning the parameters of two MLMs with lower dimensional parameter spaces and potentially lower dimensional input spaces.

Property 1. First define two separate MLMs with the same output space: ML₁ with input space X₁, output space Y, parameter prior P(p₁), morphism F₁, and risk function R̂₁; and ML₂ with input space X₂, output space Y, parameter prior P(p₂), morphism F₂, and risk function R̂₂. Let Θ₁ and Θ₂ represent the respective parameter spaces of p₁ and p₂.
If ML₃ is an MLM whose morphism combines F₁ and F₂, and its risk function R̂₃ separates into terms that depend only on p₁ and only on p₂ (for example, a weighted sum of R̂₁ and R̂₂), then the optimal parameters of ML₃ can be found by minimizing R̂₁ and R̂₂ independently. Proof: in appendix. Examples of closed operations between MLMs when Y = ℝ include addition, multiplication of two MLMs (for example, multiplying probability density functions), and division. If the output space is binary, then Boolean logic operations are valid.
In the example of linear regression, the least squares risk function can be decomposed into two lower dimensional least squares risk functions if the training realizations X form an orthogonal matrix. This could be a property of the feature space or the result of applying a transformation such as principal component analysis. To see this decomposition, let ML₃ be the MLM representing linear regression trained over X, with regression coefficients p. Split the coefficients into p = (p₁, p₂), where p₁ ∈ ℝ^c and p₂ ∈ ℝ^{m−c}, and the training matrix into X = [X₁ X₂], where X₁ is the first c columns of X and X₂ contains the remaining columns. Let ML₁ and ML₂ represent linear regressions trained on X₁ and X₂ respectively, using empirical risk functions R̂₁ = ‖Y − X₁p₁‖²₂ and R̂₂ = ‖Y − X₂p₂‖²₂. For ML₃, the risk function is R̂₃ = ‖Y − X₁p₁ − X₂p₂‖²₂, which clearly satisfies Property 1. In the derivation (provided in the appendix) we leverage the fact that X₁ᵀX₂ and other cross terms form a zero matrix due to the orthogonality of X. In this case each regression coefficient can be found by solving its own independent one dimensional least squares problem, which could be easily parallelized for a very fast regression.
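The decomposition above is easy to check numerically. In this sketch of ours, an orthogonal design matrix is manufactured via a QR decomposition, and the jointly fitted least squares coefficients coincide with the coefficients fitted independently on each column block.

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a design matrix with orthonormal columns via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(50, 4)))
X1, X2 = Q[:, :2], Q[:, 2:]   # split columns: X = [X1 X2]
Y = rng.normal(size=50)

# ML3: joint least squares over all of X
p3, *_ = np.linalg.lstsq(Q, Y, rcond=None)
# ML1, ML2: independent least squares on each block
p1, *_ = np.linalg.lstsq(X1, Y, rcond=None)
p2, *_ = np.linalg.lstsq(X2, Y, rcond=None)

# Orthogonality makes the joint solution the concatenation of the parts
print(np.allclose(p3, np.concatenate([p1, p2])))  # True
```

With non-orthogonal columns the cross terms X₁ᵀX₂ no longer vanish and the equality fails, which is exactly the condition Property 1 requires.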
If Y is a vector space, then ensemble models featuring linear combinations of independently trained MLMs can be decomposed into problems with lower dimensional parameter spaces. For example, in random forests each tree is independently trained (common risk functions include Gini Impurity or Information Gain) on random and possibly overlapping subsets of the feature space using the random subspace method [19]. The parameters of each tree are the splitting criteria, and the output of a random forest is the average response of all trees in the forest, which is interpreted either as a majority vote or as a class probability. Because the trees are trained independently of each other, we can define a risk function for the random forest as the sum of the Gini Impurity in each tree and use Property 1 to decompose the forest into its individual trees.
We are also interested in sequential operations using MLMs, so we need a notion of composition. For example, we may wish to first standardize our data and then perform linear regression. Definition 2. Let ML₁ and ML₂ be two MLMs such that the output space of ML₁ is the same as the input space of ML₂. We define the composition ML₃ = ML₂ ∘ ML₁ as an MLM with structure:
• Input Space: X₁,
• Output Space: Y₂,
• Morphism: F(x; (p₁, p₂)) = F₂(F₁(x; p₁); p₂),
• Parameter Prior: P(p) := P(p₁, p₂), (18)
• Empirical Risk Function: R̂₂(p₂; X, Y, F₁(·; p₁), P(p)), (20)
and the learned parameters are given by (p₁*, p₂*) = argmin_{(p₁, p₂) ∈ Θ₁ × Θ₂} R̂₂(p₂; X, Y, F₁(·; p₁), P(p)). The output morphism, a composition of F₁ and F₂, follows the general form for composition of morphisms, and reflects the sequence of transformations applied to data in a machine learning application. The risk function used in this definition optimizes the task over every available parameter. This was motivated by the idea that this risk function is oriented toward the task or machine learning goal, so there is an opportunity to improve task performance by jointly optimizing the parameters.
In our example of standardization (ML₁) followed by linear regression (ML₂), ML₃ = ML₂ ∘ ML₁ has structure:

F(x; (p, c, s)) = ((x − c) diag(s)⁻¹) · p, (24)
R̂(p; X, Y, F(·; (p, c, s)), P(p, c, s)) = ‖Y − (X − c) diag(s)⁻¹ p‖²₂. (25)

In linear regression, standardization is extremely common, and MLMs such as PCA (shown in Section 3) require standardized input data. However, in some applications it may be of interest to give more weight to one column than another, and exploring non-standard or non-linear scalings may be an interesting future application.
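The standardize-then-regress composition can be sketched directly. The NumPy example below is illustrative and fits the two MLMs sequentially (the standardization parameters from their usual closed form, the regression by least squares); the synthetic linear data and the added intercept column are our own choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
Y = X @ np.array([1.5, -0.5]) + 4.0   # noiseless linear target, for illustration

# ML1 (standardization): c = column means, s = column standard deviations
c, s = X.mean(axis=0), X.std(axis=0)
F1 = lambda x: (x - c) / s

# ML2 (linear regression with intercept) trained on the output of ML1
Z = np.column_stack([np.ones(len(X)), F1(X)])
p, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# ML3 = ML2 ∘ ML1 acts on new points through the composed morphism
F3 = lambda x: np.concatenate([[1.0], F1(x)]) @ p
```

Because the target is exactly linear, the composed morphism reproduces it on new points; with real data the composition would instead minimize the final risk in Eq 25.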

Collections of morphisms
In machine learning, we are often interested in ensembles of models, evaluating model performance, and model selection. To incorporate these tasks into the MLM framework, we need a notion of multiple MLMs gathered together. Definition 3. Given MLMs ML₁, ML₂, . . ., ML_k, the collection C_ML of MLMs is the set C_ML = {ML₁, ML₂, . . ., ML_k}. We call the number of morphisms, k, the dimension of C_ML.
In the numerical experiments presented in Section IV, we use the idea of collections for model selection and evaluation. Once data has been collected, it is standard practice in machine learning to split the data into a Training Set used to develop a workflow and choose optimal parameters, a Validation Set used to assess and tune model performance during development, and a Testing Set used to assess performance after development is completed [24]. Evaluating the performance of a model is itself an MLM, which we will denote ML_eval.
The input space is the set of all k-dimensional collections of MLMs with output space Y, which we will denote V_Y^k. The output space (often ℝ or ℝ^k) represents the performance of the model(s) under consideration.
There are two major sets of parameters. The first are any parameters p₁ required by the morphisms in the collection, which have prior P(p₁). The second are the parameters of the evaluation function, p₂, which have prior P(p₂). Assuming these parameters are independent, the final parameter prior is P(p₁)P(p₂).
The morphism evaluates the performance of the collection of MLMs on a new set of realizations. Examples include sensitivity, the Receiver Operating Characteristic (ROC) [25], or the Akaike Information Criterion (AIC) [26]. A model selection morphism chooses the MLM in a collection that maximizes the chosen performance measure. The risk function performs whatever optimization is required by the performance measure, such as a complexity criterion.
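A model-selection morphism over a collection can be sketched in a few lines. This example is ours: accuracy stands in for the performance measures listed above, and the two threshold classifiers are toy MLMs used only to exercise the selection step.

```python
import numpy as np

def select_model(collection, X_val, y_val, metric):
    """Model-selection morphism: score every MLM in the collection on
    held-out realizations and return the best one with all scores."""
    scores = [metric(y_val, np.array([m(x) for x in X_val])) for m in collection]
    return collection[int(np.argmax(scores))], scores

accuracy = lambda y, yhat: np.mean(y == yhat)

# Two toy threshold classifiers over a 1-D input
collection = [lambda x: int(x > 0.5), lambda x: int(x > 5.0)]
X_val = np.array([0.0, 1.0, 2.0, 6.0])
y_val = np.array([0, 1, 1, 1])
best, scores = select_model(collection, X_val, y_val, accuracy)
```

Here `best` is the first classifier (validation accuracy 1.0 versus 0.5); swapping in sensitivity or an information criterion changes only the `metric` argument.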

Machine learning workflow
Now that we have established operations and composition on MLMs, we define the Machine Learning Workflow (MLW or "workflow").

Definition 4. A Machine Learning Workflow is a finite composition of k MLMs, M = ML_k ∘ ML_{k−1} ∘ ⋯ ∘ ML₁.
This definition is inspired by the formulation of artificial neural networks as compositions of functions. Instead of sigmoids, our composition represents different steps in a machine learning application such as data preprocessing, feature extraction, and model training. A workflow can contain any number of MLMs, and each MLM could possibly be broken down into more MLMs. This representation allows us to present an application with a single master equation, which keeps track of the parameter space, and optimizes over a single risk function. This has advantages in both presentation and organization. If we are evaluating several workflows over the same outcome space (for example, models created for the Netflix Competition), we can use the MLM framework to create an ontology of workflows for evaluation and comparison.

Binary classification use case
We are specifically interested in the case when data points x ∈ X are sets composed of both real-valued continuous variables and categorical variables, and the output space is Y = {0, 1} (binary classification). Further, we are interested in the case when there is a class imbalance, e.g. significantly more observed samples with y = 0 than y = 1. We were developing a machine learning model to identify hospital patients at high risk of 30-day unplanned readmission, which occurs when patients come back to the hospital within 30 days of discharge. We developed this framework to better keep track of how the data was manipulated, and discovered a potentially useful application of the Topological Data Analysis algorithm Mapper as an MLM. The rest of this paper will describe workflows we developed in terms of the MLM framework.

TDA Mapper as a MLM
In this section we present a brief overview of the TDA Mapper algorithm, show that it fits into the MLM framework, and build some example workflows.

Mapper algorithm
Topological Data Analysis assumes that the space X can be endowed with a collection of subsets O, called open sets, such that the ordered pair (X, O) forms a topological space. TDA builds topological spaces on top of data points and evaluates the shape of the computed spaces [27]. The critical features are closed loops in various dimensions, which are invariant to rotation and multiplicative scaling. When X is also endowed with a probability space, the topological features inherit a probability space [28] and can be used in machine learning. TDA has been used as a novel visualization tool in biomedical applications [29] [30], text mining [31], and remote sensing [32].
One visualization tool developed for TDA at Stanford is the Mapper Algorithm [4]. It creates a graphical representation of the data that keeps an equivalent topological structure, and has been used in a wide variety of applications [33] [34]. Mapper is usually used as a method for clustering and visualization. Interesting clusters or patterns are used as a feature selection method to reduce the dimensionality of data before training learning models.
To construct the Mapper graph, first define a filtration function A : X → ℝ (note: the range need not be ℝ, but we use it here for simplicity). Then define an equivalence relation ∼_A such that x₁ ∼_A x₂ whenever A(x₁) = A(x₂), which collapses every level set of A to a single point. The Reeb Graph is the quotient space of X under the relation ∼_A. Mathematically, the Mapper algorithm computationally approximates the Reeb Graph by computing the nerve of a refined pullback of an open cover O of A(X). Practically, Mapper assigns data points to the members of O, and then performs clustering within each member of the open cover. It then creates a graph G with a set of nodes N_i ∈ N representing the clusters, and a set of edges E, where an edge e_ij means that two clusters have non-empty intersection. Mapper has been proven to converge exactly to the Reeb Graph if O is refined enough [35]. The full algorithm is described in [4], and rough pseudocode of the computational algorithm is given in Algorithm 1.
In topology, an abstract simplicial complex is a family of non-empty finite sets that is closed under taking non-empty subsets. One of the main ideas of TDA is to create abstract simplicial complexes from sets of data [36]. The pullback operator on an open cover has a more complicated definition that is out of the scope of this paper, but the nerve of an open cover is a representation of the open cover as an abstract simplicial complex. The Mapper algorithm computes the nerve of O using the procedure in Algorithm 1, and the result is a graph showing the "shape" of the data [4].
Algorithm 1 Description of the TDA Mapper Algorithm by Singh, Memoli, and Carlsson et al.

Input:
• Data X, distance metric D on X, filtration function A : X → ℝ, overlapping intervals I₁, . . ., I_r covering A(X), number of clusters b
1: N ← ∅, E ← ∅
2: for j = 1, . . ., r do
3:   Perform clustering such as k-means, using b clusters, on x ∈ X ∩ A⁻¹(I_j)
4:   Append each cluster N_i, for i = 1, 2, 3, . . ., to N
5: end for
6: for N_i, N_j ∈ N do
7:   if N_i ∩ N_j ≠ ∅ then
8:     Append edge e_ij to E
9:   end if
10: end for
11: return N, E
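Algorithm 1 can be sketched compactly in NumPy. The function below is an illustration of ours, not the authors' implementation: naive single-linkage thresholding (parameter `eps`) stands in for the k-means step, the cover is built from equal-length overlapping intervals, and the filtration is passed in as a Python function.

```python
import numpy as np

def mapper_graph(X, filt, n_intervals=4, overlap=0.5, eps=1.0):
    """Rough sketch of the Mapper construction in Algorithm 1.

    filt maps each row of X to a real number. Clustering inside each
    preimage uses naive single-linkage with threshold eps (a stand-in
    for the k-means step described in the paper).
    """
    f = np.array([filt(x) for x in X])
    lo, hi = f.min(), f.max()
    # equal-length intervals with the requested fractional overlap
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    nodes = []  # each node is an array of data-point indices
    for j in range(n_intervals):
        a = lo + j * step
        idx = np.where((f >= a) & (f <= a + length))[0]
        if idx.size == 0:
            continue
        remaining = set(idx.tolist())
        while remaining:  # grow single-linkage clusters within the preimage
            cluster = {remaining.pop()}
            grew = True
            while grew:
                grew = False
                for i in list(remaining):
                    if min(np.linalg.norm(X[i] - X[c]) for c in cluster) <= eps:
                        remaining.remove(i)
                        cluster.add(i)
                        grew = True
            nodes.append(np.array(sorted(cluster)))
    # connect clusters that share at least one data point
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if np.intersect1d(nodes[i], nodes[j]).size > 0]
    return nodes, edges
```

On one-dimensional data with the identity filtration, two overlapping intervals produce two clusters joined by a single edge, the expected "line" shape.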

Machine learning workflows with Mapper
The set of nodes N = {N₁, . . ., N_w} output by Mapper represents a cover {X_i}_{i=1}^{w} of X. Because Mapper approximates the Reeb Graph, we use the graph edit distance (GED) [37] between the Mapper graph and the "true" Reeb Graph as the risk function for the Mapper MLM. The graph edit distance between graphs G₁ and G₂ sums the cost of the operations necessary to transform G₁ into G₂. These operations commonly include adding/deleting edges and nodes, and changing the labels of nodes. Formally, if B = [B₁, B₂, . . ., B_k] contains the graph operations necessary to transform G₁ into G₂ and C : B → ℝ⁺ is a cost function, then the graph edit distance is GED(G₁, G₂) = min_B Σ_{i=1}^{k} C(B_i). In Section IV, we use grid search to search through the parameters of Mapper. To bring this MLM closer to the realm of statistical learning theory, future work could extend the statistical analysis from [35] to define a computationally tractable loss function and a more informative parameter prior. However, in the context of the larger workflow, the choice of risk is less relevant because the parameters are optimized over the final risk function.
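For intuition, here is a deliberately restricted edit distance of ours: unit-cost edge insertions/deletions only, on graphs sharing the same labeled node set. The general GED also edits nodes and labels and requires search over node correspondences, which this sketch avoids.

```python
def edge_edit_distance(edges1, edges2):
    """Unit-cost graph edit distance restricted to edge insertions and
    deletions, assuming the two graphs share the same labeled node set."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    return len(e1 ^ e2)  # edges that must be added or removed

# A path on 3 nodes vs. a triangle: one edge insertion
print(edge_edit_distance([(0, 1), (1, 2)], [(0, 1), (1, 2), (2, 0)]))  # 1
```

Each edge in the symmetric difference corresponds to exactly one unit-cost operation, so the count is the minimal edit cost under this restriction.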
We build an example machine learning workflow with Mapper and logistic regression as follows: • The input space is X, with training realizations X_TR, validation realizations X_V, and testing realizations X_TS.
• ML₀: Dummy coding of the original data matrix, which embeds the data into ℝ^m.
• ML₁: The Mapper MLM trained on realizations X_TR: • If the first principal component is the filtration function a, then this MLM features a composition with the PCA MLM.
• For computational reasons down the line, we remove vertices with fewer than 40 data points, so the output is not a total cover of X. When the first principal component is used, this seems to have the effect of removing outliers with very high or low PCA scores.
• ML₁ outputs a set of spaces {X_i}_{i=1}^{k}, with the training realizations separated into groups {X_TR,i ⊆ X_i}. • ML₂: An ensemble of node models, where each ML₂^i has structure: • Input Space: ℝ^m, • Output Space: Y = [0, 1], representing the class probability P(y = 1), • Morphism: a composition of feature extraction (e.g. PCA) and a learning machine (e.g. logistic regression) trained on X_TR,i, • Parameter Prior: Gaussian priors on the regression coefficients for each node, uniform priors on the parameters of the Mapper MLM.
• C_i : X → ℝ is a weighting function depending on where points lie in the input space, related to which nodes of the Mapper graph are "active" for a given data point.
• ML₃: A decision threshold with: • Morphism: F(q; T) = 1 if q ≥ T and 0 otherwise, for threshold T ∈ [0, 1], (28) • Parameter Prior: prior information on the threshold T ∈ [0, 1], • Risk Function: a method to choose the probability threshold, e.g. choosing an optimal threshold of a ROC curve over cross-validation sets.
The full workflow is M = ML₃ ∘ ML₂ ∘ ML₁ ∘ ML₀. (29) This MLW creates and optimizes a separate workflow ML₂^i for each node created by the Mapper graph. Then, we leverage Property 1 and the fact that each workflow outputs class probabilities to create a weighted ensemble of workflows. Fig 1 shows a block diagram representation of the full workflow using Mapper. Final model evaluation uses M as an input to an evaluation MLM using realizations from the test set X_TS. The weights C_i(x) in ML₂ represent an interesting choice. Intuitively, weights should be non-zero only when a point lies in X_i, so only a portion of the models are "active" for a given point. Because they combine elements of a probability space, Σ_{i=1}^{w} C_i(x) = 1 for all x. Options for the weighting parameters include: • Assign a weight of 1 to the "closest" node and 0 to all others.
• Assign equal weight to all nodes to which the point belongs, and 0 to all others.
• Assign weight inversely proportional to the distance from the center of the interval assigned to that node.
• Assign weight proportional to the cross validation metrics of each model, i.e. models that perform better on the training data are assigned higher weights.
• Train weights within cross validation by defining a loss function based on the metric of interest.
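Whatever weighting scheme is chosen, the ensemble step combines node probabilities the same way. The sketch below is ours: the two constant node models and the equal-weight scheme are placeholders, and the weights are renormalized so the output stays a valid probability.

```python
import numpy as np

def ensemble_probability(x, node_models, weights):
    """ML2 sketch: weighted ensemble of per-node class probabilities.

    weights(x) returns C_i(x) for every node; the active weights are
    renormalized to sum to 1, keeping the output a probability.
    """
    w = np.asarray(weights(x), dtype=float)
    w = w / w.sum()
    probs = np.array([m(x) for m in node_models])
    return float(w @ probs)

# Two toy node models and equal weighting of the active nodes
node_models = [lambda x: 0.9, lambda x: 0.1]
weights = lambda x: [1.0, 1.0]  # both nodes "active" with equal weight
p = ensemble_probability(0.0, node_models, weights)
```

With equal weights the result is the plain average of the node probabilities; distance- or performance-based weights simply change the `weights` callable.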
The Mapper algorithm could be replaced with another clustering algorithm such as k-means, or any other mapping that chooses subsets of data. We chose the Mapper algorithm because the subsets it generates have some appealing properties. First, the points in one X_i are all "close" in the sense of the filtration function, but may have a different internal structure than another node to which they are not connected. A column of X_i may be positively correlated with the outcome variable, while the same column of X_j may be negatively correlated.
When considered in a model over the entire dataset, these correlations may "compete" with each other. Furthermore, with the proper choice of a filtration function, some nodes may have a higher incidence of the outcome variable. In previous literature this was done in order to identify subsets with high minority prevalence for further study. By training a model only on such a node, we reduce some of the class imbalance on that set, theoretically increasing model performance. Finally, many clustering methods do not produce overlapping clusters, but in [39], training classifiers on overlapping clusters was shown to improve performance. The Mapper algorithm allows data points that fall into multiple nodes of the Mapper graph to contribute information to each node's model.
To evaluate the workflow M, use the workflow as input to an MLM that evaluates classifier performance over new realizations from X (the testing set). In the next section we focus on ROC area under the curve (AUC), Sensitivity, and Specificity as different training metrics.
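The threshold selection in ML₃ can be illustrated concretely. The snippet below is a sketch of ours, not the paper's procedure: it selects the threshold maximizing Youden's J (sensitivity + specificity − 1) over the observed scores, one common way of picking an "optimal" point on the ROC curve.

```python
import numpy as np

def best_threshold(scores, y):
    """ML3 risk sketch: choose the decision threshold T maximizing
    Youden's J = sensitivity + specificity - 1 over candidate thresholds."""
    best_t, best_j = 0.5, -1.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        sens = (pred[y == 1] == 1).mean()  # true positive rate
        spec = (pred[y == 0] == 0).mean()  # true negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```

In practice the authors tune this threshold on cross-validation hold-out sets rather than the training scores, but the search itself has this form.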

Comparisons
We built several versions of Eq 29 across two real world datasets. The workflow breakdown is as follows: • Realizations: Always an 80/20 training/testing split.
• ML₀: Dummy coding performed using the default parameters from the caret package. • ML₁: The Mapper parameters (the number of intervals, the percent overlap, and the number of bins when clustering) were tuned by grid search. We fixed the filtration function a as the first principal component, and the distance metric d as the Gower metric, which means we used a Dirac delta as the prior for these parameters.
• ML₂: Node models. • Cross Validation: Used the caret package in R to generate 10-fold cross validation sets to tune model hyperparameters, such as the number of trees in the random forests.
• Weighting functions C_i: Points are assigned to nodes by computing the filtration function, assigning points to the appropriate intervals, and then finding the closest cluster within each interval. Weights were then assigned to the one or two closest nodes. If two nodes are used, we use either equal weights or weights proportional to the ROC performance on the validation set.
• ML₃: Decision threshold function as defined in Eq 28.
Mapper graphs used the first principal component as the filtration function, and the other parameters were tuned by grid search. Preliminary investigations with other filtration functions on patient record data revealed that the first PC seemed to yield the best classifier performance, so it was fixed as the filtration function for each experiment. These graphs tended not to find any loops or interesting topological structures when using that particular filtration function. However, there was usually a good spread with respect to the outcome variable, where some nodes have a high minority class occurrence and others have a very low one.
For the risk function, we optimized ML 2 independently of the whole process, via methods such as maximum likelihood (for logistic regression), using models from caret. The decision threshold in ML 3 and the parameters of the Mapper algorithm were selected, via grid search, to optimize the average ROC AUC over the hold-out sets from cross validation.
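The grid search just described can be sketched as below: enumerate every combination of Mapper parameters and decision threshold, score each by its mean hold-out ROC AUC, and keep the best. The helper `cv_auc` is a hypothetical stand-in for refitting the workflow and scoring the cross-validation folds.

```python
# Sketch of the grid search over Mapper parameters and the ML_3
# decision threshold, maximizing mean hold-out ROC AUC.
import itertools

def grid_search(cv_auc, intervals_grid, overlap_grid, threshold_grid):
    """cv_auc(params) -> list of hold-out ROC AUCs for one combination
    of (n_intervals, overlap, threshold). Hypothetical callback."""
    best, best_score = None, float('-inf')
    for params in itertools.product(intervals_grid, overlap_grid,
                                    threshold_grid):
        aucs = cv_auc(params)
        score = sum(aucs) / len(aucs)   # average over CV hold-out sets
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```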
The results are grouped by the type of learning machine used in ML 2 . Each workflow uses only one of these types; experimenting with more involved model selection on different nodes X i is an interesting direction for future work. Every workflow tested was run with 10 different training/testing splits, and the resulting performance measures were averaged. The workflows are named by Mapper/No Transformation, PCA (if applicable), sampling method (if applicable), and node weights (if applicable).

Hospital readmissions data
The original dataset used in development of this project comes from 776 Medicare/Medicaid patients admitted to Barnes Jewish Hospital in St. Louis, MO between April 2015 and April 2016. We seek to predict whether or not a patient will be readmitted to the hospital within 30 days of discharge. The variables collected are: LACE Risk Score, presence of diabetes, principal diagnoses from ICD9/10 codes, gender, ethnicity, zip code readmission rate, length of stay, age, and presence of a primary care provider. 134 (17.3%) of the patients were readmitted. In each run of the simulation, the data was divided into 621 training and 155 test patients. Descriptions of the population under study are given in Table 3.

Tables 4-7 show the classification results for multiple workflows, grouped by the type of classifier and averaged over 10 runs. The chosen Mapper parameters were 10 intervals, 50% overlap, and 20 bins when clustering. Typical Mapper graphs had 10 nodes, each with 40-200 patients per node. We report the ROC AUC, Sensitivity, Specificity, and Accuracy for each model tested, but note that accuracy in this case is biased heavily towards the specificity score, since negatives make up 83% of the patients. The Mapper graphs were tuned using the first principal component of the entire dataset as the filtration function; a typical graph is shown in Fig 2. Based on a grid search, bin overlaps of 40-50% yielded roughly the same results, with 10 intervals as the clustering parameter. Each run produced 5-10 nodes, with readmission rates ranging from 5% to 30%.

Table 5. Results for different workflows of SVMs for hospital readmissions data, with standard deviations over n = 10 runs. (Columns: Workflow, ROC AUC, Sensitivity, Specificity, Accuracy.)

Table 6. Results for different workflows of random forests for hospital readmissions data, with standard deviations over n = 10 runs.
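The accuracy bias mentioned above follows from decomposing accuracy by class prevalence: with 83% negatives, accuracy = 0.83 · specificity + 0.17 · sensitivity, so a degenerate classifier that predicts "no readmission" for everyone already scores 0.83. A one-line check:

```python
# Accuracy decomposed by class prevalence: with 83% negatives,
# accuracy = 0.83 * specificity + 0.17 * sensitivity.
def accuracy(spec, sens, neg_rate=0.83):
    return neg_rate * spec + (1 - neg_rate) * sens

trivial = accuracy(spec=1.0, sens=0.0)  # "never readmitted" classifier
```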
This dataset catalyzed the use of the Mapper graph in the ML workflow. Models trained on the entire dataset do not perform well, and we were aiming to build a workflow that improved upon the LACE score. LACE, a combination of Length of previous hospital stays, Acuity of admission (emergency/not emergency), the Charlson Comorbidity Index, and the number of prior Emergency Department visits, is the most widely used tool to predict readmission risk. However, our population has a large majority of "high" risk patients, and logistic regression trained on LACE results in a ROC AUC of 0.59, with optimal sensitivity of 0.54, which correctly predicts only slightly more than half of all readmissions. Incorporating Mapper into our workflow resulted in large performance increases over the LACE model.
For logistic regression, the Mapper workflow with PCA computed on individual nodes and SMOTE sampling outperforms all other workflows, with Mapper/No Transformation (NT) and NT + SMOTE also showing higher performance than the rest. Sensitivity is of particular interest in this problem: the goal is to identify as many high-risk patients as possible and target them with additional resources.
The SVM classifier performed best among the out-of-the-box models with no sampling or other transformations; however, sensitivity was significantly increased by the models using Mapper, node PCA, and SMOTE/ROSE sampling. None of the models using Mapper in conjunction with random forests showed large improvement over training the random forests on the entire training set. It should be noted that the models running random forests with ROSE sampling in the caret package did not converge, and voted "no readmission" for every point in the test set. This issue also occurred when using AdaBoost classifiers with ROSE sampling in Table 7.
AdaBoost models were improved both by sampling and in two of the Mapper workflows. One case to note is the PCA+ROSE combination, which features a "high" AUC but low sensitivity. In this case we would discard that model in favor of the Mapper/no transformation/SMOTE workflow, which correctly identifies almost 3/4 of the readmitted patients. One reason AdaBoost models might perform better with the Mapper algorithm is that AdaBoost is often used on smaller datasets, which the Mapper workflow naturally creates. Compared to the out-of-the-box models trained using caret, workflows utilizing Mapper tend to have higher variance between runs. This can be explained by the additional variance introduced by assigning the testing points to different nodes of the Mapper graph: since each run has a different testing set, different node models predict different numbers of testing points. Further variance is introduced by the SMOTE and ROSE sampling methods.

Additional testing: German credit data
For replication and additional testing, we built workflows on the German Credit data available through the UCI Machine Learning Repository [41]. The dataset contains 20 features, which are described in Table 8. We chose this dataset because it has a mix of categorical and continuous features and a class imbalance similar to the Hospital Readmissions data. The selected Mapper parameters were 10 intervals, 30% overlap, and 15 bins when clustering. A typical Mapper graph, shown in Fig 3, contained 7 nodes with 50-400 points per node. The goal of these workflows is to classify applicants as "good" or "bad" credit risks.
In the numerical comparisons, we encountered convergence issues with the AdaBoost algorithm and ROSE sampling scheme, so we do not report experiments featuring those MLMs.
On the credit data, we also experimented with adding the weighted interactions between two models. The results of the comparisons are reported in Tables 9-11. The weights were either equal or weighted towards the best ROC AUC: if a testing point was assigned to nodes i and j, we used the weights w_i = AUC_i / (AUC_i + AUC_j) and w_j = AUC_j / (AUC_i + AUC_j), where AUC_i and AUC_j are the averages of the ROC AUCs computed on the hold-out sets during cross validation. An interesting property of this dataset is that training models with no transformations yields high ROC AUC (compared to models trained on the readmissions data) across every model type, with roughly the same performance when PCA is applied across the whole dataset. The Mapper/Logistic Regression workflow produced lower ROC AUCs when no transformations were used, but similar AUCs, and slightly lower specificity, when node PCA was applied.
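The AUC-proportional weighting described above amounts to normalizing the two nodes' validation AUCs and blending their predicted probabilities; a minimal sketch (function names are our own):

```python
# AUC-proportional weights for a point assigned to nodes i and j:
# w_i = AUC_i / (AUC_i + AUC_j), w_j = AUC_j / (AUC_i + AUC_j).
def auc_weights(auc_i, auc_j):
    total = auc_i + auc_j
    return auc_i / total, auc_j / total

def blended_probability(p_i, p_j, auc_i, auc_j):
    """Blend the two node models' predicted probabilities."""
    w_i, w_j = auc_weights(auc_i, auc_j)
    return w_i * p_i + w_j * p_j
```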
Random forest classifiers using the Mapper workflow achieved lower AUC than a random forest trained on the original dataset, with most of the loss due to much lower specificity across the board. Applying node PCA decreased random forest performance further, unlike in the logistic regression case. SVM classifiers with Mapper also performed poorly across the board when no node transformations were applied. However, applying PCA to the Mapper nodes significantly outperformed the Mapper workflows with no transformations, with results similar to workflows trained on the entire dataset.
When a weighted combination of two models was used instead of one (more than one C i (x) > 0), the results were either the same or slightly improved. One reason for the unchanged results is that models trained on overlapping nodes tend to be similar, while models from nodes on opposite ends of the Mapper graph differ. Similar models give roughly the same probability, so the weighted combination is roughly the same as using the closest model alone. More experimentation and analysis should be done to tune these weights.

Discussion
In summary, we presented the machine learning workflow as a composition of machine learning morphisms, starting from the original data space X, and ending with task completion. We presented several properties of these morphisms, including when groups of MLMs form a vector space. Then we presented a workflow using the TDA Mapper algorithm as an MLM to split X into overlapping subsets and train models on each subset. We discussed implications of this workflow and presented results from two applications on real data. In the case of Hospital Readmissions we found several models using the Mapper algorithm that yielded increased performance. In the case of the German Credit data, models using Mapper and node PCA performed similarly to workflows trained without Mapper. However, the closeness of some of the models suggests that further tuning may be all that is needed to improve performance.
Besides potential performance improvements, workflows utilizing Mapper provide the advantage of parallelization. On the German Credit data, we showed several cases where models trained on smaller datasets and then pooled together performed just as well as models trained on the whole dataset. Once the data is split, each model is trained on a significantly smaller portion of the data. The training can be parallelized, which has the potential to significantly decrease training time on large datasets, or for models such as neural networks and SVMs whose training scales nonlinearly with the number of data points. Limitations of the numerical experiments include limited data: each node is assigned only a few samples from the test set, which adds uncertainty to the results. The bounds of this uncertainty should be further analyzed, and these workflows should be evaluated on datasets with more samples.
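The parallelization argument can be made concrete with a small sketch: once Mapper has partitioned the training data into (overlapping) nodes, the per-node models are independent and can be fitted concurrently. `train_node` below is a hypothetical stand-in for fitting one node model.

```python
# Sketch of concurrent per-node training. The "model" here is a
# trivial placeholder (the node mean) standing in for a real fit.
from concurrent.futures import ThreadPoolExecutor

def train_node(node_data):
    # Stand-in for fitting one node model on its subset of the data.
    return sum(node_data) / len(node_data)

def train_all(nodes, max_workers=4):
    """Fit one model per Mapper node, concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_node, nodes))
```

In practice a process pool (or a cluster scheduler) would replace the thread pool for CPU-bound model fitting; the structure is the same because the node fits share no state.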
One direction for future work is to examine the theoretical performance of the workflow utilizing TDA. TDA is a natural fit for this framework because it is intrinsically interested in the study of data spaces and the morphisms between those spaces.
Other directions of future work include studying the optimization of MLM based workflows given a choice of risk functions. When should parameters be optimized independently, and when should they be formulated as a joint optimization over a risk function? When the morphisms and risk functions are all continuous and differentiable functions, the workflow is similar or equivalent to a simple feed forward neural network, and we could leverage the research done on neural networks to jointly optimize large ensembles of models. Finally, vector spaces of MLMs could be leveraged to build and analyze ensembles.
In conclusion, the MLM framework allows us to concisely build multiple well organized workflows, from preprocessing to task completion, and provides a medium for performance analysis and optimization of all workflow parameters.

Appendix
[Proof of Property 1] Proof. Given that they have the same output space, to show that ML 3 = ML 1 ∘ ML 2 , it is sufficient to establish that the optimal parameters of ML 3 are the same as the optimal parameters of ML 1 and ML 2 . Let