A modeling and machine learning approach to ECG feature engineering for the detection of ischemia using pseudo-ECG

Early detection of coronary heart disease (CHD) has the potential to prevent the millions of deaths that this disease causes worldwide every year. However, there exist few automatic methods to detect CHD at an early stage. A challenge in the development of these methods is the absence of relevant datasets for their training and validation. Here, the ten Tusscher-Panfilov 2006 model and the O’Hara-Rudy model for human myocytes were used to create two populations of models that were in concordance with data obtained from healthy individuals (control populations) and included inter-subject variability. The effects of ischemia were subsequently included in the control populations to simulate the effects of mild and severe ischemic events on single cells, full ischemic cables of cells and cables of cells with various sizes of ischemic regions. Action potential and pseudo-ECG biomarkers were measured to assess how the evolution of ischemia could be quantified. Finally, two neural network classifiers were trained to identify the different degrees of ischemia using the pseudo-ECG biomarkers. The control populations showed action potential and pseudo-ECG biomarkers within the physiological ranges and the trends in the biomarkers commonly identified in ischemic patients were observed in the ischemic populations. On the one hand, inter-subject variability in the ischemic pseudo-ECGs precluded the detection and classification of early ischemic events using any single biomarker. On the other hand, the neural networks showed sensitivity and positive predictive value above 95%. Additionally, the neural networks revealed that the biomarkers that were relevant for the detection of ischemia were different from those relevant for its classification. This work showed that a computational approach could be used, when data is scarce, to validate proof-of-concept machine learning methods to detect ischemic events.

In this work, the networks were implemented using a general framework that allows dense neural networks to be trained for any application. This framework takes advantage of the improved computational capabilities available when using matrix operations. Consequently, each epoch of the training process is done "in one go": all the training examples are processed simultaneously. Additionally, a bias term is included in every layer, as is common practice in deep learning approaches. The input to the network is therefore a matrix of the form:

$$X_b = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1M} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{NM} \end{pmatrix}$$

The input matrix (X_b) has size N × (M + 1), where N is the number of training examples, M is the number of input features and term x_ij corresponds to feature j of example i. The network then performs an affine transformation of the inputs to project them into the first hidden layer. This requires a matrix of weights:

$$\Theta^{(1)} = \begin{pmatrix} \theta^{(1)}_{01} & \cdots & \theta^{(1)}_{0S} \\ \vdots & \ddots & \vdots \\ \theta^{(1)}_{M1} & \cdots & \theta^{(1)}_{MS} \end{pmatrix}$$

which has size (M + 1) × S, where S is the number of hidden neurons in the first hidden layer and term θ^(1)_ij multiplies input feature i when performing the affine transformation from the input layer into neuron j. The affine transformation for all the training examples is then done with the matrix multiplication:

$$Z^{(1)} = \left(X_b\,\Theta^{(1)}\right)^{\top}$$

where Z^(1) is a matrix of size S × N, and each term is the affine transformation of input example j projected onto neuron i of the first hidden layer. In this work, the hidden neurons were defined as rectified linear units (ReLU), which means that the activation matrix of the first hidden layer (A^(1)) is defined by the element-wise operation:

$$A^{(1)} = \max\left(0,\, Z^{(1)}\right)$$

A row of ones (a bias term for each example i) is then added on top of the activation matrix, and this new matrix (A^(1)_b, of size (S + 1) × N) is used to calculate the activations of the second hidden layer using another weight matrix (Θ^(2)) such that:

$$A^{(2)} = \max\left(0,\, \Theta^{(2)\top} A^{(1)}_b\right)$$

This is repeated for all the hidden layers in the network.
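The first-layer computation described above can be sketched as follows; the dimensions and data are illustrative, and the variable names simply mirror the symbols in the text.

```python
import numpy as np

# Illustrative dimensions: N examples, M features, S hidden neurons.
N, M, S = 4, 3, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((N, M))

# Prepend a column of ones so the bias term is absorbed into the weights:
# X_b has size N x (M + 1).
X_b = np.hstack([np.ones((N, 1)), X])

# Weight matrix of size (M + 1) x S projecting inputs onto the hidden neurons.
Theta1 = rng.standard_normal((M + 1, S))

# Affine transformation for all examples at once; the transpose makes Z
# an S x N matrix with one column per training example.
Z = (X_b @ Theta1).T

# ReLU activation, applied element-wise.
A1 = np.maximum(0.0, Z)
```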
It has previously been shown that varying the number of hidden neurons across the layers gives no advantage in the performance of a dense network. Consequently, it is common practice to keep the number of hidden neurons constant across all the hidden layers. Then, all the matrices Θ^(i) for i ∈ [2, L], where L is the number of hidden layers in the network, are of size (S + 1) × S and all the activation matrices A^(i) for i ∈ [2, L] are of size S × N. The output layer of the network is composed of O output neurons. To calculate the output of the network, the activations from the last hidden layer are projected onto the output layer through an affine transformation with a weight matrix of the form:

$$\Theta^{(L+1)} = \begin{pmatrix} \theta^{(L+1)}_{01} & \cdots & \theta^{(L+1)}_{0O} \\ \vdots & \ddots & \vdots \\ \theta^{(L+1)}_{S1} & \cdots & \theta^{(L+1)}_{SO} \end{pmatrix}$$

of size (S + 1) × O, and the output matrix of the network (O) is calculated using a sigmoid activation function for the output units so that:

$$O = \sigma\left(\Theta^{(L+1)\top} A^{(L)}_b\right)$$

where the element-wise sigmoid function (σ) is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

With these definitions, the output matrix (O) has size O × N, where each column corresponds to the output of the network for one of the training inputs. Once all the examples have been propagated through the network, one epoch is said to have passed. At the end of each epoch, the error produced by the network is calculated and a learning step is taken, as explained in the following section.
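Putting the pieces together, the full forward pass (L hidden ReLU layers of S neurons each, followed by O sigmoid output units) can be sketched as below; all names and dimensions are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, thetas):
    """X: N x M inputs; thetas: weight matrices, each with a bias row."""
    # Start with one column per example and a bias row of ones on top.
    A = np.vstack([np.ones((1, X.shape[0])), X.T])          # (M + 1) x N
    for Theta in thetas[:-1]:                               # hidden layers
        Z = Theta.T @ A                                     # S x N
        A = np.vstack([np.ones((1, Z.shape[1])),
                       np.maximum(0.0, Z)])                 # (S + 1) x N
    return sigmoid(thetas[-1].T @ A)                        # O x N

rng = np.random.default_rng(1)
N, M, S, O, L = 6, 3, 7, 2, 2
thetas = [rng.standard_normal((M + 1, S))]                  # input -> hidden 1
thetas += [rng.standard_normal((S + 1, S)) for _ in range(L - 1)]
thetas += [rng.standard_normal((S + 1, O))]                 # hidden L -> output
out = forward(rng.standard_normal((N, M)), thetas)
```

Each column of `out` is the network's output for one example, matching the O × N layout described above.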

Cost function and gradient descent
In this work, the cost function used was the binary cross-entropy. This is a computationally efficient formulation of the cross-entropy maximum likelihood estimator in the particular case of binary classification. When using one-hot encoding, this function can be extended to multi-class classification by treating each output neuron as one binary classifier. Additionally, the function was regularized with weight decay, which prevents the network weights from becoming too large and, consequently, helps avoid overfitting the training data. This cost function, calculated at the end of each training epoch, is defined as follows:

$$J(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{O}\left[y_{ki}\log(o_{ki}) + (1 - y_{ki})\log(1 - o_{ki})\right] + \frac{\lambda}{2N}\left\lVert\Theta\right\rVert^{2}$$

where Θ is a vector containing all the weights used in the network, y_ki and o_ki are the desired and actual outputs of neuron k for example i, and λ is the weight-decay coefficient. At the end of each epoch, this cost function was calculated and its derivative with respect to each of the weights (∇_Θ J) was obtained using the back-propagation algorithm. The weights were then updated following the gradient descent rule:

$$\Theta \leftarrow \Theta - \varepsilon\,\nabla_{\Theta} J$$

where ε is the learning rate, and the updated weights were used in the next epoch. This procedure was repeated until the cost was low (J < 10^-4), the difference in the cost between epochs was small (ΔJ < 10^-10), or a limit of 50000 epochs was reached. The learning rate (ε) needed for the training of each network was obtained by analyzing the training error, as presented in Fig. 1, which shows the progression of the training error for ANN_1 with 5 hidden layers and 7 hidden neurons per layer. The four training procedures used the same set of randomly selected examples and started from the same randomly initialized weights. Figure 1(A) exemplifies how high learning rates (ε = 0.1 and ε = 0.01) resulted in an oscillating behavior that consistently missed the local minimum of the error, whereas a low value (ε = 0.001) converged too slowly. Hence, ε = 0.005 was the best option for training ANN_1 (see Fig. 1(B)). An equivalent approach was taken to select the learning rates required for ANN_21 (ε = 0.01) and ANN_22 (ε = 0.1).
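A minimal sketch of this training loop is given below. For brevity, the gradient is written for a single sigmoid layer (where it has a simple closed form) rather than via full back-propagation, and the data, learning rate and decay coefficient are illustrative; the stopping criteria mirror those stated above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X_b, y, lam):
    """Regularized binary cross-entropy (weight decay excludes the bias)."""
    o = np.clip(sigmoid(X_b @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    ce = -np.mean(y * np.log(o) + (1 - y) * np.log(1 - o))
    return ce + lam / (2 * len(y)) * np.sum(theta[1:] ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # a toy binary task
X_b = np.hstack([np.ones((100, 1)), X])
theta, eps, lam = np.zeros(3), 0.5, 1e-3

J_prev = np.inf
for epoch in range(50_000):                        # epoch limit, as in the text
    o = sigmoid(X_b @ theta)
    grad = X_b.T @ (o - y) / len(y)                # closed-form CE gradient
    grad[1:] += lam / len(y) * theta[1:]           # weight-decay gradient
    theta -= eps * grad                            # gradient-descent step
    J = cost(theta, X_b, y, lam)
    if J < 1e-4 or abs(J_prev - J) < 1e-10:        # stopping criteria
        break
    J_prev = J
```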

10-fold cross-validation
Using all the examples in the database for training would lead to overfitting. Moreover, a way of estimating the generalization error of the models is required. Consequently, each neural network reported in this paper was trained using 10-fold cross-validation, as follows.
First, all the examples were divided into a training set (X_train) and an evaluation set (X_eval), containing 75% and 25% of the data, respectively. The examples in each subset were selected randomly, using a uniform distribution, in a way that preserved the class balance. The evaluation set is not used during training, so the performance of a model on this set can be regarded as an estimate of its generalization capability.
Second, the training set was divided into ten subsets of equal size (i.e. X^(1)_train, X^(2)_train, ..., X^(10)_train). As before, these were selected randomly following a uniform distribution. The deep learning model was then trained, as explained before, using the combination of sets X^(2)_train to X^(10)_train and, once trained, its F1-score was calculated on X^(1)_train. The training was then repeated leaving X^(2)_train out for testing, then X^(3)_train, and so on, until every set had been used for testing. Finally, the model that produced the highest F1-score was selected as the best model and applied to X_eval.
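The splitting scheme above can be sketched as follows. The classifier here is a trivial stand-in (it predicts the majority class of its training folds) so that the fold bookkeeping stays in focus; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4))
y = (rng.random(200) > 0.5).astype(int)

# 75% training / 25% evaluation, selected uniformly at random.
idx = rng.permutation(len(y))
train_idx, eval_idx = idx[:150], idx[150:]

# Divide the training set into ten folds of equal size.
folds = np.array_split(train_idx, 10)

def f1_score(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best_f1, best_model = -1.0, None
for k in range(10):
    test_fold = folds[k]
    train_folds = np.concatenate([folds[j] for j in range(10) if j != k])
    majority = int(np.mean(y[train_folds]) >= 0.5)   # "train" the stand-in
    f1 = f1_score(y[test_fold], np.full(len(test_fold), majority))
    if f1 > best_f1:
        best_f1, best_model = f1, majority           # keep the best fold's model

# The selected model is finally applied to the untouched evaluation set.
eval_pred = np.full(len(eval_idx), best_model)
```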
The sensitivity, positive predictive value and F1-score reported in this work are those obtained when applying the model that performed best during cross-validation to the evaluation set. It should be noted that dividing the database into a training and an evaluation set enables the estimation of the generalization error of the trained model, because the model is tested on data that was not used during training. Furthermore, training with cross-validation avoids many of the common issues found in machine learning, such as overfitting and bias, and helps minimize the effect of outliers in the data.

Baseline performance
From the results shown in the main text, it should be evident that this classification task is not trivial. Indeed, the overlap in the values of the features measured from the pECGs implies that ischemia can manifest itself in different pECG morphologies. However, in order to provide a baseline performance for comparison, three logistic regression classifiers were trained. The output of a logistic regression classifier is:

$$h_{\theta}(x) = \sigma\!\left(\theta^{\top} x\right) = \frac{1}{1 + e^{-\theta^{\top} x}}$$

and the weights (θ_i) can be trained following the same learning procedure and cross-validation explained before. Three logistic classifiers were trained, using the magnitudes of the pECG features as inputs: LR_1 classified between Control and any other signal, LR_2 between Mild and any other signal, and LR_3 between Severe and any other signal. It should be noted that, in this learning task, there is a slight imbalance between the classes; however, this is not expected to have a significant impact on the performance of the models. The evaluation metrics obtained after training these models are presented in Table 1.
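The one-vs-rest scheme can be sketched as below: each of the three classifiers separates one class (Control, Mild, Severe) from the other two, and the class with the largest response is taken as the prediction. The weights here are random placeholders standing in for trained values, and the feature count is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
M = 6                                            # number of pECG features (illustrative)
x_b = np.hstack([1.0, rng.standard_normal(M)])   # one feature vector with bias term

classes = ["Control", "Mild", "Severe"]
# Placeholder weights for LR_1, LR_2 and LR_3 (one classifier per class).
thetas = {c: rng.standard_normal(M + 1) for c in classes}

# Each classifier outputs sigmoid(theta^T x_b); pick the strongest response.
scores = {c: sigmoid(thetas[c] @ x_b) for c in classes}
prediction = max(scores, key=scores.get)
```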

Deep learning performance
The optimal results presented in the main text were obtained through a grid-search approach. For each of the models proposed (i.e. ANN_1, ANN_2 and ANN_3), the number of hidden layers (L) and hidden neurons per layer (S) were increased incrementally, starting at 1, until an optimal performance was obtained. The results of most of the experiments are presented in Tables 2, 3 and 4, for ANN_1, ANN_2 and ANN_3, respectively. For succinctness, the network topologies that were incapable of learning, or that performed worse than the logistic regression classifiers, are not shown in the tables. The optimal topology was considered found once no significant improvement was observed on increasing the complexity of the network, or when a decrease in performance was observed. The tables highlight in bold the topology that showed the best performance and, consequently, was reported in the main text. Finally, to illustrate the cross-validation training, testing and evaluation process, Table 5 shows the performance of the best models (highlighted in the previous tables) in their best cross-validation fold. The table shows the sensitivity, positive predictive value and F1-score obtained at the end of training, together with the previously reported evaluation performance.
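The grid search over (L, S) can be sketched as below. Here `evaluate_topology` is a hypothetical stand-in for the full train/cross-validate/score pipeline, with a toy surrogate score that peaks at an arbitrary topology; the search simply keeps the topology whose F1-score shows a meaningful improvement over the best seen so far.

```python
# Hypothetical stand-in for training a network with L hidden layers and
# S neurons per layer and returning its cross-validated F1-score.
def evaluate_topology(L, S):
    # Toy surrogate: best at L = 3, S = 5, degrading with extra complexity.
    return 1.0 - 0.02 * abs(L - 3) - 0.01 * abs(S - 5)

best = (1, 1, evaluate_topology(1, 1))
for L in range(1, 8):                       # hidden layers, starting at 1
    for S in range(1, 12):                  # neurons per layer, starting at 1
        f1 = evaluate_topology(L, S)
        if f1 > best[2] + 1e-6:             # keep only meaningful improvements
            best = (L, S, f1)
best_L, best_S, best_f1 = best
```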