Dynamic classification of fetal heart rates by hierarchical Dirichlet process mixture models

In this paper, we propose an application of non-parametric Bayesian (NPB) models for classification of fetal heart rate (FHR) recordings. More specifically, we propose models that are used to differentiate between FHR recordings that are from fetuses with or without adverse outcomes. In our work, we rely on models based on hierarchical Dirichlet processes (HDP) and the Chinese restaurant process with finite capacity (CRFC). Two mixture models were inferred from real recordings, one that represents healthy and another, non-healthy fetuses. The models were then used to classify new recordings and provide the probability of the fetus being healthy. First, we compared the classification performance of the HDP models with that of support vector machines on real data and concluded that the HDP models achieved better performance. Then we demonstrated the use of mixture models based on CRFC for dynamic classification of the performance of (FHR) recordings in a real-time setting.


Introduction
Fetal heart rate (FHR), along with other physiological signals, is routinely monitored before and during labor to assess fetal health. The first fetal monitor became commercially available in 1968 [1], and ever since, electronic fetal monitoring (EFM) has been widely used in hospitals in the U.S. Its rate of use in obstetric practice has climbed from 68.4% in 1989 to 85.2% in 2002 [2].
Nowadays, the evaluation of FHR signals is primarily performed visually by experienced physicians, following guidelines published by various medical institutions including the National Institute of Child Health and Human Development (NICHD) [3] and the International Federation of Gynecology and Obstetrics (FIGO) [4]. These guidelines define different patterns of FHR, such as baseline, variability, acceleration and deceleration. Based on the combination of appearances of certain patterns, the FHR tracings are classified into three classes, "normal", "indeterminate" and "abnormal." Notwithstanding the long presence of FHR tracings in obstetrics, their use for assessing the well-being of fetuses has constantly been questioned. For  been reported that the subjective assessment of FHR tracings exhibits large inter-and intravariability [5]. The study has also shown that the sensitivity of clinicians' majority vote to objective outcomes was only 39%. This and similar findings suggest that the high false positive rates have led to increase in the rate of cesarean section deliveries [6], which altogether have put the benefits of using EFM under criticism [7]. The deficiencies of subjective assessment of FHRs raise the need for modern and computerized methods for their processing. Such methods would be able to provide objective and consistent evaluation and to capture hidden dynamics in FHR signals, which are often too challenging for human's eyes' inspection. Furthermore, machine learning techniques have been proved to be extremely successful in real-world applications in various fields in recent years. The advances in machine learning have been also reflected in research on FHR classification. This research has produced a number of newly proposed computerized methods.
In one approach, in addition to morphological features proposed in the guidelines [3], for quantifying the underlying patterns of FHR, advanced features extraction algorithms were applied. In [8], the authors worked with several linear and nonlinear features. The former included short-term and long-term variabilities whereas the latter were features related to power spectra and entropy. In [9], a more comprehensive collection of non-linear features were used to model the non-linearity of FHR signals. These features included fractal dimension, approximate entropy, sample entropy, and the Lempel Ziv complexity. The classification of the features was carried out by a support vector machine (SVM) algorithm.
SVMs have not been in the only machine learning methodology for classification of FHR signals. In [10], artificial neural networks were employed as the classifier with 6 FHR features and 6 clinical variables as inputs. Two types of generative models, naïve Bayes and hidden Markov models, were implemented in [11], which were novel attempts because the majority of the methods in the literature were based on discriminative models. In [12], the authors explored the performance of linear regression and SVMs with different kernels. This work also included feature selection and the use of reduction methods such as random forest and principle component analysis (PCA).
The search for better solutions based on machine learning algorithms with more flexibility and robustness as well as better overall performance has continued. Hierarchical Dirichlet process (HDP) mixture models [13], for instance, free the classic mixture models from fixing the number of mixing components, and allow for modeling of grouped data jointly. These models exhibit excellent performance in areas such as information retrieval and topic modeling [14].
In this paper, for classifying FHR signals, we propose two novel approaches based on HDP. We describe the underlying principles of the approaches and show on real-world data that they have very good performance in terms of accuracy and probabilistic interpretations of the results.
The paper is organized as follows. First, we provide some background on non-parametric Bayesian statistical models used in the paper, and we discuss their advantages over traditional parametric models (Hierarchical Dirichlet Process Mixture Models). Then, in the subsequent section, Features for Classification, we describe the set of features we used in our experiments. The details of our experimental setting, including the database, the pre-processing of the data, the dimensionality reduction of the feature space, and the performance assessment are explained in the section Experimental Settings. Results obtained by our approach and comparisons with an existing method are provided in the section Results. In the last section, we discuss the results and make final conclusions.
The main contribution of this paper lies in the novelty in applying hierarchical Dirichlet process-based mixture models to FHR classification. In our previous research, we explored a set of suitable features for our models and obtained several preliminary results with HDP mixture models [15,16]. Here, we continue to explore the potential of the models and present classification results with probabilistic interpretations. An important extension of our work is the use of the time-varying models from [17] to achieve dynamic real-time classification.

Hierarchical Dirichlet process mixture models
In this section, we describe the two mixture models that we implemented in our experiments, the well-known HDP mixture model and our time-varying modification of it. We explain how one can generate data by using the models and then how one can conduct inference about the model unknowns from data that are generated by the model.

Notation
In the problems of our interest, the observations are organized into groups. We adopt the notation from [13], where x j,i denotes the ith observation in the jth group. We consider that each observation is drawn independently from a mixture model. In the context of FHR problems, each group x j = (x j,1 , x j,2 , . . .) corresponds to features of one FHR recording, and each observation x j,i to features of one segment of the recording.

Models
We start with the HDP mixture models proposed in [13] and then describe their modification proposed in [17] to accommodate for time-evolving statistics of the data.
Hierarchical Dirichlet process mixture models. A hierarchical Dirichlet process defines a set of random probability measures G j linked to a global random probability measure G 0 . Specifically, G 0 is distributed as a Dirichlet process (DP) with concentration parameter γ and base probability measure H, i.e., The random measures G j 's are conditionally independent given G 0 , and distributed according to where α is also a concentration parameter. We explain this model and its extension to mixture models by way of the Chinese restaurant franchise (CRF) metaphor. Suppose that there is a restaurant franchise with a shared menu across the restaurants. With x j,i we denote the ith customer in the jth restaurant, and with θ j,i , the dish type served to this customer. In this setup, the customers correspond to the observations x j,i , a restaurant corresponds to an FHR recording x j and the dish type to a parameter set of a distribution used for drawing the observations. The index z j,i is the index of such parameter set and associated with the observation x j,i . Next, we introduce K iid random variables ϕ 1 , . . ., ϕ K , which represent global dishes and which are distributed according to H. Each customer x j,i is seated at a table, denoted by t j,i , and each table is paired with one dish ϕ k . Furthermore, let ψ j,t represent the dish served on table t in restaurant j, and k j,t be the indicator of the dish served on table t in restaurant j. For example, t 3,4 = 6 means that customer 4 in restaurant 3 sits at table 6, ψ 3,6 = ϕ k 3,6 signifies that on table 6 in restaurant 3 dish ϕ k 3,6 is served, where k 3,6 2 {1, 2, Á Á Á, K}. With this notation, we have that θ 3,4 = ϕ z 3,4 , where z 3,4 2 {1, 2, Á Á Á, K}. Note the difference between k j,t and z j,i . The former is the index of the dish served in restaurant j on table t, and the latter, the index of the dish served to customer i in restaurant j. We also need a notation for counts. With n j,t,k we denote the number of customers in restaurant j at table t serving dish k, and with m j,k , the number of tables in restaurant j serving dish k. We represent marginal counts by dots. For example, n j,Á,k represents the number of customers in restaurant j eating dish k; m j,Á represents the number of tables in restaurant j, and so on. At each table in each restaurant, one dish from the menu is ordered by the first customer at that table and shared by the remaining customers sitting at the same table.
A customer entering a restaurant can either choose an occupied table according to a probability proportional to the number of customers already seated at the table, or get a new table with a probability determined by the concentration parameter α. Specifically, in restaurant j, the ith customer chooses a dish (and thereby a table) according to where δ ψ j,t is probability measure concentrated at ψ j,t . If a customer chooses an existing table, say t, then we increment n j,t by one, and set θ j,i = ψ j,t , t j,i = t, and z j,i = k j,t . If a new table is chosen, then we increment m j,Á by one, draw the dish for that table ψ j,m j,Á +1 * G 0 and set θ ji = ψ j,m j,Á +1 , t j,i = m j,Á + 1, and z j,i = k j,m j,Á +1 , where k j,m j,Á +1 is the index of the drawn dish from G 0 . Now let us consider the dish-level distributions. Similarly, a table can be served with an existing dish with probability proportional to the number of tables already serving the dish in the whole franchise, or with a new dish with probability determined by the concentration parameter γ. To be specific, the probability distribution of table t in restaurant j serving a particular dish is given by If an existing dish is served, i.e., k j,t 2 {1, 2, Á Á Á, K}, we increment the count of that dish, m Á,k j,t , by one, and set ψ j,t = ϕ k j,t . If we choose a new dish, then we increment K by one. We also draw the new dish by ϕ K+1 * H, and set k j,t = K + 1. This completes the description of the CRF metaphor. We summarize the variables, their meanings and how they relate to our problem in Table 1. We reiterate that the dishes are shared among the restaurants, which corresponds to a key property of the HDP. The HDP mixture model is a non-parametric Bayesian approach to data processing. It aims at modeling grouped data jointly, where each group (segment features of an FHR recording) is associated with a mixture model, and all the mixing components are shared across the groups (different FHR recordings share features). We assume that each dish type ϕ k defines a mixing component that is used for generating actual dishes (features). We denote the generating distribution of the features by F(ϕ k ). In summary, each observed feature x j,i (the features of segment i of recording j) is generated by where ϕ z j,i is the parameter of the feature distribution, and z j,i is the index that defines the parameter. By setting the F's to be Gaussian distributions, we obtain a Gaussian mixture model with HDP as the prior. Chinese restaurant franchise with finite capacity. Now we consider a modified version of the CRF that was proposed in [18]. Assume that each restaurant has a limited capacity of accommodating customers, and without loss of generality, we assume that it is N for all the restaurants. Before the number of customers reaches that limit, the process is the same as in the CRF metaphor. After a restaurant is "full," a new customer can come in and be seated only after the "oldest" customer leaves the restaurant. Then, the ith customer in the jth restaurant, where i > N, chooses a dish by where the Ã notation represents the changes after the oldest customer (the (i − N)th of restaurant j) leaves. Similarly, we update the table counts after the table is chosen. In addition, the probability that table t in restaurant j serves a particular dish type is c j;t jc 1;1 ; c 1;2 ; . . . ; c 2;1 ; . . . ; c j;tÀ 1 ; g; H $ After the dish is selected, the dish counts are updated accordingly. We call this new process "Chinese restaurant franchise with finite capacity" (CRFC). The CRFC is designed to model grouped time-varying data, and capture the underlying dynamics. Simulation results on how the CRFC mixture model finds the cluster assignments of data over time can be found in [17].

Inference
We describe a Markov chain Monte Carlo (MCMC) sampling scheme for estimating the parameters of the HDP and CRFC mixture models. This is a Gibbs sampling scheme based on the CRF [13]. To simplify the inference, the base distribution H is assumed to be conjugate to the data distribution F. For the non-conjugate case, the sampling approach can be adapted from techniques developed for non-conjugate DP mixtures [19]. In addition, here we assume known values for the concentration parameters α and γ. When they are unknown, we describe a sampling scheme for them in a later section. In the sequel, the notation x −ij signifies x = (x j 0 i 0 : all j 0 i 0 except j, i), i.e., x −j,i = x\x j,i . Similarly, t −i,j = t\t j,i and k −j,t = k\k j,t . To make the sampling more efficient, instead of directly dealing with the x j,i s and z j,t s, we sample their indicator variables t j,i and k j,t . We first describe the sampling of t and then the sampling of k.
Sampling t. The prior probability of t j,i taking an occupied table is proportional to n À j;i j;t;Á according to Eq (3), where, as before, the notation −j,i means the corresponding variable is removed from a set or a count. And the prior probability of t j,i taking a new value is proportional to α. More specifically, t j,i is sampled from if t is previously used where f À x j;i k j;t ðx j;i Þ represents the likelihood of sample x j,i belonging to an existing mixture component k j,t given all the other data, and is given by If a new table is chosen, i.e., t j,i = t new , we need to draw a dish k j,t new for t new , and the probability is is the prior density of x j,i . Therefore, the likelihood of a customer choosing a new table is During sampling, some n jt. may become zero, i.e., the corresponding table t may become unoccupied. Then we need to update the corresponding dish count m Á,k , which may result in deleting some mixture component if m Á,k = 0.
Sampling k. The likelihood of setting k j,t = k is given by f À x j;t k ðx j;t Þ, where x j,t represents all the x j,i s such that t j,i = t, and f À x j;t k ðx j;t Þ is the conditional density of x j,t given all the data related to component k without x j,t . For the conditional probability of k j,t , we can write The inference of CRFC mixture models can easily be obtained by changing the prior probabilities of the indicator variables t j,i and k j,t .

Features for classification
Here we present the complete list of features of FHR traces that we used for classification. As mentioned before, feature extraction has attracted much attention in the field of FHR analysis. The features can roughly be divided into three categories: time domain, frequency domain, and non-linear features. Time-domain features measure the variability of FHR signals in various forms, whereas the frequency domain features usually describe the powers in different frequency bands. Non-linear features quantify the non-linearity of FHR, e.g., with entropy and fractal dimension.
In our experiments, we divided the FHR series into non-overlapping segments, with length ranging from 40 to 120 samples. Then, from each segment we extracted one feature vector. This vector did not contain nonlinear features. Instead, it had 9 features from the time domain, and five from the frequency domain. The reason for not including non-linear features is that their reliable estimation usually requires much longer segments [9]. For example, the approximate entropy is applicable when the data series are longer than 100 samples [20].
In summary, we used only linear features from the time and frequency domains that are known from the literature on fetal heart rate processing. In the classification, we used 14 features, which are described in the next two subsections. However, the classifier operated in a feature space with reduced dimension and obtained via principle component analysis (PCA), as explained in the next section.

Time-domain features
They include the mean and the standard deviation of the segment s ji . In addition, we also use the short-term variability (STV) and long-term variability (LTV), which are defined in [8] as where s(k), k = 1, . . ., K represents one segment of FHR series, K is the number of samples in each segment and M is the number of minutes of the segment. STV and LTV essentially quantify the changes of FHR series in different forms. On the feature list, we also have the short-term irregularity (STI) and long-term irregularity (LTI) from [21]  where IQR stands for inter-quartile range with k = 1, . . ., K. In essence, STI and LTI describe the variability of FHR series too. The other features are the standard descriptors of the Poincaré plot, SD1 and SD2, as well as the complex correlation measure (CCM) proposed in [22], which are defined by where γ RR (0) and γ RR (1) are the autocorrelation functions for lags 0 and 1 of the RR intervals, and RRbeing the mean of the RR intervals. The RR intervals are another representation of FHR, which stands for beat-to-beat interval and that can be obtained by CCM, on the other hand, is a function of several lags of the autocorrelation functions of the RR intervals, or more specifically, where C n is a normalizing constant, defined as C n = π × SD1 × SD2, and m is an integer. In our experiments, we set m = 1. These features are different types of descriptors of FHR variability.

Frequency-domain features
These features represent powers in four frequency bands: very low frequency (VLF: 0-0.06 Hz), low frequency (LF: 0.06-0.3 Hz), medium frequency (MF: 0.3-1 Hz) and high frequency (HF: 1-2 Hz). In addition, they also include the ratio of powers of two bands LF/(MF+HF). The frequency-domain features represent the underlying physiological activity of either the mother or the fetus. It is worth noting that there is no consensus on how to define the frequency bands. In our experiments, we used the ranges from [23]. The complete list of features is shown in Table 2.

Experimental settings
In this section, we describe in detail our experiments of classifying FHR signals using nonparametric Bayesian models.

Database
In our work, we used the open-access cardiotocography (CTG) database collected from the Czech Technical University (CTU) and University Hospital in Brno (UHB) [24]. This database contains 552 CTG recordings, each comprising an FHR time tracing and a uterine contraction (UC) signal, both sampled at 4 Hz. All recordings start at a maximum of 90 minutes before delivery. Fetal outcome data, which include measurements of umbilical artery blood samples and Apgar scores evaluated at 1 and 5 minutes after delivery, are available for assessment purposes. Additional fetal and delivery information, such as sex, weight, type of delivery, are also collected. More details on the data collection can be found in [25].

Pre-processing and segmentation
The acquisition of FHR signals suffers from different kinds of artifacts, which are generally caused by maternal and fetal movements or displacements of the transducer used in the acquisition. There are two types of artifacts, either the measured samples are incorrect or they are simply missing (the values are equal to 0). Therefore, the FHR signals have to be pre-processed before they are used for analysis. In practice, any successive samples with differences greater than 25 bpm are considered as artifacts. All artifacts, including missing data with duration less than 15 seconds, are interpolated by piecewise cubic Hermite polynomial method. If the duration is longer than 15 seconds, they are simply discarded. Fig 1 shows an example of an FHR series before and after pre-processing. Out of 552 FHR recordings, we selected a balanced dataset with the same number of recordings and labeled as healthy and unhealthy. The labels were defined by the following criteria: an FHR recording is healthy if its associated umbilical cord pH value is greater than a threshold τ 0 , and it is labeled as unhealthy if the pH value is less than or equal to τ 1 . There is no consensus on the exact values of the thresholds, so we experimented with τ 0 = 7.2 and both τ 1 = 7.05 as in [9] and τ 1 = 7.1 as in [10]. The number of recordings N in the selected dataset ended up with 88 and 122 respectively.
In our experiments, the last M-minute data of the FHR recordings were analyzed. Each recording was divided into non-overlapping segments of l seconds, where l ranged from 10 to 30 seconds. Thus, the number of segments in each series was m, where m = M × 60/l. For each segment s ji , which is the i-th segment in the j-th recording, a feature vector x ji of dimension d was extracted.

Dimensionality reduction
As described in Section 1, in our experiments, each feature vector has 14 dimensions. High dimensionality is usually difficult to deal with, specifically in terms of issues such as computational costs and convergence in Gibbs sampling. Hence, before training the models, we reduced the dimension of the feature space from 14 to q by way of principle component analysis (PCA) [26]. Since PCA is sensitive to the scales of different dimensions of input data and the ranges of feature values in each dimension can vary largely, we scaled these values into the interval (−1, 1) before applying PCA. After scaling, we computed the variance ratio of each component. An example of PCA results of all the data when the number of recordings N = 88 and the segment length l = 10 is shown in Fig 2. The gray bars are the explained variance ratios of each principle component, and the blue line represents the cumulative variance ratio.
According to the preliminary analysis of all the data, we concluded that most of the variance lies in the first 4 principle components. Therefore, we experimented with different choices of q = 2, 3, and 4. Note that in each iteration of cross-validation, only the training data were used to obtain the linear transformation matrix, and the testing data were transformed accordingly.

Model priors
The HDP and CRFC mixture model both have two concentration parameters, γ and α, as described in Section 1. Instead of assigning fixed values to them, we implemented an auxiliary sampler provided in [13] to infer them. In our experiments, the concentration parameters were given gamma priors, γ * gamma(1, 1) and α * gamma(10, 1). Therefore, in our experiments, we needed to choose only two variables: the segment length l and the feature dimension q after PCA.

Classification process
The process of using HDP based models to classify FHR tracings is as follows. The last 30-minute data were used in the classification tasks. During the training stage, two HDP Gaussian mixture models (HDPGMs), M 0 and M 1 , were constructed from the FHR recordings and labeled as healthy and unhealthy, respectively. For estimation of the models' parameters, we implemented the collapsed Gibbs sampler (proposed in [13]). During the testing stage, given a new FHR tracing x j , the classification is made by comparing the likelihoods L 0 and L 1 , which Dynamic classification of fetal heart rates are defined by log f ðx ji jM 0 Þ; If L 0 > L 1 , the FHR series is classified as healthy and vice versa. Note that here we assume that the priors of the fetuses were equal. In using the CRFC Gaussian mixture models, first we set a window length M win equal to 30 minutes. Essentially, this is equivalent to the restaurant capacity in the CRF metaphor. We analyzed the last 45 minutes of FHR recordings. Two models, M 0 and M 1 were initiated from the first 30-minute data (i.e., the last 45 to 15 minutes from the original FHR series) from the respective groups. At each time instance, we moved the window by one segment, and trained the models by adding new data and removing the oldest data. The likelihoods of being healthy and unhealthy, L 0 and L 1 , were computed similarly to (22).
We define the probabilities of FHR series corresponding to healthy or unhealthy fetuses, denoted as p 0 and p 1 , by where m is the number of segments in each FHR series. We call this method the "naïve approach". A modified version of the probabilities is defined as follows.
and w i 's are weights defined as where u i is the percentage of data that are not interpolated in the i-th segment, which is a measure of signal quality. We call this method the "weighted approach".

Performance assessment
We assessed the classification performance of the models with the standard metrics, true positive rate (TPR) and true negative rate (TNR). We also used the weighted relative accuracy (WRA) [27], which is defined by WRA = 4 × cost × (TPR − FPR)/(1 + cost) 2 , where FPR represents false positive rate. In this study, we assigned the cost to 1.
To fully utilize the dataset and avoid the bias caused by randomly selecting training/testing data, we used the 5-fold cross-validation (CV) method for performance assessment. At each iteration, 80% of the data were used for training and the rest for testing. The outcome metrics were averaged across all iterations and the mean values were reported.

Results
In this section, we first provide the classification performance of HDPGMs and the comparison with that of SVMs, which achieved the best performance in studies [9,12]. Then we show the real-time classification of FHR tracings by models based on CRFC.

Performance by HDPGMs
As described in Section 1, we experimented with two different thresholds τ 1 that delineate the non-healthy group of fetuses. By setting τ 1 = 7.05, the number of recordings with total length exceeding 30 minutes N is 88, and for τ 1 = 7.1, N equals 122. After segmentation, feature extraction and PCA, the dataset was transformed to N groups of data, each group containing m observations of dimension q. We experimented with different choices of segment length l and dimension q. The results, with the best performance highlighted in bold font, are provided in Table 3,.
The same datasets were used to test the SVM-based method. The classification process was as follows: instead of segmentation, the 14 features were extracted from the whole FHR series of the last 30 minutes. The feature vectors were scaled to the range (−1, 1), and then used as input to the SVMs classifier. The SVMs classification algorithms had two free parameters: cost C and γ. We searched for the optimal combination of these parameters in terms of the testing performance metric, WRA. Five-fold CV method was used to eliminate biases. The results obtained by SVMs are shown in Table 4. Dynamic classification of fetal heart rates By comparing the results in Tables 3 and 4, we conclude that in both cases of τ 1 , the proposed method outperformed the SVM-based method.

Real-time classification
In this experiment, we set the threshold τ 1 = 7.05. The number of FHR recordings of total lengths greater than 45 minutes is 70. We randomly chose 60 recordings for training the CRFC Gaussian mixture models, and the rest for testing. Due to lack of labels of FHR recordings at each time instant, we assumed that the training series stayed in the same group for the whole duration. At each time instant, we computed the probability of the FHR series being associated with a healthy fetus. We used both, the naïve and the weighted methods.   From the results, we can observe small differences between the two approaches. The probabilities obtained by different experimental settings are not identical but agree with each other in terms of the overall trend.

Conclusion
In this paper, we implemented the hierarchical Dirichlet process mixture model and its variation in classifying fetal heart rate tracings. In our method, we employed 14 features that have been used in the literature before. We showed that our method outperformed the state-of-theart algorithm in terms of weighted relative accuracy when using the same feature set. Furthermore, we demonstrated how our method can be adapted to online learning of data and computing the probability of a fetus being healthy in real-time.
The merits of non-parametric Bayesian models, as shown in our experiments, are being free from parameter-tuning and model selection. In addition, the experiment results suggested that our methods were able to accurately model the FHR data. On the other hand, the Chinese restaurant franchise with finite capacity models are able to process data of a fixed length sequentially. Therefore, if applied in real-world scenarios, the CRFC model can evaluate the FHR data with time and provide the physicians with real-time estimates of the fetal status. However, in our experiments, since the true online fetal health information was unavailable, we were unable to validate how our method performed.