Extraction of features from sleep EEG for Bayesian assessment of brain development

Brain development can be evaluated by experts analysing age-related patterns in sleep electroencephalograms (EEG). Natural variations in the patterns, noise, and artefacts affect the evaluation accuracy as well as experts' agreement. Knowledge of the posterior predictive distribution allows experts to estimate the confidence intervals within which decisions are distributed. The Bayesian approach to probabilistic inference has provided accurate estimates of such intervals of interest. In this paper we propose a new feature extraction technique for Bayesian assessment and estimation of the predictive distribution in the case of newborn brain development assessment. The new EEG features are verified within the Bayesian framework on a large EEG data set including 1,100 recordings made from newborns in 10 age groups. The proposed features are highly correlated with brain maturation, and their use increases the assessment accuracy.

The posterior predictive distribution of the class y for an input x is

p(y|x, D) = ∫ p(y|x, Θ) p(Θ|D) dΘ,

where y ∈ {1, . . . , C} is the predicted class, x = (x_1, . . . , x_m) is the m-dimensional input vector, Θ is the vector of model parameters, and D = {(x^(i), y^(i))}_{i=1}^n is the set of n labelled (or training) data points.
This integral is analytically tractable only in simple cases, and the posterior density of Θ conditioned on D, p(Θ|D), can likewise be evaluated only in very simple cases. Using MCMC integration, we can draw samples Θ^(1), . . . , Θ^(N) from the posterior distribution of Θ|D, and then write

p(y|x, D) ≈ (1/N) Σ_{i=1}^N p(y|x, Θ^(i)),

where N is the number of samples taken from the posterior distribution. This approximation is achieved when we can generate random samples from Θ|D by running a Markov chain whose stationary distribution is the posterior. Starting from an arbitrary state, the Markov chain has to converge to this stationary distribution during the burn-in phase. Then, during the post burn-in phase, samples from the Markov chain are collected to estimate the desired posterior density.
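The averaging step above can be sketched in a few lines. This is a minimal illustration, not the paper's decision-tree sampler: it assumes we already hold N post burn-in posterior samples Θ^(i) and a likelihood function p(y|x, Θ); the toy model at the end, in which Θ is itself a class-probability vector, is ours.

```python
# Approximate the predictive distribution p(y|x, D) by averaging
# p(y|x, Theta) over N post burn-in posterior samples Theta^(1..N).

def predictive(x, posterior_samples, likelihood, n_classes):
    """p(y|x, D) ~= (1/N) * sum_i p(y|x, Theta^(i))."""
    n = len(posterior_samples)
    probs = [0.0] * n_classes
    for theta in posterior_samples:
        for y in range(n_classes):
            probs[y] += likelihood(y, x, theta) / n
    return probs

# Toy example: Theta is a class-probability vector, p(y|x, Theta) = Theta[y].
samples = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]
p = predictive(None, samples, lambda y, x, th: th[y], 2)
# p is the average of the three sampled class-probability vectors
```

In the paper's setting the likelihood term would be the class probability ϕ assigned by the terminal node of a sampled decision tree, but the averaging itself is unchanged.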
MCMC integration over models in which the dimension of Θ varies is achieved by using the Reversible Jump extension described in [1].
In order to use DT models for Bayesian averaging, first we need to determine the probability ϕ_ij with which the ith terminal node, i = 1, . . . , k, assigns an input x to the jth class, where k is the number of terminal nodes in the DT.
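One common way to set ϕ_ij, sketched below under our own assumptions (the text does not pin the choice down at this point), is the smoothed fraction of training points of class j among the points falling into terminal node i:

```python
# Hedged sketch: phi_ij as Laplace-smoothed class frequencies of the
# training points that fall into terminal node i. The smoothing constant
# alpha and the function name are illustrative, not from the paper.

def terminal_node_probs(labels_in_node, n_classes, alpha=1.0):
    """phi_ij for one terminal node i: smoothed class frequencies."""
    counts = [0] * n_classes
    for y in labels_in_node:
        counts[y] += 1
    total = len(labels_in_node) + alpha * n_classes
    return [(c + alpha) / total for c in counts]

phi = terminal_node_probs([0, 0, 1, 0], n_classes=2)
# with alpha = 1: [(3 + 1)/6, (1 + 1)/6]
```

Smoothing keeps ϕ_ij strictly positive, so the likelihood of a tree never vanishes on a node that happens to contain no points of some class.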
Then we can define the DT parameters as a vector Θ = (s^pos_i, s^var_i, s^rule_i), i = 1, . . . , k − 1. Here, s^pos_i denotes the position of the ith node in the DT, s^var_i defines the variable used at the node for splitting, and s^rule_i specifies the threshold value of this variable.
The above notation allows us to define the following priors on DT models. Given the minimal number of data points, p_min, allowed in splitting nodes, first we expect a maximal number of splitting nodes s_max = n/p_min, where n is the number of data points in D. Then we can assign the position of node i to be drawn from a uniform discrete distribution, s^pos_i ∼ U(1, s_max). Similarly, we can assign any of the m variables drawn from a uniform discrete distribution, s^var_i ∼ U(1, m). Finally, we can assign the threshold value to be drawn from a uniform discrete distribution, s^rule_i ∼ U(z^(1), . . . , z^(N_i)), where N_i = N(s^var_i) is the number of possible splitting rules for predictor z = s^var_i.

The above priors allow us to generate DT models of different configurations which split the data in many ways. Ideally, each DT with the same number of terminal nodes can be explored equally likely [2]. For this case the prior for a DT with k terminal nodes is written as

p(Θ|k) = (1/S_k) ∏_{i=1}^{k−1} [m N(s^var_i)]^(−1),

where S_k is the number of ways of constructing a DT with k terminal nodes (defined below). In some cases there is knowledge of a favoured DT configuration. For example, the prior probability of further splits of a terminal node can be made dependent on how many splits have already been made above it, see e.g. [3]. The use of such information allows the MH algorithm to explore the favoured DT models in more detail.
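Drawing one splitting node's parameters from these uniform priors can be sketched as follows; the function name and the 0-based variable index are our own conventions, and `data` is simply an n × m list of input vectors:

```python
import random

# Draw (s_pos, s_var, s_rule) for one splitting node from the uniform
# priors described above: position ~ U(1, s_max), variable ~ U over the
# m predictors, threshold ~ U over that predictor's observed values.

def draw_split(data, p_min, rng=random):
    n, m = len(data), len(data[0])
    s_max = n // p_min                      # maximal number of splitting nodes
    s_pos = rng.randint(1, s_max)           # s_pos_i ~ U(1, s_max)
    s_var = rng.randint(0, m - 1)           # s_var_i ~ U(1, m), 0-based here
    values = sorted({row[s_var] for row in data})
    s_rule = rng.choice(values)             # s_rule_i ~ U(z^(1), ..., z^(N_i))
    return s_pos, s_var, s_rule
```

Restricting the threshold to values actually observed in the data keeps the number of distinct splitting rules finite, which is what makes N(s^var_i) well defined.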
In practice, DT models are typically grown to a large size. The MCMC integration over large DT models of variable dimensionality is performed with the following types of moves, see e.g. [1, 3, 2]:

1. Birth. Randomly split the data points falling in one of the terminal nodes by a new splitting node with the variable and rule drawn from the given priors.

2. Death. Randomly pick a splitting node with two terminal nodes and assign it to be a terminal one with the united data points.

3. Change-split. Randomly pick a splitting node and assign it a new splitting variable and rule drawn from the given priors.

4. Change-rule. Randomly pick a splitting node and assign it a new rule drawn from the given prior.
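Selecting among the four move types at each MCMC step can be sketched as below. The move probabilities (0.25 each, with birth disabled at the maximal size and death disabled at the single-node tree) are our illustrative choice; the paper does not fix them here.

```python
import random

# Illustrative move dispatcher: b_k and d_k are the probabilities of
# proposing a birth or death at a tree with k terminal nodes; the two
# change moves share the remaining probability mass.

def propose_move(k, k_max, rng=random):
    b_k = 0.25 if k < k_max else 0.0   # no birth at the maximal size
    d_k = 0.25 if k > 1 else 0.0       # no death at the root-only tree
    u = rng.random()
    if u < b_k:
        return "birth"
    if u < b_k + d_k:
        return "death"
    return "change-split" if u < b_k + d_k + 0.25 else "change-rule"
```

Disabling impossible moves at the boundaries (k = 1 and k = k_max) is what makes the birth/death pair reversible over the whole range of tree sizes.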
The first two moves, birth and death, are reversible and change the dimensionality of the vector Θ. The remaining moves provide jumps within the current dimensionality of Θ. The change-split move is included to make "large" jumps which potentially increase the chance of sampling from regions of maximal posterior density, whilst the change-rule move makes "local" jumps that allow the neighbouring areas of interest to be explored in detail.
For the birth moves, the proposal ratio R is written as

R = [q(Θ|Θ′) p(Θ′)] / [q(Θ′|Θ) p(Θ)],

where q(Θ′|Θ) is the proposal distribution of moving from the current parameters Θ to the proposed Θ′, q(Θ|Θ′) is the probability of the reverse move, Θ′ and Θ are the (k + 1)- and k-dimensional vectors of DT parameters, respectively, and p(Θ), p(Θ′) are the prior probabilities of the DT models with parameters Θ and Θ′:

p(Θ) = (1/K) (1/S_k) ∏_{i=1}^{k−1} [m N(s^var_i)]^(−1),

where N(s^var_i) is the number of values of s^var_i which can be assigned as a new splitting rule, S_k is the number of ways of constructing a DT with k terminal nodes, and K is the maximal number of terminal nodes, K = s_max + 1.
For binary DT models, the number S_k is defined by the Catalan number,

S_k = (2k)! / (k! (k + 1)!),

and we can see that for k ≥ 25 the number of possible DT models becomes very large, S_k ≥ 4.8 × 10^12.
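The growth of S_k is easy to check numerically; the Catalan number here is exact integer arithmetic:

```python
from math import comb

# S_k via the Catalan number S_k = (2k)! / (k! (k+1)!) = C(2k, k) / (k+1).
# Integer division is exact because (k+1) always divides C(2k, k).

def S(k):
    return comb(2 * k, k) // (k + 1)

# S(1) = 1, S(2) = 2, S(3) = 5, ...; S(25) already exceeds 4.8 * 10**12
```

This confirms the figure quoted above: at k = 25 the number of tree shapes alone, before counting the choices of splitting variables and rules, is already of the order 10^12.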
The above proposal distributions are defined as follows. For a birth move, a terminal node is picked uniformly from the k available, and the new splitting variable and rule are drawn from the priors, so that

q(Θ′|Θ) = b_k (1/k) [m N(s^var)]^(−1),

where b_k is the probability of proposing a birth move from a DT with k terminal nodes. The reverse (death) move picks the node to prune uniformly from the D_Q1 candidates,

q(Θ|Θ′) = d_{k+1} (1/D_Q1),

where d_{k+1} is the probability of proposing a death move and D_Q1 = D_Q + 1 is the number of splitting nodes whose branches are both terminal nodes after the birth. Finally, the ratio R_b for a birth move is written as

R_b = (d_{k+1} k S_k) / (b_k D_Q1 S_{k+1}).

The number D_Q1 depends on the DT structure, and it is expected that D_Q1 < k, k = 1, . . . , K.
In particular, when k = 2, . . . , K − 1, we can assume d_{k+1} = b_k, so that R_b = k S_k / (D_Q1 S_{k+1}). Then, as k → K, the DT models are growing and S_{k+1} > S_k, so that the proposal ratio R remains between 0 and 1.
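The behaviour of the simplified birth ratio R_b = k S_k / (D_Q1 S_{k+1}), under the assumption d_{k+1} = b_k, can be checked numerically; the example values of k and D_Q1 below are ours:

```python
from math import comb

# Numeric check of the simplified birth ratio under d_{k+1} = b_k.
# S_k is the Catalan number counting binary DT shapes.

def catalan(k):
    return comb(2 * k, k) // (k + 1)

def birth_ratio(k, d_q1):
    """R_b = k * S_k / (D_Q1 * S_{k+1}), with D_Q1 < k expected."""
    return k * catalan(k) / (d_q1 * catalan(k + 1))

# e.g. birth_ratio(10, 4): S_{11}/S_{10} ~ 3.5, so the ratio is below 1
```

Since S_{k+1}/S_k tends to 4 as k grows, the ratio stays below 1 whenever D_Q1 is not much smaller than k/4, which matches the observation that birth moves become harder to accept as the tree approaches its maximal size.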
Alternatively, for the death moves, which prune a DT with k + 1 terminal nodes back to k, the ratio R_d is the reciprocal of R_b:

R_d = (b_k D_Q1 S_{k+1}) / (d_{k+1} k S_k).

Because DT models are hierarchical structures, changes to the nodes located close to the root can cause drastic changes to the distribution of data points at the lower levels. For this reason there is only a very small chance of accepting a change in a node near the DT root. This means that the MH algorithm will tend to explore DT models in which the nodes chosen to be changed are far from the root node. These nodes typically contain small numbers of data points; consequently, the value of the likelihood does not change much, and such moves are mostly accepted.
The negative impact of this is that the MH algorithm cannot explore the full posterior distribution of DT models, see e.g. [2]. The DT models grow very quickly during the first burn-in samples because the increase in the log likelihood for a birth move is much larger than that for the other moves, so almost every birth move is accepted. Once a DT has grown, changes in nodes near the DT root have a very small acceptance probability, and the MH algorithm tends to get stuck in particular DT structures.
A new prior suggested in [4] is capable of controlling the overgrowth of DT models by reducing the probability of proposing excessive birth moves for a given split. Using this prior, the MH algorithm draws a proposal from a uniform distribution whose parameters are estimated on the data points that fall into the split. The use of the new prior has improved the results of Bayesian averaging over DT models on benchmark data sets as well as on real-world problems.
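One plausible reading of this data-dependent proposal can be sketched as follows; this is our interpretation of the idea attributed to [4], not the reference's exact construction:

```python
import random

# Hedged sketch: instead of drawing a new threshold from the global prior,
# draw it from a uniform distribution whose bounds are estimated on the
# data points that actually fall into the split being changed.

def propose_rule(points_in_node, var, rng=random):
    """Propose a threshold confined to the node's own range of `var`."""
    values = [row[var] for row in points_in_node]
    lo, hi = min(values), max(values)
    return rng.uniform(lo, hi)
```

Confining the proposal to the node's own range avoids wasting birth moves on thresholds that would produce empty children, which is one way excessive births can be suppressed.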