Segmenting accelerometer data from daily life with unsupervised machine learning

Purpose Accelerometers are increasingly used to obtain valuable descriptors of physical activity for health research. The cut-points approach to segment accelerometer data is widely used in physical activity research but requires resource expensive calibration studies and does not make it easy to explore the information that can be gained for a variety of raw data metrics. To address these limitations, we present a data-driven approach for segmenting and clustering the accelerometer data using unsupervised machine learning. Methods The data used came from five hundred fourteen-year-old participants from the Millennium cohort study who wore an accelerometer (GENEActiv) on their wrist on one weekday and one weekend day. A Hidden Semi-Markov Model (HSMM), configured to identify a maximum of ten behavioral states from five second averaged acceleration with and without addition of x, y, and z-angles, was used for segmenting and clustering of the data. A cut-points approach was used as comparison. Results Time spent in behavioral states with or without angle metrics constituted eight and five principal components to reach 95% explained variance, respectively; in comparison four components were identified with the cut-points approach. In the HSMM with acceleration and angle as input, the distributions for acceleration in the states showed similar groupings as the cut-points categories, while more variety was seen in the distribution of angles. Conclusion Our unsupervised classification approach learns a construct of human behavior based on the data it observes, without the need for resource expensive calibration studies, has the ability to combine multiple data metrics, and offers a higher dimensional description of physical behavior. States are interpretable from the distributions of observations and by their duration.


Introduction
Accelerometers are increasingly used for studying daily physical activity. A common technique to process accelerometer data is the so-called 'cut-points' approach. This approach calculates the time during which the acceleration registered by the accelerometer lies between certain thresholds that define physical activity intensity levels (sedentary, light, moderate, vigorous), for different bout durations [1]. The threshold values used in this approach are calibrated relative to a metabolic equivalent (MET) level derived from indirect calorimetry which, in brief, is a proxy for energy expenditure relative to rest [2,3]. Results from the cut-points approach are easy to report and reproduce. However, this approach comes with several challenges. Firstly, the complex relationships of acceleration with energy expenditure, activity types, study populations, and study designs mean that cut-points easily overfit to the experimental conditions under which they are derived. Secondly, the approach involves many parameters, such as bout length, that are often chosen without a clear exercise physiological motivation.
Thirdly, the cut-points approach leads to collinearity between classes, which partly results from the compositional nature of the data [4] and partly from causal relations between behaviors [5].
This collinearity complicates the study of interactions between behavioral categories [6].
The cut-point approach traditionally used the magnitude of acceleration as its input. The orientation of the accelerometer under static conditions has also emerged as an additional informative metric to detect human posture [7], more recently for the detection of sleep [8], and in the Sedentary Sphere method for detecting sitting behavior and visualizing the data [9].
Further, the data can also be explored using automated methods such as machine learning.
Machine learning methods that use labelled data, referred to as supervised machine learning, have previously been used for activity type classification and energy expenditure estimation [10-13]. Although such methods have shown potential for physical activity intensity assessment, they have disadvantages similar to the cut-points approach in that the trained classifier may overfit to the specific experimental conditions under which it was trained. Unsupervised machine learning, on the other hand, has received less attention in relation to physical activity intensity assessment.
These methods are data-driven, allow identification of the characteristic states in the data, and can be applied to free-living data directly. Note that they are called states rather than categories, because they are defined by a Markov model rather than by absolute thresholds. As a result, they do not require time consuming and expensive calibration studies including a year of work to plan and conduct the study, they do not require costs related to exercise laboratory usage, and they may avoid arbitrary decisions in the design of the cut-point approach.
We hypothesize that if such a data-driven approach to segment the data is only provided with input data that has a known physiological meaning, like the magnitude of acceleration, it may be possible to learn physiologically meaningful segments in the data. If successful, this would overcome limitations of the methods requiring population sample calibration, such as the cut-point approach. Our primary aim is to implement a Hidden Semi-Markov Model (HSMM) to identify states characterized by the intensity of the activity undertaken. There is currently no gold standard method for categorising activity intensity. We therefore assess the comparability of our new approach with the traditional cut-points approach, by looking at collinearity between time spent across different activity intensity states and between cut-point categories. Large collinearity between intensity states or categories complicates modeling behavioral interactions. We investigate this collinearity with correlation analysis and principal component analysis [6,14-16]. Further, we examine the plausibility of the relation between resulting HSMM-defined activity intensity states, cut-points categories and time-use diary records. A supplementary aim is to assess the influence of adding accelerometer orientation metrics.
In this study we used data from 500 fourteen-year-old participants. Written consent was first required from a parent/guardian, and verbal consent from the cohort member [17,19]. A random subset of 500 participants was selected for the present analysis.
Millennium Cohort Sixth Sweep data, protocols and metadata are available to the scientific community under DOI 10.5255/UKDA-SN-8156-3.

Study protocol
Interviewers placed tri-axial accelerometers (GENEActiv) with respondents during home visits and requested them to wear the device on their non-dominant wrist for two complete days, one during the week and one at the weekend, randomly selected at the time of placement. Each day lasted for 24 hours, from 4am to 4am the following morning. Participants received text messages reminding them to complete the tasks on the selected days. Upon completion of the second day of activity data collection respondents were required to return the accelerometer in a pre-paid envelope. The accelerometer sample frequency was set at 40 Hertz and the dynamic range was ±8g. The orientation of the acceleration axes, seen from the anatomical position, is as follows: the x-axis points in medio-lateral direction (direction of thumb), the y-axis in longitudinal direction (direction of middle finger), and the z-axis in dorsal-ventral direction (perpendicular to skin). For further details on the study design and protocol see [18,19]. Additionally, participants were asked to record their categories of behavior in a time use diary [17]. Participants were asked to provide a full record of what they did on the two days (activities), from 4am to 4am the next day, as well as where they were, who they were with, and how much they liked each activity, using pre-coded lists. Participants were offered the choice between an app and an online version (with a paper version available for those unable or unwilling to use one of the other two).

Accelerometer data pre-processing
The raw data from the accelerometer are processed with the R package GGIR [20], which extracts the two days on which the accelerometer was supposed to be worn (days defined from 4am to 4am). Next, it estimates the calibration error based on static periods in the data and corrects for it if necessary [21], and estimates accelerometer non-wear time using a previously reported algorithm [22,23]. The main metrics extracted from the data are five second averages of the magnitude of acceleration and of the x-, y-, and z-angles. For all continuous time periods with no z-angle change of more than 5 degrees lasting at least 5 minutes, the acceleration values were set to zero to take out the possible influence of increased calibration error during sustained inactivity periods, e.g. as a result of temperature [21].
To account for variation in the sign of the signal as a result of wearing the accelerometer upside down, the angles were corrected as follows: if the x-angle has a positive median over all time periods detected as active (calculation described in the next section), the device is considered to be worn incorrectly. In that case, the x-angle and y-angle are negated (flipped around zero) to mirror the orientation.
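The correction heuristic can be illustrated with a minimal sketch. It assumes per-epoch arrays of x- and y-angles and a boolean mask marking the epochs detected as active; the function name and argument names are hypothetical and this is not the code used in the study.

```python
import numpy as np

def correct_wear_orientation(x_angle, y_angle, active):
    """Negate x- and y-angles when the device appears to be worn upside down.

    Hypothetical helper: `active` is a boolean mask of epochs detected as active.
    Returns the (possibly negated) angles and a flag telling whether they were flipped.
    """
    x_angle = np.asarray(x_angle, dtype=float)
    y_angle = np.asarray(y_angle, dtype=float)
    active = np.asarray(active, dtype=bool)
    # Without any active epochs there is nothing to base the decision on.
    if not active.any():
        return x_angle, y_angle, False
    # A positive median x-angle during active time is taken as incorrect wear.
    flipped = bool(np.median(x_angle[active]) > 0)
    if flipped:
        return -x_angle, -y_angle, True
    return x_angle, y_angle, False
```

The flag makes it possible to report how many recordings were mirrored, which is useful when checking the heuristic against known wear positions.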

Conventional cut-points approach
The acceleration magnitude and angle-z metric are used to assign each of the 5-second epochs to one of the following ten exclusive categories:
• Sustained inactivity: all continuous time periods with no z-angle change of more than 5 degrees lasting at least 5 minutes [8];
• Inactivity: defined, outside identified sustained inactivity, as acceleration below 40 mg, divided into periods of inactivity lasting less than 10 minutes, bouts of inactivity lasting between 10 and 29.9 minutes, and bouts of inactivity lasting at least 30 minutes;
• Light physical activity (LPA): defined as acceleration between 40 and 120 mg, subdivided into spontaneous LPA lasting less than 1 minute, bouts of LPA lasting between 1 and 9.9 minutes, and bouts of LPA lasting at least 10 minutes;
• Moderate or vigorous physical activity (MVPA): defined as acceleration above 120 mg, subdivided into spontaneous MVPA lasting less than 1 minute, bouts of MVPA lasting between 1 and 9.9 minutes, and bouts of MVPA lasting at least 10 minutes.
In order to account for natural variations in acceleration values within one activity bout, bouts were calculated so that short interruptions accumulate to no more than 20% of the bout length. In the absence of validated cut-points for our population, we used the reported acceleration values across activity types in children and adults from a study by Hildebrand and colleagues [24]. The choice for three intensity levels, inactivity, LPA and MVPA, is widely used in the physical activity research community. The sub-classification of these levels in bout durations was guided by the common practice to look for bouts of at least 10 minutes of MVPA [1], and the common practice of looking for bouts of at least 30 minutes of inactivity or sedentary behavior [25]. The sustained inactivity category has been shown to be a proxy for sleeping time in adults [8], but could generally be interpreted as time segments without movement and rotation.
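As an illustration of the category definitions above, the following sketch assigns an intensity level to each epoch and then labels strict runs of identical levels by duration. It is a simplification under stated assumptions: the function names are hypothetical, values exactly at 40 or 120 mg are assigned to the higher level, and the tolerance for interruptions of up to 20% of the bout length is not implemented.

```python
import numpy as np

EPOCH_S = 5                 # epoch length in seconds
INACTIVITY_MAX_MG = 40.0    # below this value: inactivity
LPA_MAX_MG = 120.0          # from 40 up to this value: LPA; from here on: MVPA
LEVELS = np.array(["inactivity", "LPA", "MVPA"])

def intensity_levels(acc_mg, sustained_inactive):
    """Assign one intensity level per epoch (boundary handling is an assumption)."""
    idx = np.digitize(np.asarray(acc_mg, dtype=float), [INACTIVITY_MAX_MG, LPA_MAX_MG])
    labels = LEVELS[idx].astype(object)
    # Sustained inactivity overrides the acceleration-based level.
    labels[np.asarray(sustained_inactive, dtype=bool)] = "sustained inactivity"
    return labels

def run_lengths(labels):
    """Return (label, n_epochs) for each run of identical consecutive labels."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    return [(lab, n) for lab, n in runs]

def bout_class(level, n_epochs):
    """Label a strict run by duration; no interruption tolerance is applied here."""
    if level == "sustained inactivity":
        return level
    minutes = n_epochs * EPOCH_S / 60
    short, long_ = (10, 30) if level == "inactivity" else (1, 10)
    if minutes < short:
        return f"{level} <{short} min"
    if minutes < long_:
        return f"{level} {short}-{long_} min"
    return f"{level} >={long_} min"
```

Summing the epochs per `bout_class` label and multiplying by the epoch length gives the time spent per category for one day.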

Hidden semi-Markov models
The goal of using an unsupervised method is to segment the data in time periods that can be clustered into segments with similar behavior. Hidden Markov Models (HMMs) as used by others for accelerometer data [26,27] do not model the duration of the state. The related Hidden Semi-Markov Models (HSMM) have the advantage of modelling time distribution for the different behavioral states. HSMM have proven to be valuable for similar segmentation tasks in ubiquitous computing [28,29]. In a Hidden semi-Markov model (HSMM) clustering is performed into hidden states. The word hidden is used because the states are not directly observed, but found by the algorithm. Further, the abstract word states is used for the data clusters because we do not know (yet) what physical activity intensity category they represent.
However, in practice a state can be interpreted as a physical activity intensity category, with its own characteristic distributions of orientation and acceleration values (the observations that are not hidden) and a characteristic distribution of duration. A graphical representation of the HSMM is visualized in Fig 1. The HSMM is an extension of the widely used Hidden Markov Model [30]. The difference with traditional Hidden Markov models is an explicit distribution for duration of the state. This duration, sometimes called sojourn time, is the number of time steps that the model resides in one state before transitioning to the next state.

The observations (acceleration and orientation values) are modelled as multivariate Gaussian distributions, where each state holds its own mean and variance parameters. The durations are modelled as Poisson distributions, where each state holds its own lambda parameter. In addition, there is a transition probability matrix that indicates how likely it is to transition from each state to each other state. The parameters of the model (observation distribution parameters, duration distribution parameters, transition matrix) are learned in a Bayesian manner, with a Hierarchical Dirichlet Process HSMM (HDP-HSMM) as presented in [31]. In Bayesian parameter learning, all parameters are represented as prior distributions, and are updated based on the data (Bayesian inference). The specific inference method used for the HDP-HSMM is Gibbs sampling with the weak limit approximate algorithm. In Gibbs sampling, each of the parameters is updated iteratively, by sampling from the conditional probabilities for that parameter, based on the current estimated distributions.
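To make the generative structure concrete (Gaussian observations, Poisson durations, and a transition matrix), the sketch below draws a state sequence and observations from a small HSMM with fixed parameters. It only illustrates the model structure, not the Bayesian learning procedure; the function name, the shift of the Poisson duration by one time step, and the uniform choice of the initial state are assumptions made for this example.

```python
import numpy as np

def sample_hsmm(n_steps, means, covs, lambdas, trans, rng):
    """Draw states and observations from an HSMM with known parameters.

    means: (K, d) state means, covs: (K, d, d) state covariances,
    lambdas: (K,) Poisson duration parameters,
    trans: (K, K) transition matrix with a zero diagonal (no self-transitions).
    """
    n_states = len(lambdas)
    states = np.empty(n_steps, dtype=int)
    obs = np.empty((n_steps, means.shape[1]))
    t = 0
    state = rng.integers(n_states)  # assumption: uniform initial state
    while t < n_steps:
        # Assumption: shift by one so that every visit lasts at least one step.
        dur = min(1 + rng.poisson(lambdas[state]), n_steps - t)
        states[t:t + dur] = state
        obs[t:t + dur] = rng.multivariate_normal(means[state], covs[state], size=dur)
        t += dur
        # The next state depends only on the current state.
        state = rng.choice(n_states, p=trans[state])
    return states, obs
```

Calling it with, for example, two states that have clearly separated means produces long runs of one state followed by a switch, which is the pattern the inference step tries to recover from real data.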
In the HDP-HSMM, the transition probabilities between the states are represented as a Dirichlet Process for, in principle, an infinite number of states. However, the number of states is not actually going to be infinite for the following reasons: the Dirichlet Processes share a common parameter that causes the model to favor a small number of states; the weak limit approximate algorithm assumes a maximum number of states (which needs to be provided by the user); and the actual number of states is inferred by the model, with states being dropped if there are no transitions going in and out of them. Therefore, the number of states can be smaller than the maximum number of states defined by the user.
The forward-backward algorithm calculates the distribution over states, conditioned on the observed data and all model parameters. It serves as a step in the Gibbs sampling, and is used to determine the optimal state sequence for a given observation sequence. In the forward pass of this algorithm, the initial state assignment and duration are randomly sampled from the distribution.
Note that this random sampling can result in slightly different state assignments in multiple runs of the algorithm.

HSMM parameters
There are several choices that influence the runtime of training the model. Since the forward-backward algorithm is used in each iteration of the training with Gibbs sampling, the complexity of this algorithm contributes to the total training time. The forward-backward algorithm has a runtime of $O(T \cdot N \cdot D + T \cdot N^2)$ [31]. In this formula, $T$ is the length of the sequence, $N$ is the number of states and $D$ is a maximum chosen duration length.
The maximum state duration $D$ is a user defined input to control training time. We set $D$ at 720 five second epochs, which corresponds to 60 minutes.
The maximum number of states $N$ is an input for the weak limit algorithm. Although the number of states is inferred by the algorithm, it can be useful to limit the number of states. It is more computationally efficient to have a small number of states, and easier to interpret the resulting states. $N = 10$ was evaluated, similar to the number of cut-points categories.
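A short worked example shows how these settings drive the cost of one forward-backward pass. The symbols T (sequence length), N (number of states) and D (maximum duration) are used here as labels for the three quantities in the runtime formula, and the operation count ignores constant factors, so it indicates relative cost only.

```python
def forward_backward_cost(T, N, D):
    """Operation count proportional to T*N*D + T*N^2 (constant factors ignored)."""
    return T * N * D + T * N ** 2

T = 24 * 60 * 60 // 5   # five second epochs in one 24-hour day: 17280
N = 10                  # maximum number of states
D = 720                 # maximum duration: 720 epochs of 5 s = 60 minutes

duration_term = T * N * D        # 124,416,000
transition_term = T * N ** 2     # 1,728,000
total = forward_backward_cost(T, N, D)
```

With these values the duration term is 72 times larger than the transition term, so in this setting the maximum duration has a much stronger effect on training time than the number of states.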
The number of iterations of Gibbs sampling needs to be chosen so that the algorithm converges to a stable parameter set. Early stopping is applied when the Hamming distance between the assigned states of two consecutive iterations is smaller than 0.05. In other words, convergence is reached if fewer than 5% of the time steps are assigned a different state than in the previous iteration. Further, we chose a maximum of 15 iterations.
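The stopping rule can be sketched as follows. The callable `resample` is a hypothetical stand-in for one Gibbs sampling iteration that returns the current state assignment for all time steps; it is not a function of any particular library.

```python
import numpy as np

def hamming_fraction(prev_states, new_states):
    """Fraction of time steps whose assigned state differs between two iterations."""
    prev_states = np.asarray(prev_states)
    new_states = np.asarray(new_states)
    return float(np.mean(prev_states != new_states))

def run_until_stable(resample, max_iter=15, tol=0.05):
    """Iterate until the state assignment changes for less than `tol` of the steps.

    `resample` is hypothetical: it performs one sampling iteration and returns
    the state sequence. Returns the last state sequence and the iterations used.
    """
    prev = resample()
    for it in range(1, max_iter):
        new = resample()
        if hamming_fraction(prev, new) < tol:
            return new, it + 1
        prev = new
    return prev, max_iter
```

This comparison uses the raw state labels, so it assumes that labels keep their identity from one iteration to the next.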
Lastly, the metrics that are used as observations in the model can be varied. We experimented with two models with different observations as input. The first model used only acceleration; the second model used acceleration together with the x-angle, y-angle, and z-angle. We will further refer to the resulting models as the acceleration model and the acceleration+angles model, respectively. The information used by the acceleration model is most comparable to the cut-points approach, since the cut-points approach only uses angle for one out of the ten categories. It is not possible in HSMM to instruct the model to use specific variables for only a single state. For the implementation of the Bayesian HSMMs the Python package pyhsmm (https://github.com/mattjj/pyhsmm) is used. The code is available on GitHub ([32] and [33]).

Evaluation
In the following sections, we use the term 'categories' for the time slices produced by the cut-points approach and the term 'states' for the time slices produced by the HSMM approach. Time spent per day in each state and in each cut-points category was calculated for participants with full 24 hours of valid data. Principal component analysis (PCA) was then applied to these time-use variables: the more components are needed to explain the variance of the data, the less collinear the variables are. As an alternative way of looking at information dimensionality we also looked at the cross correlation between states and between cut-point categories.
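The idea of counting the components needed to explain the variance can be sketched as follows, using a threshold such as the 95% explained variance reported for the results. The input is assumed to be a matrix with one row per participant-day and one column per state or category, holding time spent. The variables are centered but not rescaled here, which is an assumption, and the function name is hypothetical.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Number of principal components needed to reach `threshold` explained variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each variable
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
    explained = s ** 2 / np.sum(s ** 2)        # variance share per component
    # First position where the cumulative share reaches the threshold.
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)
```

Strongly collinear variables yield a small count, while variables that vary independently of each other yield a count close to the number of variables.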
Next, to assess comparability of the cut-points and HSMM approaches, correlation coefficients were calculated between time spent in states and cut-points categories, grouped by acceleration level. We expect that the states come with a plausible variation with respect to the conventional method output. To investigate the differences, a descriptive comparison was done of HSMM states, acceleration values, angle values, cut-points categories, and time use diary records. To ease interpretation, only days of measurement with full 24 hours of valid data (no accelerometer non-wear time) were considered for the descriptive comparison [22,34]. We focus mostly on the HSMM model using acceleration+angles, since we expect the addition of angle variables to give extra insights; acceleration-only results are reported in the supplement.
We chose a sample size of 500 participants for two reasons. First, we would like to demonstrate that the HSMM also works in relatively small samples, such that it can be applied in a wide range of study conditions not limited to larger cohorts. Cut-points are typically derived from laboratory studies with fewer than 100 participants, so 500 participants is still a large sample size by comparison.
Second, adding more data would increase the computing time. To evaluate whether a model trained on a subsample can generalize to a larger population, we tested the reproducibility by comparing the original model with models trained on subsets of the participants, using the Kullback-Leibler divergence, which is an information-theoretic measure of the distance between two statistical distributions [35]. The Kullback-Leibler divergence is denoted with KL(P|Q), where P is the distribution of the observed values as learned by the original model on 500 participants, and Q the corresponding distribution for the subset model.
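For Gaussian observation distributions, the Kullback-Leibler divergence has a closed form. The sketch below computes it for two multivariate Gaussians; whether the study used this closed form or a numerical estimate is not stated here, so this is only an illustration, and the function name is hypothetical.

```python
import numpy as np

def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """KL divergence from Gaussian P to Gaussian Q, in nats (closed form)."""
    mu_p = np.atleast_1d(np.asarray(mu_p, dtype=float))
    mu_q = np.atleast_1d(np.asarray(mu_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    k = mu_p.size
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    # slogdet returns (sign, log|det|); covariance matrices have positive determinants.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return float(0.5 * (np.trace(cov_q_inv @ cov_p)
                        + diff @ cov_q_inv @ diff
                        - k + logdet_q - logdet_p))
```

The measure is zero for identical distributions and is not symmetric, so the order of the original model (P) and the subset model (Q) matters when reporting it.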

Results
A total of 9122 participants agreed to wear the accelerometer, and 4970 participants returned the accelerometer and time use diary, out of which a random subsample of data from 500 participants was used for the present study. The demographics of this sample are shown in Table 1. Note that the data of all 500 participants, regardless of whether the data were complete, could be used for training the HSMM model. The demographics of the participants who wore the accelerometer for 24 hours on both days are also shown in Table 1. In the calculation of the acceleration values, on average 31% (standard deviation 5%) of the epochs were replaced by zero.

Correlation analyses and PCA
Ten different states were found by the HSMM training for both the acceleration model and the acceleration+angles model. In Table 2 we see how much time the participants spent on average in each cut-points category and each state. A detailed description of the states in the acceleration+angles model is given in the supplement. We also compare the values in Table 2 for parallels between states and cut-points categories. The same can be done for the acceleration model (available in S2

Fig 4. Boxplots for acceleration and angle values per HSMM state (acceleration+angles model, left column) and cut-points category (right column)
The model states and the cut-points categories can each be grouped into combinations that have similar acceleration levels, corresponding to sustained inactivity, inactivity, LPA and MVPA, see Table 2.

Comparison against time use diary
To assess whether there is a relation between the states and activities reported in the time use diary, we focused on the 10 most common activities reported in the time use diary, see Table 3 for a comparison with the acceleration+angles model (the same table for the acceleration model is available as S3 Table in

Reproducibility
The  Table. The HSMM has this advantage of interpretability over other approaches, such as Deep Neural Networks. We intentionally do not use other signal features, because that would introduce the risk that the HSMM model detects activity types rather than intensity. Activity type is a different dimension of physical activity and mistakenly classifying types would undermine interpretation.
Therefore, our comparison with the time use diary should be interpreted with care as time use categories do not only reflect variations in intensity.
Our findings show that the HSMM derived states were related to cut-points categories.
For example, the HSMM found short lasting states with high acceleration and long lasting states with low acceleration, which is consistent with data derived from cut-points approaches. Further, when states and cut-points categories are grouped by acceleration level, correlations of 0.56 and higher are observed between the two approaches. The mean acceleration for some of the states was close to the threshold value used in the cut-points approach, which indicates disagreement on the range or distribution of acceleration for typical behaviors. The distributions of durations in the HSMM states are also different from the cut-point categories. However, these differences may well be explained by the fact that the HSMM is driven by the distributions in observations (accelerometer recording) from data collected outside a laboratory in the daily life of British teenagers, while the cut-points approach in our case was driven by cut-points derived from Norwegian children and adults performing a very specific set of activity types in a laboratory setting. By showing most states to last less than 17 minutes, our results contribute to the debate about current practice to quantify physical activity and inactivity in ten minute bouts [43] and thirty minute bouts respectively [44]. Consequently, it may not be surprising that no perfect agreement is found between the cut-points approach and the HSMM approach. More importantly, the HSMM covers a plausible range of acceleration levels (low, medium, and high), durations (from less than a minute to more than 30 minutes), and to some extent, although difficult to interpret, angle ranges. Further, it was reassuring to observe that the principal component analysis applied to time spent in states has a less steep scree plot compared to time use variables based on the cut-points approach. This indicates that research on interactions between behaviors will be less challenged by collinearity.
An important strength of the HSMM is that it can account for multi-variate input, even if no prior theory exists on why or how the additional input variables could contribute. In contrast, the cut-points approach needs such a theory [45]. Our PCA results indicate that adding the angles offers a description of physical activity with less collinearity and possibly higher dimensions compared to using acceleration only. The HSMM approach allows us to move towards a description of physical behavior based on, and driven by, the accelerometer data that can feasibly be collected in both small and large scale populations. The HSMM is not biased by the subjective nature of self-report methods, avoids the complexities of accounting for inter-individual variation in body composition in energy expenditure estimation and the variation in the relationship between body composition and energy expenditure between activity types, and avoids the difficulties with generalizability of supervised learning techniques that rely on training data composed of small numbers of participants and/or activity types. The HSMM approach will not directly fit into the research framework aimed at providing public recommendation on layman constructs like steps or minutes of moderate to vigorous physical activity per day. However, HSMM may speed up and facilitate a data driven approach that could help to understand how variations in activity characteristics, as measured by acceleration and arm angle, relate to health and disease.

Strengths and limitations
The agreement between time spent in the HSMM states and time use diary categories was poor. This is likely because the time use diary collects broad information on activity type (10 items) and activity context at a low time resolution (10-minute slots) in comparison to the physical activity intensity construct as measured by the cut-points approach. Nevertheless, the time use diary allowed us to evaluate how these two constructs relate to one another. A challenge in the design of the present study was that there exists no gold standard for intensity profiling of physical activity in a real life setting in a representative population. Therefore, we combined a variety of analyses to assess the comparability of the HSMM and cut-points approaches from different perspectives. Future studies might complement the evaluation of models trained on real life data with data from lab studies, where activities can be directly observed and energy expenditure can be measured with indirect calorimetry [24,40,46]. The states found by the HSMM are based on patterns in the data and are not specific for the research question of a user. Therefore, when using the HSMM states as physical behavior descriptors in further research, it might be good practice to undertake post-processing of the data. This includes, for example, grouping states that have similar characteristics regarding the research question (e.g. similar acceleration level), or using the majority state for a larger window to adjust for the desired time granularity.
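The second post-processing option, taking the majority state over a larger window, can be sketched as follows. The function name is hypothetical; dropping an incomplete trailing window and resolving ties in favor of the lowest state index are assumptions made for this example.

```python
import numpy as np

def majority_state(states, window, n_states):
    """Reduce a state sequence to the most frequent state per fixed-size window."""
    states = np.asarray(states, dtype=int)
    n_windows = len(states) // window            # incomplete trailing window is dropped
    blocks = states[:n_windows * window].reshape(n_windows, window)
    # Count how often each state occurs in each window.
    counts = np.stack([(blocks == k).sum(axis=1) for k in range(n_states)], axis=1)
    return counts.argmax(axis=1)                 # ties go to the lowest state index
```

For example, with five second epochs a window of 120 epochs yields one state per 10 minutes, which matches the resolution of the time use diary.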
In practice, it seems not to be feasible to let the model converge to very consistent state assignment (e.g. < 1% Hamming distance). It is not clear from theory how much of this is attributed to variance in the data, and how much could be gained with more training iterations.
Therefore, future research (both empirical and theoretical) is needed to investigate the relationship between data size, population characteristics, and convergence.
Future research is needed to better understand the application of HSMMs on physical activity data. For example, the number of states can be varied to optimize face validity, while retaining interpretability and feasibility in terms of training time. The same holds for the size of the input time frame: instead of 5-second time frames, smaller or larger time frames can be used as input.
It is also possible to include more input metrics in the model, although that may also complicate the interpretation of the states. In the present study we limited the number of metrics to facilitate a standardized comparison with the cut-points approach and to facilitate interpretation. The use of the z-angle for sustained inactivity detection in the cut-points approach does not undermine the standardized comparison, because the HSMM model also uses this information: when calculating the magnitude of acceleration that is used as input for the HSMM model, values are replaced by zero when the z-angle is constant for at least five minutes. The use of different distributions to represent the data in the HSMM model could be investigated, such as a lognormal distribution for the acceleration metric.
The use of metrics that describe the orientation of the device imposes challenges. Firstly, the interpretation of the distributions of the angle values is difficult, asking for visualization and comparison to specific activities to distinguish between states with different angle distributions.
Secondly, the correct wear position of the device becomes crucial. In this work, we took a heuristic approach to correcting for improperly worn devices. Future research is needed to build a reliable classifier that determines the wear position if signal metrics are used that are body side dependent [47].
Separating the gravitational component from acceleration sensor data and calibrating the sensor are known challenges [21,22], and the estimates of acceleration are not entirely free from calibration error. By replacing the magnitude of acceleration by zero for the time segments where the accelerometer does not change orientation we ensure that: calibration error as a function of accelerometer orientation does not influence the segmentation of the acceleration data; the contribution of white signal noise to the magnitude of acceleration is minimized; and bias caused by calibration error is as close to zero as possible across the recording.
We chose a random subset of 500 participants in this study, because we wanted to demonstrate that the HSMM method works in relatively small data sets. Suitability for small datasets is important for uncommon study populations, including the very old, people with rare diseases, and populations in hard to reach rural areas. The reproducibility experiment suggests that the model for a smaller subset of 250 approaches the model trained on the data of all 500 participants, except for the rarest states. This provides us with confidence that the model generalizes well to more data from the same target group and is not overfitted to the data it is trained on. However, the question remains how well the model generalizes to other populations, e.g. other age groups or countries. If enough data are available, it is also possible to train the model for a specific person; this would however make it more difficult to relate the resulting states among the participants.
This is a problem of finding the right balance between a model optimized for a specific population and a model that is representative of the general population.
The cut-points approach has known limitations, but despite these limitations it has been of tremendous value to the physical activity research community for decades. Therefore, we felt it important to aim for comparability with the traditional approach, while at the same time trying to address one of its limitations. Future analyses that compare the associations with health outcomes of time spent in categories assessed by the cut-point approach and of time spent in states assessed by the HSMM, in different population settings, will make it possible to assess the value of the HSMM approach for research. We conclude that applying Hidden Semi-Markov Models results in informative states, based on the data from a real-world setting. It is possible to relate the states to conventional cut-points categories, to interpret the meaning of the states. The unsupervised model can easily incorporate multiple input metrics, so that the states provide a higher dimensional description of physical behavior.