Validation and Reliability of a Classification Method to Measure the Time Spent Performing Different Activities

The aim of this study was to validate the performance and reliability of results obtained from a classification model that measures time spent performing activities in confined (CE) and unrestricted (UE) environments. In CE, participants wore a pair of biaxial and/or triaxial accelerometers while performing pre-determined training activities classified as variants of lying down, dynamic standing, sitting, walking and running on two separate days. A classification model trained with activities performed in a specific order during the first day was developed to validate the activities performed in a random order on the second day (CE) and over 24 hours on a separate day (UE). The performance of the classification model was validated against triaxial accelerometers using six (x, y and step counts for arm and thigh) or eight (same as six features plus z axis) features. The reliability of the classification model was tested in both environments using six features. Results revealed an overall accuracy of 94% in CE and 90% in UE. The sensitivity in CE and UE was 94% and 95% for lying down, 88% and 80% for dynamic standing, 97% and 89% for sitting, 96% and 78% for walking and 90% and 64% for running, respectively. No significant differences were noted between performances obtained with six or eight features. Results were highly reproducible in both environments. The results obtained from the classification model were accurate and reproducible, and highlight the potential use of this approach in research to quantify the time spent performing different activities.


Introduction
Activity monitoring systems are used to estimate energy expenditure using data captured by accelerometers and other sensors. They have been widely used due to their small size, low cost, and low power consumption [1,2]. Nevertheless, the measurement of energy expenditure does not allow the characterization of the different activities performed during a determined time frame. As reviewed by Preece et al (2009) and Yang & Hsu (2010), activities performed can be computed from raw accelerometry data using classification models that are obtained from machine learning classifiers (e.g., decision trees, neural networks, Bayesian classifiers, support vector machines and others).
The validation of classification models aimed at recognizing activities has been conducted in a confined environment (CE), showing a high accuracy [3][4][5][6]. Some studies have also been performed in a semi-unsupervised environment showing similar results [7][8][9][10][11][12][13]. Under a free unsupervised period of 4 hours, Ermes [14] showed a sensitivity (chances of classifying an activity as positive when it is indeed positive) of 98% for lying down, 80% for sitting/standing and 30% for walking when four annotated activities were considered over a total of nine recognized activities. Long et al. also demonstrated a sensitivity of 80% for walking and 93% for running when participants annotated five activities over a 10-hour period when using one accelerometer [15]. However, to our knowledge, no study has validated the results obtained from a classification model over 24 hours in an unrestricted environment (UE). One of the reasons is that the internal memory of devices is limited in size and quickly fills up when data are sampled at a high frequency. Nevertheless, recognizing only major categories of activities (i.e., lying down, dynamic standing, sitting and walking) does not require a high data-sampling frequency due to the nature of these activities and thus makes the recording over a longer period of time possible.
The first objective of this study was to build a classification model for biaxial and triaxial accelerometers and to validate the performance of this classification model in discriminating five different activities. Specifically, the performance was validated using data gathered under 2 hours in CE and under 24 hours in UE. The second objective was to compare the performance of the classification model using a set of six (x, y and step counts for arm and thigh) or eight (x, y, z and step counts for arm and thigh) features recorded from the triaxial accelerometers. The third objective was to assess the reliability of the results acquired from the classification model obtained with biaxial and triaxial accelerometers using six features under CE and UE. We hypothesized that (a) the results would be highly accurate in both environments using biaxial or triaxial accelerometers, (b) the performance of the classification model obtained with six or eight features would be similar, and (c) the results obtained from the classification model would be highly reproducible in both CE and UE.

Materials and Methods

Participants
A total of seventeen male and nineteen female students were recruited to complete this set of experiments. The inclusion criteria were as follows: (a) over the age of 18 years; (b) stable weight (±2 kg) within the past six months; (c) nonsmoker; (d) no drug or alcohol abuse; and (e) no orthopedic limitation. All experiments were conducted according to the guidelines laid down in the Declaration of Helsinki and all the procedures involving human participants were approved by the University of Ottawa ethics committees. Written informed consent was obtained from all participants.

Accelerometers
A pair of biaxial and/or triaxial activity-monitoring systems (accelerometers) (SenseWear Pro 3 Armbands, HealthWear Bodymedia, Pittsburgh, PA) were used. SenseWear Pro 3 Armbands were chosen because they provide access to raw data (acceleration axes and step counts) and provide accurate estimates of energy expenditure [16]. One accelerometer was placed around the upper arm (midway between the acromion and the olecranon) while the other was placed around the thigh (midway between the patella and the inguinal fold, on the exterior of the thigh). The internal clocks of both accelerometers were synchronized before the beginning of each session with the researcher's watch or with the participant's watch. The features recorded over time were the x and y acceleration axes and the step counts (for the arm and thigh) when using biaxial accelerometers, and the x, y and z acceleration axes and the step counts (for the arm and thigh) when using triaxial accelerometers. Therefore, the biaxial accelerometers provided six features while the triaxial accelerometers provided eight features. In terms of anatomic axes, the x, y and z axes represent the horizontal, vertical/frontal and sagittal axes, respectively. The acceleration measures were accumulated and averaged over a period of 5 s, while the step count measures were accumulated and averaged over a one-minute interval. The combination of the accelerations and step counts recorded every five seconds constitutes one data sample. Note that the step count value was therefore constant within each minute, while the accelerations changed every five seconds.
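The construction of one data sample from the two biaxial devices can be sketched as follows. This is a minimal illustration, not the Activity Recognition software itself: the tuple layout, the function names and the use of epoch start times in seconds are assumptions made here, since the export format is not described in the text.

```python
from typing import Dict, List, Tuple

MINUTE_S = 60  # step counts are reported per one-minute interval

# Assumed layout of one 5-s acceleration record: (epoch start in seconds, x, y)
Accel = Tuple[int, float, float]


def attach_steps(accel: List[Accel],
                 steps_per_minute: Dict[int, int]) -> Dict[int, Tuple[float, float, int]]:
    """Map each 5-s epoch to (x, y, steps), reusing the step count of its minute."""
    out = {}
    for t, x, y in accel:
        out[t] = (x, y, steps_per_minute.get(t // MINUTE_S, 0))
    return out


def six_feature_samples(arm: List[Accel], arm_steps: Dict[int, int],
                        thigh: List[Accel], thigh_steps: Dict[int, int]):
    """Join arm and thigh records on their shared epoch start times."""
    a = attach_steps(arm, arm_steps)
    th = attach_steps(thigh, thigh_steps)
    shared = sorted(set(a) & set(th))
    # One sample: (t, arm x, arm y, arm steps, thigh x, thigh y, thigh steps)
    return [(t, *a[t], *th[t]) for t in shared]


# Made-up values, for illustration only
arm = [(0, 0.1, 0.2), (5, 0.3, 0.4), (60, 0.5, 0.6)]
thigh = [(0, 1.0, 2.0), (5, 1.5, 2.5), (60, 3.0, 4.0)]
samples = six_feature_samples(arm, {0: 40, 1: 0}, thigh, {0: 42, 1: 3})
```

Within one minute the step count columns repeat while the acceleration columns change every five seconds, which mirrors the behavior described above.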

General Procedures of the Study
This study consisted of four experiments (Fig 1).
Experiment I-building the Classification Model. The data (acceleration axes and step counts) were obtained from accelerometers worn by participants performing 22 predetermined training activities classified as variants of lying down, dynamic standing, sitting, walking, running, biking, and climbing stairs in a specific order. The procedures were performed under the supervision of the researcher who recorded the beginning and end of each activity.
The INNERVIEW software (version 4.02; Bodymedia, Pittsburgh, PA) was used to extract the data obtained from the 22 predetermined training activities (training data) from the accelerometers (Fig 2A). Training data were exported in two Comma-Separated Values (CSV) files: one file for the accelerometer worn on the arm and one for the accelerometer worn on the thigh. Activity Recognition software was used to combine and to synchronize these two training data files, which produced a single file containing a sequence of training data samples. The associated activity for each sample was then identified based on the recording time. Transitions from one activity to another were manually removed from the training data set. Two classification models (support vector machines, kernel type: radial basis function; cost: 10; gamma parameter: 0.01) were then built using those training samples (recorded features and known activity): one classification model with the biaxial accelerometers (Experiment 1: Biaxial) and one with the triaxial accelerometers (Experiment 1: Triaxial). The Activity Recognition software uses the SVM implementation of the open source software library Java-ML as a classification algorithm [17]. To facilitate the discrimination between variants of dynamic standing and walking, a threshold of 30 steps per minute was used. The threshold was applied during data pre-processing in the training phase. If the step count for one data sample was lower than 30 steps per minute, it was assigned a value of 0 steps and then fed to the classifier as a training sample. The rationale for using a threshold of 30 steps per minute is that dynamic standing could be associated with minor lower body movement at low speeds for short distances (which is equivalent to one step every two seconds or less).
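The step-count pre-processing and the kernel named above can be sketched in a few lines. This sketch does not train an SVM and is not the Java-ML code used in the study. The feature order, the function names and the strict "lower than" comparison are assumptions, and the kernel is written in its common form exp(-gamma * ||u - v||^2), which is assumed here to match the parameterization used by the library.

```python
import math
from typing import List, Sequence

STEP_THRESHOLD = 30   # steps per minute, as stated in the text
GAMMA = 0.01          # RBF gamma parameter reported for the models

# Assumed six-feature order: arm x, arm y, arm steps, thigh x, thigh y, thigh steps
STEP_INDICES = (2, 5)


def apply_step_threshold(sample: Sequence[float],
                         step_indices=STEP_INDICES,
                         threshold: float = STEP_THRESHOLD) -> List[float]:
    """Return a copy of the sample with step counts below the threshold set to 0."""
    out = list(sample)
    for i in step_indices:
        if out[i] < threshold:
            out[i] = 0
    return out


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float = GAMMA) -> float:
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# A slow shuffle (12 steps/min at the arm) is zeroed; 30 steps/min at the thigh is kept
cleaned = apply_step_threshold([0.1, 0.2, 12, 0.3, 0.4, 30])
```

With the threshold applied, samples with occasional steps look like standing samples to the classifier, which is the stated intent of separating dynamic standing from walking.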
Experiment II-validating the performance of the Classification Model. To validate the performance of the classification model under CE (Experiment 2: Biaxial_CE and Experiment 2: Triaxial_CE), participants were asked to perform the same 22 predetermined training activities in a random order that was different for each participant. Initial analyses showed low accuracy for biking and climbing stairs (i.e., 37% for climbing stairs and 74% for biking in Experiment 2: Biaxial_CE). Therefore, these activities were removed from the classification models and were not further classified as part of this study. The classification model therefore classified climbing stairs or biking as either walking or running. During the validation in UE (Experiment 2: Biaxial_UE and Experiment 2: Triaxial_UE), participants were asked to mark down on a sheet of paper five categories of activities (lying down, dynamic standing, sitting, walking or running) with the start and finish times (precision within one second) over a 24-hour period. Dynamic standing was described to the participants as a static standing position that could include dynamic movement of the upper body. Since it could be associated with minor lower body movement, walking for short distances (less than 30 steps per minute) was also considered as dynamic standing. Examples include meal preparation, washing dishes, talking to someone while standing, etc. Walking was categorized as a displacement of more than 30 consecutive steps per minute. Examples include walking to work, walking to the bus stop, walking the dog, etc. Each data sample from Experiment II was obtained as previously described (i.e., INNERVIEW software, CSV, Activity Recognition software) and classified as an activity either by the biaxial or triaxial classification model (Fig 2B). When participants were doing other types of activities, the latter were classified as one of the selected activities.
The Activity Recognition software was used to coordinate this sequence. The total time spent performing each activity was determined as the product of the sampling interval (5 s) and the number of occurrences of the different activities. The classification was then compiled in a confusion matrix to determine the validity of the results obtained from the classification model. Under UE, participants were instructed to remove the accelerometers during all water activities, including bathing, but to wear them overnight.
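The time computation and the confusion matrix described above can be illustrated with a short sketch. The function names and the example labels are hypothetical; only the 5-s sample duration and the five activity categories come from the text.

```python
from collections import Counter
from typing import Dict, List

EPOCH_S = 5  # each classified data sample covers 5 s
ACTIVITIES = ["lying down", "dynamic standing", "sitting", "walking", "running"]


def time_per_activity(labels: List[str]) -> Dict[str, int]:
    """Total time (s) per activity: number of samples multiplied by 5 s."""
    counts = Counter(labels)
    return {a: counts.get(a, 0) * EPOCH_S for a in ACTIVITIES}


def confusion_matrix_seconds(real: List[str], classified: List[str]) -> List[List[int]]:
    """Rows are real activities, columns are classified activities, cells are seconds."""
    index = {a: i for i, a in enumerate(ACTIVITIES)}
    matrix = [[0] * len(ACTIVITIES) for _ in ACTIVITIES]
    for r, c in zip(real, classified):
        matrix[index[r]][index[c]] += EPOCH_S
    return matrix


# Made-up sequence of five samples (25 s), for illustration only
real = ["sitting", "sitting", "sitting", "walking", "walking"]
classified = ["sitting", "sitting", "dynamic standing", "walking", "walking"]
times = time_per_activity(classified)
matrix = confusion_matrix_seconds(real, classified)
```

Expressing the cells in seconds rather than sample counts matches the way the real and classified times are reported in the results tables.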
Experiment III-validating the performance of the Classification Model with triaxial accelerometer when using six or eight features. For this experiment, data samples obtained from six (i.e., x, y axes and step counts times two accelerometers) and eight features (i.e., x, y and z axes and step counts times two accelerometers) were compared in a CE and in a UE. The results in terms of activity classification were obtained from the same triaxial accelerometers while either including (eight features) or removing the z axis (six features) (Fig 1).
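Deriving the six-feature samples from the triaxial recordings amounts to dropping the two z columns. A minimal sketch follows, assuming a hypothetical feature order of (x, y, z, steps) for the arm followed by the same four values for the thigh; the actual column order of the exported files is not given in the text.

```python
from typing import Sequence, Tuple


def drop_z(sample8: Sequence[float]) -> Tuple[float, ...]:
    """Reduce an eight-feature triaxial sample to six features by removing both z axes."""
    # Assumed order: arm x, arm y, arm z, arm steps, thigh x, thigh y, thigh z, thigh steps
    ax, ay, _az, a_steps, tx, ty, _tz, t_steps = sample8
    return (ax, ay, a_steps, tx, ty, t_steps)


# Made-up sample, for illustration only
six = drop_z((0.1, 0.2, 0.9, 40, 1.0, 2.0, 0.8, 42))
```

Because both feature sets come from the same recordings, any difference in accuracy can be attributed to the z axis rather than to the device or the session.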
Experiment IV-investigating the reliability of the results obtained from the Classification Model using six features. The reliability of the results obtained from the classification model using six features was investigated. Results of Experiment 2: Biaxial were compared with results of Experiment 2: Triaxial in both CE and UE (Fig 1). The z axis (arm and thigh) from each data sample obtained with the triaxial accelerometers was removed for this analysis.

Statistical Analysis
Statistical analyses were performed in Excel (version 2007). Performance of the classification model was determined with the overall accuracy (i.e., mean proportion of all activities that are correctly classified per person) and using five indicators: sensitivity (chances of classifying an activity as positive when it is indeed positive), the positive predictive value (chances that an activity is indeed positive, when it is classified as positive), the F-Score (the "harmonic mean between sensitivity and positive predictive values" [18]) and the specificity (chances of classifying an activity as negative when it is truly negative) in a confusion matrix. Cohen's kappa coefficient (measure of the agreement between the real activity and the classifications) was also determined [19]. In order to investigate the difference between the performance of the classification model using six or eight features, a Wilcoxon matched-pairs signed rank test was performed using statistical software (Prism v5, GraphPad Software Inc., San Diego, CA). To investigate the reliability of the results obtained from the classification model when using six features (overall accuracies) in CE and UE, an independent samples t-test was performed. The underlying assumption of normality of the two-sample t-test was verified with a normal probability plot performed with Minitab 16. A linear tendency was observed in both plots, suggesting that it is reasonable to assume that the accuracy is normally distributed (data not shown). However, since the slopes were very different in CE, the equality of the variance was not assumed. As a result, a t-test with a Welch correction was performed with GraphPad Prism. Values are presented as percentage ± standard deviation.
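The indicators defined above follow directly from the cells of a confusion matrix. The sketch below uses the standard definitions and a made-up two-class matrix; the function names are hypothetical, and the accuracy is computed for a single matrix, whereas the study reports the mean across participants.

```python
from typing import Dict, List


def per_class_metrics(m: List[List[float]], k: int) -> Dict[str, float]:
    """Sensitivity, positive predictive value, F-score and specificity for class k.

    Rows of m are real activities and columns are classified activities.
    """
    total = sum(sum(row) for row in m)
    tp = m[k][k]
    fn = sum(m[k]) - tp                    # real k, classified as something else
    fp = sum(row[k] for row in m) - tp     # classified k, really something else
    tn = total - tp - fn - fp
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    f_score = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    return {"sensitivity": sens, "ppv": ppv, "f_score": f_score, "specificity": spec}


def overall_accuracy(m: List[List[float]]) -> float:
    """Proportion of all samples on the diagonal of the matrix."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total


def cohens_kappa(m: List[List[float]]) -> float:
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(m)
    total = sum(sum(row) for row in m)
    p_observed = sum(m[i][i] for i in range(n)) / total
    p_expected = sum(sum(m[i]) * sum(row[i] for row in m) for i in range(n)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up two-class matrix, for illustration only
example = [[40, 10],
           [5, 45]]
metrics = per_class_metrics(example, 0)
```

The F-score is the harmonic mean of sensitivity and positive predictive value, so it falls toward the lower of the two, which is why an activity with a low positive predictive value also shows a low F-score.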

Results
The participants' characteristics are described in Table 1.

Phase II-Validating the Performance of the Classification Model in CE
The confusion matrix presented in Table 2 shows the real and classified time (s) spent performing activities in CE. The classification model had an overall accuracy of 94±4% across lying down, dynamic standing, sitting, walking and running. The sensitivity was higher than 90% for all activities except dynamic standing, which had the lowest classification results (88±18%). The positive predictive value was 95±8% for lying down, 95±8% for dynamic standing, 98±3% for sitting, 66±8% for walking, and 88±18% for running. The F-score demonstrated a high overall performance for lying down, dynamic standing, sitting and running, with the lowest value for walking (76±16%). The high specificity (higher than 99% for most of the activities) suggested that the classification model can accurately detect a specific activity with limited false-positive values. Finally, the association between the real activities and the classification, measured with Cohen's Kappa Coefficient, indicated that the classification model developed in CE agrees closely with the real activities (0.93±0.004).

Phase II-Validating the Performance of the Classification Model in UE
The confusion matrix presented in Table 3 shows the real and classified time (s) spent performing activities in UE. The classification model had an overall accuracy of 90±4% and a sensitivity that varied between 64 and 95%. Of all activities, lying down and sitting had the highest sensitivity. The positive predictive values were 85±9% for lying down, 76±12% for dynamic standing, 85±6% for sitting, 56±21% for walking and 88±18% for running. Since walking had a low sensitivity and the lowest positive predictive value, it had an F-Score of 62±18%. The high specificity (between 87 and 100%) and a Cohen's Kappa Coefficient of 0.85±0.001 suggested, respectively, that the classification model had a low false-positive rate and that there was a high degree of agreement between the real activities and the classification.

Phase III-Validating the Performance of the Classification Model with Triaxial Accelerometer when using six or eight Features
Table 4 presents results of the performance of the triaxial accelerometer when using six or eight features in CE and UE. The difference between both overall accuracies revealed no significant difference in CE (p = 0.81) and UE (p = 1.0).
Specificity (%) (table row): 88±4, 99±2, 99±2, 99±2, 99±1

Phase IV-Investigating the Reliability of the Results Obtained from the Classification Model using six Features
The analyses of the reliability of the results obtained from the classification model showed no significant differences for the overall accuracy in CE (p = 0.056) or UE (p = 0.447). The results confirmed with 95% confidence that the difference in the overall accuracies was 6.0% with a maximum error of 6.3% in CE. Similarly, the analyses revealed with 95% confidence that the difference in the overall accuracies was 1.6% with a maximum error of 4.3% in UE.
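The Welch-corrected comparison of two sets of overall accuracies can be sketched as below. This is an illustration with made-up numbers, not a reproduction of the GraphPad Prism analysis; the function name is hypothetical. The p-value and the 95% margin of error additionally require a quantile of the t distribution, which the Python standard library does not provide, so the sketch stops at the test statistic and the degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike the pooled two-sample t-test, the variances of the two groups
    are not assumed to be equal.
    """
    va = variance(a) / len(a)   # squared standard error of the first mean
    vb = variance(b) / len(b)   # squared standard error of the second mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up accuracy-like values for two groups, for illustration only
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The margin of error quoted with a difference of means is the critical t value at these degrees of freedom multiplied by sqrt(va + vb), so a margin larger than the difference itself corresponds to a non-significant result.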

Discussion
To our knowledge, this is the first study to validate a classification model to determine the time spent performing activities in UE for a period of 24 hours. Collectively, these results indicate the relatively high performance of the classification model in CE and UE. Furthermore, the present findings demonstrate that including eight features vs. six features does not increase the performance of the classification model, at least when investigating the five categories of activities presented in this paper. Finally, the results obtained from the classification model showed a high level of reliability when using six features in both CE and UE.

Table 3 (summary rows and notes). Overall accuracy: 90±4%; Kappa: 0.85±0.00; linear weighted Kappa: 0.81±0.00; quadratic weighted Kappa: 0.77±0.00. (a) Overall accuracy is the mean proportion of all activities that are correctly classified per person. (b) Sensitivity corresponds to the chances of classifying an activity as positive when it is indeed positive. (c) F-Score is defined as the "harmonic mean between sensitivity and positive predictive values" [18]. (d) Specificity is a measure of chances of classifying an activity as negative when they are truly negative. (e) Cohen's Kappa is the measure of the agreement between the real activity and the classifications. (f) Only three participants had practiced this activity. doi:10.1371/journal.pone.0128299.t003
that was done every 2 min in the CE protocol. Nevertheless, when dynamic standing was combined with walking, the sensitivity increased to 98±4%. The number of accelerometers used could also explain slight differences between studies. The classification model shows better accuracy for sitting and dynamic standing as did other studies that used multiple sensors [7,9,21] compared to studies that only used one accelerometer [4,20,22]. Our results as well as those from other studies [20,23] emphasize the importance of using at least two sensors to improve the classification accuracy of sitting and dynamic standing. This is particularly important because these activities constitute a large proportion of daily activity in a modern environment [24].

Phase II-Validating the Performance of the Classification Model in UE
Based on Foester's research, a reduction in the overall accuracy of the classification model would have been expected in UE (66.7%) compared to CE (95.8%) (nine activities) [9]. In the present study, the overall accuracy obtained from the 24 hours of participants' annotations was only 4% lower than in CE, which is a smaller reduction than what has been observed by others [8,14]. A closer inspection of our data revealed that the sensitivities for dynamic standing, walking, and running were the lowest. Nevertheless, after combining dynamic standing and walking, the sensitivity improved to 93±5%. These values are slightly better than those reported by Ermes et al. (2008) for 4 hours of testing when four out of nine activities were annotated by the participants. It is also important to note that the proportion of time was 37% or 8.9 hours for lying down, 12% or 2.9 hours for dynamic standing, 45% or 10.8 hours for sitting, 5% or 1.2 hours for walking and 0.05% or 0.01 hours for running. Although the proportion of time spent lying down and sitting (82% or 19.7 hours) is high, it is consistent with the percentage of time spent in sedentary behaviors (i.e., lying down and sitting) generally observed in the population [24].
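Combining two activities, as done above for dynamic standing and walking, corresponds to adding the matching rows and columns of the confusion matrix before recomputing sensitivity. The sketch below uses a made-up three-class matrix and hypothetical function names; it is not the procedure of the Activity Recognition software.

```python
from typing import List


def merge_classes(matrix: List[List[float]], keep: int, absorb: int) -> List[List[float]]:
    """Merge class `absorb` into class `keep` (rows are real, columns are classified).

    The returned matrix has one class fewer; the remaining classes keep their
    original relative order.
    """
    order = [k for k in range(len(matrix)) if k != absorb]
    rows = [list(r) for r in matrix]
    # Real labels: add the absorbed row to the kept row
    rows[keep] = [a + b for a, b in zip(rows[keep], rows[absorb])]
    merged = []
    for k in order:
        row = rows[k]
        # Classified labels: add the absorbed column to the kept column
        row[keep] += row[absorb]
        merged.append([row[c] for c in order])
    return merged


def sensitivity(matrix: List[List[float]], k: int) -> float:
    """Correctly classified time for class k divided by the real time in class k."""
    return matrix[k][k] / sum(matrix[k])


# Made-up matrix: class 0 and class 1 are often confused with each other
example = [[80, 15, 5],
           [20, 70, 10],
           [0, 5, 95]]
merged = merge_classes(example, keep=0, absorb=1)
```

In this example the separate sensitivities are 0.80 and 0.70, while the merged class reaches 0.925, because the confusions between the two merged activities no longer count as errors.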

Phase III-Validating the Performance of the Classification Model with Triaxial Accelerometer when using six or eight Features
It would seem logical that adding the z acceleration axes should lead to a better activity classification. However, this was not the case, since no significant differences were noted between the overall accuracy when including or excluding the z axis of both accelerometers. It should be noted that the step count was part of both feature sets. In addition, the activities analyzed were mostly performed in the x and y acceleration axes, which makes the inclusion of a third axis (z axis) rather unnecessary. We can thus conclude from our data that training the classification model with eight features rather than six does not improve classification accuracy under the conditions described in this study.

Phase IV-Investigating the Reliability of the Results Obtained from the Classification Model using six Features
Our results suggest that the classification model obtained in CE and in UE is reproducible. Indeed, the maximum error was 6.3% in CE and 4.3% in UE. A small difference between the internal clocks of both accelerometers and the researcher's watch could have increased the variability across sessions in CE. Similarly, the maximum error in UE can be explained by the small difference between the internal clocks of both accelerometers and the participant's watch.
The complexity and inconvenience of describing movement second-by-second may have been associated with lower annotation compliance by the participants and thus may have led to a certain degree of under-reporting that could have also reduced the reliability of the classification model.

Limitations
Even if the classification model presented and discussed could be considered to have good classification accuracy in both CE and UE, several confounding factors should be considered. Only 2 min of each activity were used to construct the classification model, and the transitions between static and dynamic movements were not taken into account. In addition, even with pre-determined training activities classified as variants of lying down, dynamic standing, sitting, walking, and running, more variations of these activities exist and are likely adopted in a real life setting. In this regard, it is important to note that this study initially included stair ascending and descending as well as biking. Because the preliminary validation of the performance of the classification model obtained with biaxial accelerometers in CE gave us a low sensitivity for these activities (i.e., 37% for climbing stairs and 74% for biking), they were not included in the classification model nor were they further investigated. Firstly, the protocol used to measure stair climbing included 2 min of ascending and descending stairs. Since both patterns are different, the method used was not specific enough for a good classification. Secondly, the sampling interval of 5 s for the accelerations and 1 min for the step count did not provide a sampling frequency high enough to measure biking. It could be hypothesized that more time spent doing the activity and a higher sampling frequency would have been helpful in this case. The decision to maintain the sampling frequency was mostly informed by the fact that a higher sampling frequency would have overwhelmed the storage capacities of the devices over longer sampling periods under real life conditions. Finally, the use of another accelerometer and/or a GPS could have helped to measure biking [25].
The results of the present study highlight the high accuracy and reproducibility of both classification models in CE and UE. To the best of our knowledge, no study has investigated and validated several activities under unrestricted conditions for a period longer than 24 hours. The main reason for this is that the internal memory capacity of sensors is limited and quickly fills up when data are sampled at high frequency. In this study we show that our model, which was developed using a lower sampling frequency, has comparable validity to previously published work as far as activity recognition is concerned, with the major advantage of being useful for the measurement of several activities that make up a great proportion of daily life over a much longer duration (up to 7 days). This study also shows that activity recognition models including either six or eight features (i.e., biaxial vs. triaxial accelerometers, respectively) are not different in terms of their performance, at least when investigating the five categories of activities presented in this paper. Future research in this area is needed to develop classification models that are more sensitive to activities such as biking and stair climbing, as well as to transitions from one activity to another.

Conclusions
The classification model developed in this study was shown to be accurate and reliable over 24 hours in UE. Our results show no significant benefit of using eight compared to six features to determine the time spent performing five activities as far as the present classification model is concerned. The study highlights the potential use of this classification model in applied research aimed at investigating the time spent performing activities.

Author Contributions
Conceived and designed the experiments: MER FR ED. Performed the experiments: MER. Analyzed the data: MER FR GL. Contributed reagents/materials/analysis tools: MER FR GL. Wrote the paper: MER FR GL ED.