
KDTMD: Knowledge distillation for transportation mode detection based on KAN

  • Rui Li,

    Roles Data curation, Investigation, Methodology, Project administration, Resources, Software, Writing – original draft

    Affiliation Zhejiang Technical Institute of Economics, Hangzhou, Zhejiang, China

  • Xueyi Song ,

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Writing – original draft

    songtbn@gmail.com

    Affiliation China Construction Civil Engineering Co. Ltd., Beijing, China

  • Yongliang Xie

    Roles Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Zhejiang Technical Institute of Economics, Hangzhou, Zhejiang, China

Abstract

With the progress in sensor technology and the spread of mobile devices, transportation mode detection (TMD) is gaining importance for health and urban traffic improvements. As mobile devices become more lightweight, they require more efficient, low-power models to handle limited resources effectively. Despite extensive research on TMD, challenges remain in capturing non-stationary temporal dynamics and in nonlinear fitting capability. Additionally, many existing models exhibit high space complexity, making lightweight deployment on devices with limited computing and memory resources difficult. To address these issues, we propose a novel deep TMD model based on discrete wavelet transform (DWT) and knowledge distillation (KD), called KDTMD. This model consists of two main modules, i.e., DWT and KD. For the DWT module, since non-stationary time variations and event distribution shifts complicate sensor time series analysis, we use DWT to disentangle the sensor time series into two parts: a low-frequency part that indicates the trend and a high-frequency part that captures events. The separated trend data is less influenced by event distribution shifts, effectively mitigating the impact of non-stationary time variations. The KD module includes a teacher model and a student model. Specifically, for the teacher model, to address nonlinearity and interpretability, we incorporate T-KAN, which is composed of multiple layers of linear KAN that employ learnable B-spline functions to achieve a richer feature representation with fewer parameters. For the student model, we develop S-CNN, which is trained efficiently by T-KAN through KD. The KDTMD model achieves 97.27% accuracy and 97.29% F1-Score on the SHL dataset, and 96.56% accuracy and 96.72% F1-Score on the HTC dataset. Additionally, the KDTMD model has only about 10% of the parameters of the smallest baseline.

Introduction

In contemporary society, smartphones, smartwatches, and other wearable devices, equipped with advanced sensors for efficient user data collection, have become widespread. This capability has significantly advanced the development of Human Activity Recognition (HAR) technology [1]. A key application of HAR is Transportation Mode Detection (TMD), which identifies movement patterns through sensor data analysis. By analyzing movement patterns in real time, TMD technology enhances traffic prediction [2], logistics route optimization [3], carbon footprint estimation [4], and other intelligent services. TMD technology has advanced from GPS-based approaches [5–8] to multi-sensor integration [9–12], and now utilizes deep learning and AI [13–16] for improved efficiency and accuracy.

In this paper, we aim to develop a lightweight model for TMD that can be directly deployed on mobile devices such as smartphones. The model is designed to accurately and swiftly determine which of the following eight transportation modes the user is in: stationary, walking, running, cycling, driving, taking a bus, taking a train, and traveling by subway. Achieving this goal is difficult, as several key challenges remain to be addressed:

  • Non-stationary temporal dynamics. Understanding and capturing the non-stationary temporal changes and dynamic spatial correlations in TMD time series is highly challenging. This is because the data we collect from sensors are usually an entanglement of a stable trend sequence and a fluctuating event sequence (as shown in Fig 1, the raw x-axis signal data collected from the gyroscope sensor of a smartphone while riding a bike includes both a stable trend sequence and a fluctuating event sequence, with a sampling frequency of 100 Hz over a duration of 5 s), where the fluctuating events frequently undergo distribution shifts. Common solutions for TMD generally feed sensor time series directly into the network. Methods that do not distinguish between these two sequences struggle to avoid distribution shifts, making reasonable predictions difficult.
  • Non-linear fitting capability. Transportation data often exhibit complex nonlinear relationships, necessitating models that can capture this complexity. Traditional methods typically have notable shortcomings when it comes to fitting nonlinear functions. These shortcomings include a large number of parameters and poor interpretability. Additionally, these methods often struggle with high-dimensional data and may not be very effective in TMD problems due to their limited expressive capabilities [16].
  • Space complexity for lightweight application. The remarkable success of deep learning largely stems from its ability to handle vast amounts of data and manage complex models with high computational demands. However, for TMD, which requires processing large volumes of sensor data, deploying these complex traditional deep learning models on resource-constrained devices such as mobile phones and embedded systems poses significant challenges due to their high computational demands and substantial storage needs. Traditional deep learning models often struggle to reduce this complexity while maintaining accuracy, especially when dealing with large-scale datasets [17].
Fig 1. Raw x-axis signal data from a gyroscope sensor while riding a bike, showing stable trend and fluctuating event sequences (sampling frequency: 100 Hz, duration: 5s).

https://doi.org/10.1371/journal.pone.0324752.g001

To address the aforementioned challenges, this paper proposes a novel model based on Discrete Wavelet Transform (DWT) and Knowledge Distillation (KD), referred to as the KDTMD model for TMD. In our model, to capture the complex temporal relationships from short-term fluctuations to long-term trends, we first apply DWT to disentangle the sensor time series into a stable trend sequence and a fluctuating event sequence. To reduce space complexity, we utilize the KD technique, which consists of two components: the teacher model, T-KAN, and the student model, S-CNN. For the teacher model, T-KAN, to capture richer feature representations while reducing the space complexity, we employ Kolmogorov-Arnold Networks (KAN) by replacing all MLP layers with learnable B-spline functions. This allows the model to maintain high expressiveness with fewer parameters. For the student model, S-CNN, to maintain a lightweight structure with fewer parameters, we utilize a simplified Convolutional Neural Network (CNN) architecture. This ensures that the model remains efficient and suitable for deployment on resource-constrained devices.

  • To address non-stationary time variations and event distribution shifts, we decompose the signal into high and low frequency components. This enables more effective identification and analysis of periodic and trend changes in transportation modes. By separating these components, we reduce the impact of event distribution shifts on trend data, thereby reducing the effects of non-stationary time variations.
  • To enhance the model’s non-linear fitting ability and interpretability, we use KAN. Based on the Kolmogorov-Arnold representation theorem, KAN represents multivariate continuous functions as combinations of univariate functions and additions [18]. Unlike traditional neural networks, KAN has learnable activation functions on edges, typically splines, which replace weight parameters. This design boosts flexibility, reduces parameters, and improves interpretability.
  • To reduce time and space complexity, we use knowledge distillation. This involves training a small student model under the guidance of a large teacher model, transferring the teacher’s knowledge to the student [19]. The student model can learn the behavior and decisions of the teacher model without having the same number of parameters, resulting in model compression and speedup. Through this, we lower the model’s complexity while retaining near-original performance. In TMD, this means that smaller and more efficient models can be used to handle large-scale datasets without significantly compromising accuracy, thus improving the utility and scalability of the model.

Related work

TMD technology has advanced from GPS-based approaches to multi-sensor integration and machine learning, and now to deep learning. This progression has significantly enhanced recognition accuracy and efficiency while laying the groundwork for intelligent transportation systems (Table 1).

GPS-based

Initial TMD techniques primarily used GPS data due to its global coverage and precision. Gong et al. [6] developed a GIS algorithm for processing GPS trip data, achieving an 82.6% success rate in identifying travel modes in NYC. Li et al. [7] combined GPS and GIS data with random forest algorithms for mode identification, while Zheng et al. [21] proposed supervised learning methods to identify transportation modes from raw GPS logs. Despite these advances, GPS-based methods face limitations from signal occlusion and high energy consumption.

Multi-sensor integration with machine learning

The integration of additional sensors like accelerometers and gyroscopes represented a significant advancement in TMD development. Feng et al. [24] demonstrated that accelerometer-only methods outperformed GPS-only ones, with the combination of both yielding the highest accuracy. Machine learning algorithms, including random forests and XGBoost [911], were employed to process these multi-source data. While this approach enhanced recognition capabilities, it remained limited by manual feature extraction, which is time-consuming and subjective.

Multi-sensor integration with deep learning

Recent years have seen deep learning significantly advance TMD through automatic feature extraction. CNNs [2831], RNNs [36], and Transformers [35] have been applied with varying success. Many other deep learning methods have also been applied, for example, Wang [33] introduced T2Trans using temporal convolutional networks, while Asci et al. [34] employed LSTM for TMD. Despite these advances, existing deep learning models often struggle with complex temporal analysis and maintaining a balance between model complexity and performance.

Technology has transformed TMD, allowing systems to manage intricate data and achieve higher accuracy. However, developing a model that can perform complex temporal analysis, offer high accuracy, and maintain a slim profile remains a difficult task.

Methodology

Overview

In this paper, we propose a lightweight transportation mode detection framework, KDTMD, based on Discrete Wavelet Transform (DWT) and Knowledge Distillation (KD). (We have published our proposed KDTMD algorithm at https://github.com/RuiLi221/KDTMD.) The objective is to create a lightweight model with a low number of trainable parameters while maintaining high efficiency. The KDTMD model is designed to be easily deployable on smart wearable devices with limited computing power and storage capacity. As illustrated in Fig 2, the KDTMD model primarily consists of two parts: (i) DWT for capturing non-stationary temporal changes and spatial dynamics, and (ii) KD for reducing time-space complexity and keeping the model lightweight. The framework includes two main components: (a) a teacher model composed of KAN layers, and (b) a student model composed of convolutional layers. Specifically, the sensor data X ∈ ℝ^(T×F), where T denotes the length of the sliding window and F represents the number of sensor elements, is initially fed into the DWT. This process disentangles the sensor time series into event (high-frequency) representations and trend (low-frequency) representations. These two disentangled signals are then simultaneously fed into both the teacher and student models in the KD module. The goal of this module is to transfer the generalization ability of a complex teacher model to a smaller student model. Our proposed teacher model, based on the KAN, effectively captures the complex nonlinear relationships and dynamic changes in transportation modes, thereby enhancing recognition capabilities. When the high-frequency and low-frequency signals, along with their corresponding spatiotemporal features, are input into the KAN-based teacher model, it uses multilayer linear KAN with weight parameters in the form of spline functions to efficiently extract the relevant spatiotemporal features.
In the student model, we use a Convolutional Neural Network (CNN) to process the high-frequency and low-frequency signals. This CNN efficiently extracts the corresponding spatiotemporal features through its local receptive fields and weight-sharing mechanisms. This approach not only reduces the number of parameters but also enhances the model's ability to learn spatiotemporal hierarchies. In summary, the KDTMD framework leverages the strengths of DWT and KD to create a lightweight, efficient model suitable for deployment on devices with limited resources, while maintaining high accuracy in TMD (Fig 3).

DWT

The Bayesian structural time series model [37] suggests that traffic time series consist of a stable long-term trend and volatile events, which are independent of each other. This independence, based on the independent mechanisms assumption [38], implies that when one component of the traffic time series changes due to distribution shifts, the other can remain constant. Building on this concept, we enhance model adaptability to non-stationary temporal changes by separating traffic time series into distinct components. By integrating Discrete Wavelet Transform (DWT) into our framework, we effectively disentangle traffic time series into more manageable elements.

In a two-level DWT (as shown in Fig 4), the input signal is decomposed into a low-frequency component that captures the trend and two high-frequency components that capture events. Here, g and h denote the low-pass and high-pass filters of the wavelet. For a traffic time series, multi-level wavelet transforms with these filters can extract a smooth low-frequency trend and multiple high-frequency event components. The DWT on the input traffic data X can be formulated as follows (* denotes convolution, and ↓2 indicates downsampling of the output by 2):

X_h^(1) = (X * h) ↓2 (1)
X_h^(2) = ((X * g) ↓2 * h) ↓2 (2)
X_l^(2) = ((X * g) ↓2 * g) ↓2 (3)

After the DWT, the low- and high-frequency components have fewer time steps due to downsampling. To match the input length and return the frequency-domain data to the time domain, we apply upsampling and the inverse DWT (IDWT) with the corresponding inverse filters. We also sum all inverse-transformed high-frequency components into a single event component, retaining the non-stationary information without adding many channels.

We adopt a fully connected module after the IDWT. This design not only avoids the information loss caused by discarding high-frequency components, but also prevents the increase in computational load caused by processing all high-frequency components in parallel. This module converts trends and events into high-dimensional representations, thereby enhancing the expressiveness of the subsequent spatio-temporal network. The formulations for the IDWT and fully-connected module are expressed as Eqs 4 and 5 (W_t, b_t, W_e, and b_e are learnable parameters):

X_trend = IDWT_g(X_l^(2)), X_event = IDWT_h(X_h^(1)) + IDWT_h(X_h^(2)) (4)
H_trend = W_t X_trend + b_t, H_event = W_e X_event + b_e (5)

Following the disentangling flow layer, we extract the separated trend and event representations from the traffic data, thereby reducing the impact of non-stationary time variations. Through experimentation, we select the most suitable wavelet from widely used ones to decompose the sensor time series effectively. For details on the selection process, see the “Effect of different hyperparameters” section in Experiments, which explores various wavelet basis functions in DWT.
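The paper later notes that the transform is implemented with the PyWavelets library. As a self-contained illustration of the decomposition and reconstruction described above, the sketch below hand-codes a two-level Haar DWT; the Haar filter choice and the function names are ours, not necessarily the wavelet the authors selected.

```python
import math

def haar_dwt(x):
    """One DWT level with Haar filters: convolve with the low-pass (g)
    and high-pass (h) filters, then downsample by 2.
    Returns (approximation/trend, detail/event)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) * s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) * s for i in range(len(x) // 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse Haar DWT: upsample and filter to recover the signal."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) * s)
        x.append((a - d) * s)
    return x

# Two-level decomposition: a2 is the low-frequency trend,
# d1 and d2 are the high-frequency event components.
signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
a1, d1 = haar_dwt(signal)
a2, d2 = haar_dwt(a1)

# Perfect reconstruction back to the time domain (the IDWT step).
recon = haar_idwt(haar_idwt(a2, d2), d1)
```

In the framework, the reconstructed low-frequency branch plays the role of the trend, and the reconstructed detail branches are summed into the event component before the fully connected module.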

Knowledge distillation

In the knowledge distillation module, we design the teacher model with adequate trainable parameters and the lightweight student model with limited parameters. On one hand, the student model can learn as much knowledge as possible from the teacher model, thereby imitating the teacher model’s predictive capabilities. On the other hand, the student model can refer to the input label values for learning, correcting any potential erroneous knowledge it may have learned [19].

KAN.

KAN leverages the Kolmogorov-Arnold theorem, which provides a method for expressing continuous multivariable functions as a sum of single-variable functions [18]. As shown in Eq 6, for a smooth function f : [0, 1]^n → ℝ:

f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} Φ_q ( Σ_{p=1}^{n} φ_{q,p}(x_p) ) (6)

where each φ_{q,p} maps [0, 1] to ℝ and each Φ_q maps ℝ to ℝ. Unlike MLPs, which use fixed activation functions on their nodes, KANs feature learnable activation functions on their edges. In KANs there are no linear weights; every weight parameter is replaced by a univariate function parametrized as a spline, as shown in Fig 3a. The functions computed on the edges are represented as B-splines, with the spline parameters being the learnable parameters of the network. Compared to MLPs, KANs provide a more nuanced way of capturing complex patterns in data and are better at modeling complex real-valued functions, thus better mimicking the way information is encoded in sensors. This design enhances the model's ability to express intricate features, making it more flexible and adaptable to diverse data distributions. Additionally, KANs improve interpretability by allowing intuitive visualization of the learned functions, making the model's decision-making process more transparent. For TMD problems, these univariate functions can automatically adjust their coefficients based on different characteristics of the input data, such as gyroscope, acceleration, and magnetic inputs. This adaptability allows the model to handle various data patterns, providing more nuanced feature expressions for training and enhancing the model's accuracy and interpretability.
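A KAN edge thus replaces a scalar weight with a learnable univariate function expressed in a B-spline basis. The sketch below is our own illustration (not the authors' implementation): it evaluates such an edge function as a weighted sum of Cox-de Boor basis functions, where the coefficients `coeffs` would be the trainable parameters.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value of the i-th degree-k B-spline basis at t."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((t - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

def kan_edge(t, coeffs, knots, degree=3):
    """Learnable activation on a KAN edge: phi(t) = sum_i c_i * B_i(t)."""
    return sum(c * bspline_basis(i, degree, t, knots)
               for i, c in enumerate(coeffs))

# Uniform knot grid padded beyond [0, 1]; n_basis = len(knots) - degree - 1.
grid = [i / 10 for i in range(-3, 14)]            # 17 knots -> 13 cubic bases
coeffs = [0.1 * i for i in range(len(grid) - 4)]  # trainable spline parameters
y = kan_edge(0.5, coeffs, grid)
```

During training, a gradient step would adjust `coeffs` rather than a single weight, which is what gives the edge its richer, locally adaptable shape.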

Teacher model.

As shown in Fig 3b, we adopt an efficient teacher model, T-KAN, which consists of multiple layers of linear KAN. Specifically, our proposed T-KAN includes modules for processing high-frequency signals, low-frequency signals, fusion, and classification. The high-frequency signal module is composed of a single-channel KAN layer, a multi-channel KAN layer, and a dropout layer; the low-frequency signal module has the same structure. When the high-frequency and low-frequency signals are fed into their respective processing modules, they first pass through the single-channel KAN layer, which extracts the univariate spatiotemporal features of the high-frequency and low-frequency signals (as shown in Table 2). In the model, the number of units for all KAN layers is set to 8, the grid size is set to 10, and the dropout rate is set to 0.2. The time dimension of the input is T = 500 and the feature dimension is D = 1.

Table 2. Variable names and descriptions for univariate and integrated spatiotemporal features.

https://doi.org/10.1371/journal.pone.0324752.t002

The univariate spatiotemporal features for each sensor are then passed through the multi-channel KAN layer for further feature learning, producing the integrated spatiotemporal features of the high-frequency and low-frequency signals (shown in Table 2).

All of the integrated high-frequency and low-frequency spatiotemporal features above are then fused in the merge layer to form a feature map with Tm = 500 and Dm = 32. These features are then transformed by the linear KAN layer into the final feature vector with Df = 8. Finally, the softmax layer is used to classify the corresponding transportation mode.

Student model.

In this section, we introduce a streamlined student model called S-CNN (shown in Fig 3c) and enhance its training efficiency through knowledge distillation. Like the teacher model, the student model includes modules for processing high-frequency signals, low-frequency signals, fusion, and classification. However, compared to the teacher model, the student model has a simpler structure and fewer parameters: the number of filters in all CNN layers is set to 3, the kernel size is set to 3, and the number of units in the MLP layer is set to 8. The high-frequency signal module of the student model is composed of a single-channel CNN layer and a multi-channel CNN layer; the low-frequency signal module adopts the same structure. To facilitate the subsequent knowledge distillation task, the student model and the teacher model share the same input, i.e., the student model is also fed the event and trend representations, which are then processed through the single-channel and multi-channel CNN layers to form the integrated high-frequency and low-frequency spatiotemporal features.

All of the integrated high-frequency and low-frequency spatiotemporal features above are then fused in the merge layer to form a feature map. These features are then transformed by the MLP layer into the final feature vector. Finally, the softmax layer is used to classify the corresponding transportation mode.

Distillation training.

Within the knowledge distillation framework, we define both the complex teacher model with better generalization ability and the lightweight student model with limited parameters. On one hand, the student model can learn as much knowledge as possible from the teacher model, thereby imitating the teacher model's predictive capabilities. On the other hand, the student model can refer to the input label values for learning, correcting any potential erroneous knowledge it may have learned. Specifically, we use a softmax output layer to generate class probabilities, a common practice in neural networks. Unlike the ordinary softmax, we introduce the concept of temperature to better express the latent importance of each predicted value. The specific implementation is as follows: first, we obtain the logits z (teacher) and v (student) from the output of the last fully connected layer of each model. We then define the soft targets p_i and q_i (shown in Eqs 7 and 8), which represent the probability that the sample belongs to the i-th category for the teacher and student, respectively.

p_i = exp(z_i / T) / Σ_j exp(z_j / T) (7)
q_i = exp(v_i / T) / Σ_j exp(v_j / T) (8)

The final loss of the entire model is composed of two parts. One part is the distillation loss L_distill, which represents the loss of knowledge transferred from the teacher to the student at temperature T; it is defined as the cross-entropy loss determined by the teacher's and student's soft targets (as shown in Eq 9). The other part is the student loss L_student, which represents the loss of knowledge learned by the student from the input labels at temperature 1 (as shown in Eq 10). The final loss of the model combines these two parts (as shown in Eq 11), where α is the distillation factor.

L_distill = − Σ_i p_i log q_i (9)
L_student = − Σ_i y_i log q_i|_(T=1) (10)
L = α L_distill + (1 − α) L_student (11)
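The temperature-softened targets and the combined loss of Eqs 7-11 can be sketched in plain Python as follows. This is our illustrative reading of the text (the authors implement it as a Keras Distiller); the variable names and the exact α-weighting are assumptions.

```python
import math

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax (Eqs 7-8): higher T gives softer targets."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def kd_loss(teacher_logits, student_logits, one_hot_label, T=4.0, alpha=0.5):
    """Total distillation loss: alpha * distillation + (1 - alpha) * student."""
    soft_teacher = softmax_T(teacher_logits, T)   # Eq 7
    soft_student = softmax_T(student_logits, T)   # Eq 8
    l_distill = cross_entropy(soft_teacher, soft_student)                      # Eq 9
    l_student = cross_entropy(one_hot_label, softmax_T(student_logits, 1.0))   # Eq 10
    return alpha * l_distill + (1.0 - alpha) * l_student                       # Eq 11

loss = kd_loss([4.0, 1.0, 0.5], [3.0, 1.5, 0.2], [1.0, 0.0, 0.0])
```

Raising T flattens both distributions, so the student is pushed to match the teacher's relative ranking of wrong classes, not just its top prediction.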

The output

Finally, based on the final feature vector X_f and the final loss L of the model, we classify the transportation mode and obtain the final output ŷ, formulated as:

ŷ = argmax softmax(X_f) (12)

We investigate the effectiveness of our KDTMD with the goal to answer the following research questions:

RQ1: Does KDTMD outperform other baselines?

RQ2: How do hyper-parameters (e.g., temperature, alpha) affect KDTMD?

RQ3: How do different components in KDTMD affect model performance?

RQ4: How does KDTMD perform in terms of computational and resource efficiency?

Experiments

Datasets

We assessed the effectiveness of the KDTMD (Knowledge Distillation for Transportation Mode Detection) algorithm using the SHL [39] and HTC [40] real-world datasets. To optimize the model for edge devices, we only utilized low-consumption sensors: the gyroscope (gyr: x, y, z), linear accelerometer (lacc: x, y, z), and magnetometer (mag: x, y, z). For the SHL dataset, we also included barometric pressure (pre, in hPa). The data preprocessing steps included normalization and segmentation. Subsequently, the data were split into training, validation, and test sets at a ratio of 70%, 20%, and 10%, respectively.

SHL dataset.

The Sussex-Huawei Locomotion-Transportation (SHL) dataset, collected over 7 months in the UK and totaling approximately 753 hours, captures various real-life transportation modes such as standing, walking, running, cycling, driving, taking the bus, train, or subway. It includes 3-axis accelerometers, gyroscopes, linear accelerometers, magnetometers, orientation sensors, and a barometer, all sampled at 100 Hz. We utilized a subset of this dataset to evaluate our algorithm, totaling approximately 390 hours. The durations of different transportation modes in the SHL dataset are shown in Table 3.

Table 3. Durations of different transportation modes in SHL dataset.

https://doi.org/10.1371/journal.pone.0324752.t003

HTC dataset.

The HTC dataset, collected from 150 HTC smartphone users, contains 100GB of data across 8,311 hours of various activities recorded at 100Hz from accelerometers, gyroscopes, and magnetometers. The dataset was gathered through two primary avenues: a university program involving 150 participating students and a group of 74 employees and interns. While it initially included activities like motorcycle riding and high-speed rail travel, we removed these to align with the SHL dataset. This resulted in a substantial and consistent dataset for assessing the scalability of our KDTMD model. The durations of different transportation modes in the HTC dataset are shown in Table 4.

Table 4. Durations of different transportation modes in HTC dataset.

https://doi.org/10.1371/journal.pone.0324752.t004

Normalization.

To deal with the inconsistency of dimension and numerical range between data from different sensors, we performed Z-Score normalization on the individual components of the sensors. This step calibrates the data to a uniform scale benchmark, which can be expressed by the following formula (Eq 13):

x′ = (x − μ) / σ (13)

where μ represents the mean value of each component, and σ is the standard deviation of the corresponding component.
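As a minimal sketch, per-component Z-Score normalization (Eq 13) can be written as:

```python
import math

def zscore(channel):
    """Normalize one sensor component to zero mean and unit variance (Eq 13)."""
    mu = sum(channel) / len(channel)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in channel) / len(channel))
    return [(v - mu) / sigma for v in channel]

# Applied independently to each sensor axis (e.g., gyr-x, lacc-y, mag-z).
normalized = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
```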

Segmentation.

To enhance accuracy while maintaining manageable computational complexity and processing time, we employed a fixed-length sliding window technique to segment continuous sensor time series into shorter, more manageable pieces for feature analysis. Specifically, the SHL dataset used a window length of 500, whereas the HTC dataset employed a window length of 450. These window lengths were chosen to provide an appropriate scale of data for feature learning in both datasets.
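The fixed-length sliding-window segmentation can be sketched as follows. The window lengths (500 for SHL, 450 for HTC) come from the text; the stride is our assumption, since the paper does not state the overlap.

```python
def segment(series, window, stride):
    """Cut a continuous sensor time series into fixed-length windows."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

stream = list(range(2000))                          # stand-in for one sensor channel
windows = segment(stream, window=500, stride=500)   # non-overlapping windows (assumed)
```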

Baselines

  • RF: Random Forest (RF) enhances the ensemble’s prediction accuracy through the combination of multiple decision trees [41].
  • MLP: The Multilayer Perceptron (MLP) extracts and learns features from the data through multiple fully connected layers of neurons [16].
  • CNN: Convolutional Neural Networks (CNNs) capture spatial features in data through convolutional and pooling layers [42].
  • LSTM: Long Short-Term Memory (LSTM) networks maintain the temporal dependencies in sequence data through their gated architecture, effectively capturing long-term patterns and relationships [43].
  • T2Trans: T2Trans, founded on Temporal Convolutional Networks (TCNs), utilizes the properties of temporal convolution to bolster the precision of TMD [33].
  • CL: CL-TRANSMODE(CL) consists of three layers: data preprocessing, a CNN for feature extraction, and an LSTM network with dropout for enhanced learning [13].
  • MSRLSTM: MSRLSTM integrates residual and LSTM layers to extract features from sensor data, leveraging a residual unit to accelerate learning and an attention model to enhance recognition accuracy [16].
  • MSCPT: Multi-sensor cross-place transportation mode recognition algorithm (MSCPT) comprises three main components: a Multi-Sensor Neural Network model, a variant of the bootstrap ensemble learning method, and data augmentation strategies [42].
  • MLI: The Multimodal Learning Integrator (MLI) consists of a four-layer hierarchical neural network and Random Forest (RF) classifiers [44].

Metrics

We assessed model performance using four key metrics: accuracy, precision (Eq 14), recall (Eq 15), and F1 score (Eq 16). Accuracy reflects the model's overall ability to correctly classify instances; precision focuses on the accuracy of the model's positive predictions; recall measures the model's ability to identify all positive instances; and the F1 score is the harmonic mean of precision and recall, providing a comprehensive reflection of the model's classification performance:

Precision = (1 / Nm) Σ_{m=1}^{Nm} TP_m / (TP_m + FP_m) (14)
Recall = (1 / Nm) Σ_{m=1}^{Nm} TP_m / (TP_m + FN_m) (15)
F1 = 2 × Precision × Recall / (Precision + Recall) (16)

where Nm = 8 represents the number of transportation modes.
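The macro-averaged metrics of Eqs 14-16 can be computed from per-class counts as sketched below; the macro-averaging convention is our reading of the equations, in which each transportation mode contributes equally.

```python
def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall, and F1 (Eqs 14-16)."""
    precisions, recalls = [], []
    for m in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == m and p == m)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != m and p == m)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == m and p != m)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / n_classes
    recall = sum(recalls) / n_classes
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with two classes instead of the paper's Nm = 8.
p, r, f1 = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```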

Experimental Settings

We utilized the wavelet transform functions provided by the PyWavelets (pywt) library to perform DWT. Within the Keras deep learning framework, we defined a Distiller model and implemented the calculation of the total distillation loss, which includes both the distillation loss and the student loss. We employed the Adam optimizer with an initial learning rate of 0.001. The model was trained for 150 epochs with a batch size of 256, and the data input order was shuffled during training to mitigate overfitting. Additionally, we provided the configuration details and relevant software versions (shown in Table 5).

Results

Comparison with different baselines (RQ1).

As demonstrated in Figs 5, 6, Tables 6, 7, and 8, the results show that deep learning algorithms such as MLP, CNN, and LSTM generally outperform traditional machine learning methods. The accuracy of single deep learning models typically ranges from 75% to 85%, while ensemble methods like CL-TRANSMODE, MSRLSTM, and MLI achieve accuracy rates mostly above 85%, surpassing individual deep learning models. This indicates that deep learning algorithms possess superior feature extraction capabilities. Hybrid frameworks that combine multiple networks can leverage the strengths of each approach, effectively capturing the spatiotemporal characteristics of sensor data and resulting in higher predictive accuracy for TMD. Among these, the MSRLSTM algorithm achieved the highest accuracy compared to other deep learning methods. This improvement suggests that MSRLSTM enhances LSTM by incorporating residual units and attention mechanisms, leading to stronger feature representation and significantly better predictive accuracy than standard LSTM (78.03% on SHL and 76.73% on HTC).

Fig 5. The accuracy of different algorithms for TMD on SHL and HTC datasets.

https://doi.org/10.1371/journal.pone.0324752.g005

Fig 6. The confusion matrix of KDTMD model on SHL and HTC datasets.

https://doi.org/10.1371/journal.pone.0324752.g006

Table 6. F1-Score of different algorithms for TMD on SHL dataset.

https://doi.org/10.1371/journal.pone.0324752.t006

Table 7. F1-Score of different algorithms for TMD on HTC dataset.

https://doi.org/10.1371/journal.pone.0324752.t007

Table 8. The precision, recall, and F1-Scores of the KDTMD model on the SHL dataset.

https://doi.org/10.1371/journal.pone.0324752.t008

The proposed KDTMD model achieved an accuracy of 97.27% on the SHL dataset and 96.56% on the HTC dataset, significantly outperforming the MSRLSTM algorithm and other baseline results. This demonstrates the critical role of the introduced modules in enhancing TMD. Specifically, the DWT module decomposes traffic time series into a low-frequency trend component and a high-frequency event component, mitigating the effects of non-stationary variations. Within the knowledge distillation framework, our teacher model, T-KAN, employs linear KAN layers with learnable B-spline functions to efficiently capture rich features, enhancing the model’s nonlinear capabilities and interpretability. This teacher model guides a lightweight student model, S-CNN, through the distillation process, ensuring rapid and accurate traffic mode detection. Together, these components ensure high precision, swift predictions, and manageable model complexity.

The ROC curves in Fig 7 illustrate the performance of the KDTMD model on the SHL and HTC datasets. For the SHL dataset, the micro-average ROC curve achieves an area under the curve (AUC) of 0.98, indicating excellent overall classification performance, and the macro-average ROC curve attains an AUC of 0.97, reflecting consistent performance across all classes. Individual class performances are strong, with most classes achieving AUC values close to 0.99; even the lower-performing classes (e.g., class 6 and class 7) still demonstrate respectable AUC values of 0.95 and 0.93, respectively. On the HTC dataset, the model maintains a high level of performance, with both the micro-average and macro-average ROC curves achieving an AUC of 0.97, indicating robust performance across all classes. Most classes exhibit AUC values above 0.95, with several reaching near-perfect values of 0.99; even the lower-performing classes (e.g., class 0 and class 6) achieve AUC values of 0.96 and 0.94, respectively. Overall, the KDTMD model demonstrates strong classification capabilities on both datasets, with high AUC values across all classes, underscoring its effectiveness in multi-class classification tasks.

The Matthews Correlation Coefficient (MCC) values for the KDTMD model on the SHL and HTC datasets are both exceptionally high, at 0.9503 and 0.9449 respectively. These values indicate outstanding classification performance across both datasets. The consistency of these high MCC values across SHL and HTC demonstrates the model’s robustness and reliability in handling multi-class classification tasks, further underscoring its effectiveness in real-world applications.
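AUC and MCC figures of the kind reported above can be reproduced with scikit-learn; the helper below is a sketch (the function name and the one-vs-rest binarization strategy are our choices, not details from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef
from sklearn.preprocessing import label_binarize

def summarize_multiclass(y_true, y_score):
    """Micro-/macro-averaged one-vs-rest AUC plus MCC for a
    multi-class classifier with per-class probability scores."""
    classes = np.unique(y_true)
    y_bin = label_binarize(y_true, classes=classes)   # one indicator column per class
    return {
        "auc_micro": roc_auc_score(y_bin, y_score, average="micro"),
        "auc_macro": roc_auc_score(y_bin, y_score, average="macro"),
        "mcc": matthews_corrcoef(y_true, y_score.argmax(axis=1)),
    }
```

Binarizing the labels is needed because scikit-learn's multiclass mode of `roc_auc_score` does not support micro-averaging directly.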

Effect of different hyperparameters (RQ2).

To better understand the impact of each hyperparameter on the KDTMD model, we conducted tuning experiments, mainly adjusting the following core hyperparameters: (i) the wavelet basis function in DWT; (ii) the numbers of Units, Grid_size, and Spline_order in KAN for the teacher model; (iii) the number of filters in the convolutional layers of the student model; and (iv) the temperature in the distillation loss, the alpha in the total loss, and the learning rate in knowledge distillation.

(i) We tested six wavelet basis functions for the wave parameter: db1 (Daubechies 1), sym2 (Symlet 2), coif1 (Coiflet 1), bior1.1 (Biorthogonal 1.1), rbio1.1 (Reverse Biorthogonal 1.1), and haar (Haar), as shown in Fig 8. In our experiments, the sym2 wavelet basis function demonstrated superior performance in handling non-stationary time variations and event distribution shifts, achieving an accuracy rate of 97.27%. This indicates that the symmetry and higher vanishing moments of sym2 enable it to better capture trend changes in signals while reducing the impact of event distribution shifts. Therefore, we selected sym2 as the wavelet basis function for our final model.
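A single-level sym2 decomposition with PyWavelets illustrates the disentangling step; the toy signal and window length below are made up for illustration:

```python
import numpy as np
import pywt

def dwt_disentangle(signal, wavelet="sym2"):
    """One DWT level: approximation coefficients carry the low-frequency
    trend, detail coefficients the high-frequency events."""
    trend, events = pywt.dwt(signal, wavelet)
    return trend, events

# toy sensor trace: a slow oscillation plus a short spike (an "event")
t = np.linspace(0.0, 1.0, 128)
x = np.sin(2 * np.pi * t)
x[60:64] += 3.0
trend, events = dwt_disentangle(x)
```

`pywt.idwt(trend, events, "sym2")` reconstructs the window exactly, so nothing is lost by the split; the spike shows up almost entirely in the detail band while the sinusoidal trend dominates the approximation band.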

(ii) We conducted experiments to optimize the teacher KAN model by adjusting three key parameters: Units, Grid_size, and Spline_order (as shown in Fig 9).

Fig 9. Different numbers of Units, Grid_size and Spline_order in KAN for teacher.

https://doi.org/10.1371/journal.pone.0324752.g009

  • For Units, accuracy improved with increasing units, but the parameter count also rose. The teacher model achieved 98.15% accuracy at units=8. Beyond this, accuracy gains were minimal despite a significant parameter increase (e.g., the parameter count at units=15 was approximately double that of units=8). Thus, units=8 was chosen for balancing accuracy and complexity.
  • Regarding grid_size, accuracy showed a consistent upward trend with increasing grid_size, while the parameter count remained stable. The optimal accuracy was achieved at grid_size=10, beyond which no further accuracy enhancement was observed.
  • In KAN, the number of B-spline basis functions equals the sum of grid_size and spline_order. We fixed grid_size at 10 and varied spline_order from 1 to 5, giving 11 to 15 basis functions, which let us observe how the basis count affects the teacher model’s accuracy. Accuracy generally improved at higher spline_order values, at the cost of a modest increase in parameters, but began to decline once spline_order exceeded 3. Therefore, spline_order=3, corresponding to 13 B-spline basis functions, offers the best trade-off between accuracy and parameter efficiency.
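The basis-count arithmetic above, plus a rough per-layer parameter estimate, can be written down directly. The +1 base (residual) weight per edge follows common KAN implementations and is an assumption, not a detail from the paper:

```python
def kan_layer_size(in_dim, out_dim, grid_size=10, spline_order=3, base_weight=True):
    """Back-of-the-envelope learnable-parameter count for one linear KAN layer.
    Each of the in_dim * out_dim edges carries (grid_size + spline_order)
    B-spline coefficients; many implementations add one base weight per edge.
    This is an estimate, not the exact formula of any specific KAN library."""
    n_basis = grid_size + spline_order          # as described in the text
    per_edge = n_basis + (1 if base_weight else 0)
    return n_basis, in_dim * out_dim * per_edge

# grid_size=10, spline_order=3 -> 13 basis functions per edge
n_basis, params = kan_layer_size(in_dim=16, out_dim=8)
```

Because the coefficient count scales with in_dim * out_dim * (grid_size + spline_order), growing grid_size barely moves the total for small layers, which matches the stable parameter counts observed when tuning grid_size.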

(iii) We conducted experiments to evaluate different student models, including variations of MLP, TCN, LSTM, and CNN, with a focus on optimizing CNN through adjustments to key hyperparameters such as filter size, pooling strategy, dropout, and batch normalization. The results are summarized in Table 9 below:

Table 9. Performance comparison of different student models on the SHL dataset.

https://doi.org/10.1371/journal.pone.0324752.t009

The table presents a comprehensive comparison of various student models in terms of their performance metrics, computational efficiency, and architectural details. Here’s a detailed analysis:

  • S-CNN1 (filter=3) demonstrates superior accuracy compared to S-TCN and S-LSTM, with a notably higher ACC of 95.23%. Moreover, its Predicting Time (0.18 ms) is significantly shorter than that of S-TCN (0.95 ms) and S-LSTM (3.58 ms), making it more suitable for TMD systems requiring rapid responses. Compared to S-MLP, which has a similar Predicting Time (0.19 ms), S-CNN1 achieves a higher ACC (95.23% vs. 94.20%) with fewer FLOPs (2.25e+02 M vs. 2.93e+02 M), indicating better computational efficiency and accuracy.
  • S-CNN3, which incorporates dropout (dr(rate=0.1)), shows an improvement in ACC (97.27%) over S-CNN1 (95.23%) with minimal changes in Predicting Time and FLOPs. This suggests that dropout enhances model generalization without compromising efficiency.
  • S-CNN2, S-CNN3, and S-CNN4 reveal that increasing filter size (f) boosts model accuracy but also raises parameter count and computational demands. The optimal balance is achieved at f=3, where S-CNN3 attains the highest ACC of 97.27%, indicating that filter size significantly impacts model performance.
  • S-CNN5, which applies pooling (pl(ps=4)), exhibits a reduction in parameters but suffers from lower accuracy, implying that pooling may discard crucial information and is thus less effective for this task.
  • S-CNN6, equipped with batch normalization (bn), shows a slight increase in parameters and Predicting Time but a decrease in ACC, suggesting that batch normalization may not be as effective as other regularization techniques like dropout in this context.
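A Keras sketch of an S-CNN3-style student follows. The input window shape, block depth, and filter counts are assumptions (the table only fixes f=3, which we read here as the 1-D kernel size, and the dropout rate 0.1); the ablation-disfavoured max pooling and batch normalization are omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_student_cnn(input_shape=(512, 9), n_classes=8,
                      kernel_size=3, dropout_rate=0.1, filters=64):
    """Illustrative S-CNN3-style student: Conv1D blocks with dropout,
    no intermediate max pooling and no batch normalization."""
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)          # generalization at negligible cost
    # global average only aggregates over time for the classifier head
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Such a student would then be trained through the Distiller against the T-KAN teacher's softened outputs rather than on hard labels alone.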

(iv) We conducted experiments with various hyperparameters in knowledge distillation, including different temperatures in the distillation loss, different Alpha values in the total loss, and learning rates specific to knowledge distillation. The results are presented in Fig 10.

Fig 10. Different hyperparameters in knowledge distillation.

https://doi.org/10.1371/journal.pone.0324752.g010

  • In knowledge distillation, the temperature T controls the balance between hard and soft targets. Here z_i denotes the raw score (logit) computed for each class, and q_i is the probability derived from it via the temperature-scaled softmax: q_i = exp(z_i / T) / Σ_j exp(z_j / T). (17) At lower T values the model focuses more on hard targets (crisp predictions), while higher T values emphasize soft targets (probabilistic distributions). Through extensive testing, we observed that T = 5 provides the best trade-off, maximizing accuracy and stability (as shown in Fig 10a) and ensuring optimal knowledge transfer from the teacher to the student model.
  • We explored the impact of different alpha values on model performance when calculating the total loss (as shown in Eq 11), with results presented in Fig 10b. In knowledge distillation, adjusting alpha shifts the emphasis between the distillation loss and the student loss, affecting how much the model learns from the teacher’s probabilistic guidance. Our experiments revealed that setting alpha to 0.3 balances these two terms optimally, achieving the highest accuracy and ensuring robust knowledge transfer from the teacher to the student model.
  • We explored the impact of different learning rates ranging from 0.001 to 0.01 (as shown in Fig 10c), and found that the accuracy of TMD was highest when the learning rate was set to 0.001.

Ablation studies (RQ3).

We conducted ablation studies to assess the impact of removing DWT, replacing KAN with MLP, and modifying the CNN student model architecture. The results are summarized in Table 10 below:

Table 10. Impact of removing DWT, replacing KAN with MLP, and modifying the CNN student model architecture.

https://doi.org/10.1371/journal.pone.0324752.t010

The table presents the results of ablation studies on different components and architectures of the TMD model. Here’s a detailed analysis:

  • NoDWT: Removing DWT leads to a significant drop in accuracy (Acc=81.60%) compared to other models, despite having the fewest Parameters (41,059) and lower FLOPs (3.26e+02 M). This indicates that DWT plays a crucial role in enhancing model performance by decomposing the sensor time series into low-frequency trend and high-frequency event components, reducing the impact of non-stationary variations.
  • T-MLP: Replacing KAN with MLP in the teacher model results in a significant drop in accuracy to 83.23%. Since the change only involves the teacher model’s architecture, the student model’s computational efficiency remains the same. However, the notable decrease in the student model’s accuracy demonstrates that KAN’s B-spline functions model complex patterns more efficiently with fewer parameters, making KAN more effective than MLP at feature extraction and positively impacting the student model’s performance.
  • S-CNN3 achieves an accuracy of 97.27%, while S-CNN7 and S-CNN8 reveal that increasing the layers in the Single-channel CNN layer and Multi-channel CNN layer raises Parameters and FLOPs, yet reduces accuracy to 96.83% and 97.10%, respectively. This indicates potential overfitting and underscores the importance of architectural optimization to prevent performance degradation.

Computational and resource efficiency (RQ4).

To analyze the computational complexity of the algorithms in depth, we compared their training time, number of parameters, predicting time, and memory usage. The training time for each algorithm was measured over one hundred epochs, in seconds. As shown in Table 11, the KDTMD model has the shortest training time at 19,532 seconds, while MSRLSTM has the longest at 89,753 seconds. In terms of parameters, KDTMD has the fewest (48,803), roughly 10% of the smallest baseline. For predicting time, measured in milliseconds, KDTMD is the fastest at 0.18 ms, followed by MLP at 0.19 ms, whereas LSTM is the slowest at 2.96 ms. Regarding memory usage, KDTMD uses the least at 500 M (for the distilled student model, which requires further quantization for mobile deployment), approximately 7.7% of the 6487 M consumed by MSRLSTM, the most memory-intensive model. Overall, the KDTMD model achieves high accuracy while substantially reducing parameter size and demonstrates superior training efficiency, prediction speed, and memory usage, making it highly suitable for lightweight deployment.

Table 11. Computational and resource efficiency of different algorithms.

https://doi.org/10.1371/journal.pone.0324752.t011
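Predicting time and parameter count of the kind reported in Table 11 can be gathered with a small harness like the one below (the harness and its names are ours; FLOPs and resident memory need platform profilers and are out of scope here):

```python
import time
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf

def profile_model(model, sample, n_runs=100):
    """Average per-call predict latency in milliseconds,
    plus the model's parameter count."""
    model(sample)                            # warm-up so graph tracing isn't timed
    start = time.perf_counter()
    for _ in range(n_runs):
        model(sample)
    latency_ms = (time.perf_counter() - start) / n_runs * 1000.0
    return latency_ms, model.count_params()

# tiny stand-in model just to exercise the harness
tiny = keras.Sequential([keras.Input(shape=(3,)), layers.Dense(4)])
latency_ms, n_params = profile_model(tiny, tf.zeros((1, 3)), n_runs=10)
```

For deployment-grade numbers, one would time the quantized student on the target device rather than a desktop TensorFlow runtime.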

Conclusion

This paper focuses on fine-grained TMD. We propose a novel KDTMD model including DWT and KD modules. By combining DWT with KD, our model addresses non-stationary temporal dynamics, nonlinearity, and space complexity while enhancing performance and efficiency. The DWT module decomposes sensor time series into a trend-indicating low-frequency component and an event-indicating high-frequency component, minimizing the impact of event distribution shifts and reducing the effects of non-stationary variations. In our knowledge distillation setup, we utilize an efficient teacher model, T-KAN, which is based on linear KAN layers with learnable B-spline functions for enhanced feature representation and model interpretability. The student model, S-CNN, is trained by T-KAN, accelerating its learning process. Experimental results demonstrate that the KDTMD model achieved high accuracy rates of 97.27% and 96.56% on the SHL and HTC datasets, surpassing other methods. Notably, the KDTMD model achieves high accuracy with only about 10% of the parameters of the smallest baseline model. Furthermore, it also excels in training efficiency, prediction speed, and memory usage, making it ideal for lightweight deployment.

While the model demonstrates strong performance under ideal conditions, maintaining optimal accuracy in high-noise environments poses a challenge. Additionally, adapting the model to accommodate emerging transportation modes remains an area requiring further research and development. Future research will investigate self-supervised learning techniques to enhance the model’s ability to learn from unlabeled data. Additionally, future research will devote attention to refining the model’s inference mechanisms to ensure efficient real-time operation on edge devices. We look forward to applying the KDTMD model to a variety of mobile intelligent services, such as traffic prediction and logistics route optimization, with the aim of enhancing service quality and optimizing user experience.
