Abstract
This work proposes a new hybrid model for joint indoor localization and activity recognition by combining a Convolutional Neural Network-Gated Recurrent Unit (CNN-GRU) model with a Markov Random Field (MRF) for better classification. The CNN-GRU successfully captures spatial and temporal dependencies, while the MRF models the mutual relations of activities and locations by estimating their joint probability distribution. The new system was tested on a public smart home dataset with four activities (sitting, lying, walking, and standing) and four indoor locations (kitchen, bedroom, living room, and stairs). The hybrid framework obtained an accuracy of 95% for activity recognition and 93% for indoor localization, with a combined activity-location classification accuracy of 81%. These results confirm the system's ability to provide robust predictions in real-world smart environments, making it highly suitable for healthcare and intelligent-living applications, and show that it is efficient and deployable in practice, addressing the critical challenges of noisy and dynamic indoor environments.
Citation: Sohaib S, Bokhari SM, Shafi M, Alhashmi A (2025) A novel approach for joint indoor localization and activity recognition using a hybrid CNN-GRU and MRF framework. PLoS One 20(8): e0328181. https://doi.org/10.1371/journal.pone.0328181
Editor: Abel C. H. Chen, Chunghwa Telecom Co. Ltd., TAIWAN
Received: February 14, 2025; Accepted: June 26, 2025; Published: August 7, 2025
Copyright: © 2025 Sohaib et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The dataset that has been utilized is available at the following link with repository name, “a-dataset-for-indoor-localization-using-a-smart-home-in-a-box”: https://github.com/rymc/a-dataset-for-indoor-localization-using-a-smart-home-in-a-box. The source code is publicly available under the repository name “Joint-Indoor-Localization-and-Activity-Recognition” at the following link: https://github.com/Mohsin783/Joint-Indoor-Localization-and-Activity-Recognition. All the results and findings generated in this research are present in the paper.
Funding: This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. UJ-20-018-DR. The authors, therefore, acknowledge with thanks the University of Jeddah for technical and financial support.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The joint localization and recognition of human activity is a vital and developing research area, as information regarding human locations and actions is essential for numerous applications, including intelligent healthcare systems, smart homes, physical training, surveillance, human-computer interaction, and environmental awareness. The sophisticated healthcare system must monitor the real-time whereabouts of older people and oversee their activities to facilitate independent living while mitigating the danger of falls and accidents. Conventional techniques for recognizing human actions rely on analyzing inertial measurement unit sensor data and computer vision approaches [1]. Cheng et al. [2] introduced several approaches for detecting human body activities using wearable acceleration sensors, yielding favorable identification outcomes. Motion signals based on magnetic induction were used in [3] to identify human activities. However, accelerometers cannot deliver precise long-term location data owing to measurement inaccuracies and drift, and magnetometers are susceptible to interference from changing ambient magnetic fields. A vision-based approach was used for action recognition and localization in [4], with promising results. Nonetheless, vision-based systems include drawbacks related to privacy violations and susceptibility to lighting changes and obstructions, limiting their use in everyday scenarios. Indoor positioning systems have garnered more interest in recent years. Diverse technologies, such as radio frequency identification (RFID), optical light, WiFi, ultrasound, Bluetooth, and ultra-wideband (UWB), have been used in indoor positioning systems, with some technologies employed for human activity detection and localization. A WiFi fingerprinting method for activity detection and indoor localization was presented in [5]. 
This technique can identify only a limited number of locations inside interior environments since WiFi fingerprinting relies on Received Signal Strength (RSS) and often attains meter-level precision. Among the previously described technologies, UWB is one of the most promising indoor positioning systems because it can achieve centimeter-level precision, resilience to multi-path fading and interference, and low power consumption [6,7].
Channel state information (CSI) of WiFi devices has been widely studied for human sensing applications, such as activity detection [8], gesture recognition, indoor localization [9], and healthcare [10]. The success is due to many unique features of WiFi, such as the pervasive deployment of commercial WiFi devices, immunity to lighting conditions and obstacles beyond the limit of cameras, and non-intrusive sensing without requiring extra effort from users. While there is much research on the specific task of WiFi human sensing, there is little focus on the integrated activity detection and indoor localization problem. Accomplishing the integrated task will give rise to several favorable human-computer interaction applications. In a smart home with Internet-of-Things (IoT) devices, the gadgets may show different reactions to the same gesture commands depending on the location of the users [11].
More recent research has sought hybrid sensor fusion techniques to mitigate indoor localization noise. For instance, Brena et al. [12] proposed an adaptive fusion technique based on different technologies such as Wi-Fi RSSI, inertial sensors, and magnetic field signals, which are time-synchronized among themselves through a Kalman filter for improving localization precision. Their technique dynamically adapts to environmental changes and user mobility, effectively mitigating uncertainties caused by signal fluctuations and obstructions. Such approaches highlight the importance of probabilistic filtering and context-aware adaptation to facilitate robust localization, which is especially beneficial for systems such as ours that require accurate user positioning as a precursor to activity recognition.
Similarly, Candeloro et al. [13] developed a hybrid deep learning architecture that combines CNNs for spatial feature extraction and LSTMs for temporal modeling to tackle challenges caused by noisy sensor readings and unpredictable human motion in indoor localization. Their approach performed robustly under changing conditions caused by movement and dynamic signal environments. However, unlike Candeloro et al. [13], our contribution includes a Markov Random Field (MRF) to jointly model the relationship between activities and locations, extending beyond localization alone. Nevertheless, incorporating more advanced noise-robust models is a promising direction to expand our framework's applicability to uncontrolled, real-world deployments.
This study deals with joint indoor localization and activity recognition by introducing a Hybrid Convolutional Neural Network Gated Recurrent Unit (CNN-GRU) model for the classification of predefined locations and activities and then applying a Markov Random Field (MRF) to model the mutual relations between locations and activities by estimating their joint probability. We focus on the classification of four daily activities: sitting, lying, walking, and standing. In addition, we classify four indoor locations: kitchen, bedroom, living room, and stairs. To evaluate the effectiveness of our proposed models, we present the results using the necessary performance metrics, such as the confusion matrix, F1 score, precision, and recall. The proposed approach enhances joint prediction accuracy while ensuring seamless deployment in real-world scenarios.
The key contributions of this work are as follows:
- Hybrid CNN-GRU model: The proposed approach utilizes a hybrid CNN-GRU model to individually classify locations and activities while capturing spatial and temporal dependencies. Such an architecture allows effective modeling of dynamic activity patterns with location information.
- Joint activity-location classification: A novel aspect of the research uses a combined activity-location classification approach, where the outputs from separate activity and location models are fused to predict activity-location pairs. This integration leads to more accurate predictions compared to independent classification.
- Markov random field (MRF) integration: Introducing the Markov Random Field (MRF) framework allows for modeling dependencies among activities and locations. MRF effectively refines the joint prediction and improves the consistency among predicted activity-location pairs through the edge potential matrix encoding compatibility between activities and locations.
- Improved inference and accuracy: The integrated approach of CNN-GRU and MRF enhances the overall inference. The methodology, by maximizing the joint probability P(a,l) using the MRF, yields the most probable and consistent activity-location pairs for increased classification accuracy and decreased errors in prediction.
- Enhanced real-world applicability: The methodology exhibits real-world applicability in critical activity recognition and localization scenarios. The model can be applied to various domains, such as healthcare, smart homes, and location-based services, where activity and localization must be identified simultaneously.
The subsequent sections of this paper are organized as follows. Sect 2 outlines the related work in the literature. Materials and methods employed in the proposed approach are presented in Sect 3. Sect 4 presents the proposed hybrid CNN-GRU and MRF framework. Sect 5 describes the experimental setup, presenting the results and their interpretation as the empirical basis. Finally, Sect 6 concludes the paper, summarizing the key findings and suggesting directions for future research.
2 Related work
In recent years, many efforts have been devoted to activity recognition and indoor localization using a wide range of methodologies and sensor modalities. In the work by Bock et al. [14], the authors presented temporal action localization models for inertial-based human activity recognition and showcased improved performance over classical inertial models. Likewise, Zandi et al. [15] introduced RoboFiSense, a novel framework for WiFi sensing-based robotic arm activity classification, which underlines the capability of non-invasive sensing. Pagan et al. [16] introduced an ultra-low-power activity recognition system with adaptive compressed sensing and highlighted energy-aware solutions for remote health monitoring. Neural network architecture advances have been instrumental in enhancing activity recognition, as seen in spatio-temporal graph convolutional networks introduced in [17], which nicely capture spatial and temporal dependencies from activity data. Moreover, self-supervised learning techniques [18] have decreased dependence on labeled data while boosting generalization in wearable sensor-based recognition tasks.
Among these, innovative multimodal approaches have been particularly striking. For example, Konak et al. [19] introduced real-time 2D pose estimation for optimal sensor placement in activity recognition, which bridges the gap between video-based and sensor-based approaches. Similarly, [20] and HiFi-Net++ [21] underline the application of large language models and hierarchical fine-grained detection techniques to improve interpretability and robustness in localization tasks. Cross-domain challenges have also been addressed in recent works, with transfer learning approaches [22] allowing models to adapt across diverse datasets, thereby overcoming the constraints of domain-specific training. IoT-enabled frameworks, such as those presented in [23], allow for a robust analysis of sensor data for joint activity recognition and localization. WiFi sensing systems like that in [15] underline the application of channel state information (CSI) for non-invasive activity classification and the transition toward infrastructure-independent human activity recognition (HAR) solutions.
Deep learning has dramatically improved the performance of HAR by efficiently exploiting multimodal sensor data. Methods based on Convolutional Neural Networks (CNNs) and Spatio-Temporal Graph Convolutional Networks (ST-GCNs) [24] have achieved better recognition accuracy by modeling spatio-temporal dependencies. Transfer learning techniques [25] deal with cross-domain issues, allowing activity recognition across different datasets. Self-supervised learning frameworks [18] have also alleviated the reliance on labeled data and made HAR more scalable. Energy efficiency is still an essential aspect of IoT-based and wearable systems. Adaptive compressed sensing frameworks [22] can significantly decrease the cost of data transmission while achieving high recognition accuracy, which is especially valuable in remote health monitoring applications, where power consumption is a critical issue. Integration of attention mechanisms and noise-assisted models, as in AdaIFL [26], has improved localization and activity recognition in dynamic environments. Multi-modal approaches, e.g., wearable sensors and WiFi signals [27], give holistic solutions to the challenges in HAR and localization. Moreover, Graph Neural Networks (GNNs) [28] have been exploited to model complex spatio-temporal relationships in human activity recognition with superior accuracy and adaptability over conventional neural networks.
3 Problem formulation and dataset
This section presents the proposed research methodology, including a description of the chosen dataset, its preprocessing, and the architectures of the proposed models. The proposed framework for joint classification of localization and activity recognition is shown in Fig 1.
3.1 Problem formulation
The growing demand for context-aware applications in systems such as smart homes, health monitoring, and location-based services calls for accurate and reliable algorithms for indoor localization and activity recognition. While considerable progress has been made in each of these individual fields, the unification of activity detection and localization into a single framework remains a significant challenge. This is due to the intricate interaction of spatial and temporal characteristics in activities and locations, along with their mutual relations.
More importantly, many of the recent approaches suffer from scalability, robustness, and practical usability issues, especially in dynamic and noisy indoor settings. These problems bring to light the requirement for a holistic approach that not only explains the interrelations between activity and place but also allows effective and successful joint classification.
3.2 Data preparation
This research uses a publicly accessible dataset [29] collected in a smart home environment to evaluate our proposed method. Data was gathered in a two-bedroom, two-story terraced house in a residential area using the EurValve SHiB system [30]. The system comprises one wearable wrist device, four gateways placed in strategic locations (living room, kitchen, bedroom, and upstairs staircase landing), and a 4G network for sending data to a central server for processing. The wearable device contains a tri-axial ADXL362 accelerometer with a range of ±4g, sampling accelerometer data at 20 Hz on the x, y, and z axes. The wearable device transmits signals using Bluetooth Low Energy (BLE), and the Received Signal Strength Indicator (RSSI) values are logged at each gateway on packet arrival using the BLUEZ driver on a Broadcom BCM43438 combo Wi-Fi and BLE 4.1 System on Chip (SoC). The dataset has a tabular structure containing all information for each epoch: timestamps, RSSI values, packet sequence numbers, accelerometer measurements, receiving gateway nodes, and labels indicating actual room locations and corresponding activities. The dataset targets the challenge of localization within smart homes with activities including sitting, lying, walking, and standing, and locations such as the kitchen, bedroom, living room, and stairs. The extensive dataset allows full assessment of joint localization and activity recognition. For more details about the dataset and the calibration process, readers are referred to [29].
We have relied on a single benchmark dataset to ensure a controlled and reliable evaluation of our proposed model. This decision is driven by several factors that limit the feasibility of using multiple datasets in the current study. There is currently a scarcity of publicly available datasets that provide fine-grained indoor location and human activity labels synchronized in time. Standard benchmarks such as PAMAP2 [31], Opportunity [32], and UCI HAR [33] focus solely on activity recognition. At the same time, datasets like UJIIndoorLoc [34] and IndoorLoc [35] are dedicated to localization without activity annotations, limiting their applicability for training or evaluating joint models. Additionally, existing datasets exhibit significant variability in sensor modalities (e.g., WiFi, BLE, IMU), deployment scenarios, room configurations, and sampling rates, which creates challenges for model transferability and alignment. Addressing this heterogeneity would require complex domain adaptation techniques, which fall outside the current study’s scope but represent an important avenue for future research. In this context, the EurValve SHiB dataset [29] stands out as it uniquely provides synchronized BLE-RSSI-based localization and wearable-derived activity recognition within a realistic smart home environment, enabling joint classification of activity-location pairs. This makes it particularly suitable for validating our proposed hybrid CNN-GRU-MRF architecture. This work aims to establish a robust methodological baseline capable of capturing spatial-temporal dependencies and inter-variable relationships, and using a single, well-characterized dataset ensures controlled benchmarking. It minimizes confounding variables that could obscure the methodological contributions.
4 Proposed hybrid CNN-GRU algorithm
The proposed hybrid CNN-GRU model combines convolutional neural networks (CNNs) for spatial feature extraction and gated recurrent units (GRUs) for temporal sequence modeling. This combination allows the complementary strengths of CNNs and GRUs to be used for efficient localization and activity classification. In this network, CNN layers extract hierarchical spatial features, while GRU layers capture temporal dependencies within the data. The final classification is achieved through a fully connected layer. The complete algorithm is given in Algorithm 1.
The CNN layers take the input data $x \in \mathbb{R}^{N \times T \times F}$, where N denotes the batch size, T represents the temporal length, and F denotes the feature dimension. The convolution operation is mathematically defined as

$h = \mathrm{ReLU}(W * x + b), \qquad (1)$

where $*$ denotes the convolution operation applied to the input x with kernel weights W, and b is the bias term. ReLU introduces non-linearity, ensuring that the network can learn complex spatial patterns. Batch normalization followed by max-pooling is applied after each convolutional layer to stabilize training and reduce the spatial dimension.
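For illustration, the convolution-plus-ReLU step above can be sketched in plain NumPy. This is a minimal single-kernel sketch, not the paper's implementation; the kernel length and toy dimensions are our own choices:

```python
import numpy as np

def conv1d_relu(x, W, b):
    """Valid 1D convolution over time with one kernel, then ReLU.

    x: (T, F) input window, W: (k, F) kernel weights, b: scalar bias.
    Returns the (T - k + 1,) feature map h = ReLU(W * x + b).
    """
    k = W.shape[0]
    T = x.shape[0]
    h = np.array([np.sum(x[t:t + k] * W) + b for t in range(T - k + 1)])
    return np.maximum(h, 0.0)  # ReLU keeps only non-negative responses

# Toy example: T = 6 time steps, F = 3 features, kernel length k = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
W = rng.standard_normal((2, 3))
h = conv1d_relu(x, W, b=0.1)
assert h.shape == (5,) and np.all(h >= 0)
```

In the full model, batch normalization and max-pooling would follow this step, and deep-learning frameworks provide equivalent, vectorized layers.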
The temporal features extracted by the CNN are fed into the GRU layers. The GRU layers are bidirectional, capturing forward and backward dependencies in the temporal sequence. For each time step t, the GRU's hidden state is updated using

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \qquad (2)$

where $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$ is the update gate controlling the incorporation of new information, and $\tilde{h}_t$ is the candidate hidden state, computed as

$\tilde{h}_t = \tanh\!\left(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h\right). \qquad (3)$

Here, $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$ is the reset gate, $W_h$ and $U_h$ are learnable weights, and $b_h$ is the bias. The GRU effectively models temporal relationships, enhancing the sequence representation. The final classification is done through a fully connected layer with a softmax activation function given by

$y = \mathrm{softmax}(W_{fc} h + b_{fc}), \qquad (4)$
where $W_{fc}$ and $b_{fc}$ are the weight matrix and bias term of the fully connected layer. The softmax function converts the logits into class probabilities. To ensure stable and efficient training, the model's weights are initialized using the method introduced by Kaiming He et al. [36], defined as

$W \sim \mathcal{N}\!\left(0, \frac{2}{n_{\mathrm{in}}}\right), \qquad (5)$
where $n_{\mathrm{in}}$ denotes the number of inputs to the layer. This initialization preserves the variance of activations across layers, preventing gradients from vanishing or exploding.
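A minimal NumPy sketch of this initialization (the layer sizes here are illustrative, not those of the proposed network):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Draw weights from N(0, 2 / n_in), following He et al."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(42)
W = he_init(256, 128, rng)  # hypothetical 256-input, 128-unit layer
# The empirical variance should sit close to the target 2 / n_in
assert abs(W.var() - 2.0 / 256) < 1e-3
```

Frameworks such as PyTorch expose this scheme directly (e.g., Kaiming-normal initialization), so in practice no hand-rolled sampler is needed.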
During training, the model optimizes the cross-entropy loss function given by

$\mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i), \qquad (6)$

where $y_i$ is the ground-truth class label and $\hat{y}_i$ is the predicted probability. The parameters are updated through a gradient-based optimizer with weight decay to avoid over-fitting.
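The GRU update, softmax output, and cross-entropy loss described above can be tied together in a short NumPy sketch. This is a simplified single-direction, single-time-step illustration with made-up sizes, not the trained bidirectional model:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU time step: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])      # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])      # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde                      # new hidden state

def softmax_cross_entropy(logits, y_true):
    """Class probabilities via softmax and the loss -log(p[y_true])."""
    zs = logits - logits.max()               # shift for numerical stability
    probs = np.exp(zs) / np.exp(zs).sum()
    return probs, -np.log(probs[y_true])

rng = np.random.default_rng(1)
F, H, C = 4, 3, 4                            # toy feature/hidden/class sizes
p = {k: rng.standard_normal((H, F)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.standard_normal((H, H)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(H) for k in ("bz", "br", "bh")})

h = gru_step(rng.standard_normal(F), np.zeros(H), p)
W_fc, b_fc = rng.standard_normal((C, H)), np.zeros(C)
probs, loss = softmax_cross_entropy(W_fc @ h + b_fc, y_true=0)
assert np.isclose(probs.sum(), 1.0) and loss > 0
```

In the actual model, a framework's autograd would backpropagate this loss through the GRU and CNN layers, as summarized in Algorithm 1.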
Algorithm 1. Hybrid CNN-GRU model with He initialization and training procedure.
Input: Input data x, ground-truth labels y, number of epochs E, learning rate $\eta$, weight decay λ, CNN layers Lcnn, GRU layers Lgru, number of classes C.
Output: Trained model model.
Model Initialization:
Define the HybridCNNGRU model with input channels, CNN hidden dimension, GRU hidden dimension, and number of classes.
Initialize CNN layers:
for i = 1 to Lcnn do
{For each CNN layer} Initialize convolutional layer with ReLU, batch normalization, and max-pooling.
end for
Initialize GRU layers:
for i = 1 to Lgru do
{For each GRU layer} Initialize GRU layer with bidirectional configuration.
end for
Define fully connected layer for classification.
Apply He initialization for weights: $W \sim \mathcal{N}(0, 2/n_{\mathrm{in}})$, where $n_{\mathrm{in}}$ is the number of input units to the layer.
Training Loop:
for epoch = 1 to E do
Training Phase:
Set the model to training mode.
for each batch in the training dataset do
Reshape xi for CNN input.
Zero the gradients of the model parameters.
Perform a forward pass: output = model(xi).
Compute loss $\mathcal{L}$(output, yi).
Perform backward pass: L.backward().
Update model parameters using the optimizer: optimizer.step().
Accumulate loss and compute accuracy.
end for
Evaluation Phase:
Set the model to evaluation mode.
for each batch in the test dataset do
Reshape xi for CNN input.
Perform a forward pass: output = model(xi).
Compute loss $\mathcal{L}$(output, yi).
Update test loss and accuracy.
end for
end for
Evaluate the performance of the model on test data using accuracy, recall, F1 score, and precision.
This approach first classifies the activities and locations separately using a hybrid CNN-GRU model. It then carries out the classification of combined activity-location pairs by relying on the same CNN-GRU architecture while integrating an MRF to further model the dependencies in activity-location pairs; the proposed diagram is shown in Fig 2.
This combination enables the model to harness the spatial-temporal modeling provided by the CNN-GRU and the probabilistic inference enabled through the MRF, which guarantees a robust classification of the activity-location pair. The MRF models the joint probability distribution of activities and locations by incorporating their compatibility using an edge potential matrix. An MRF is a graphical model that represents a set of random variables with an undirected graph, where nodes represent the variables (activity and location in this case), and edges encode their pairwise dependencies. In this context, the joint probability for an activity-location pair is expressed as

$P(a, l) = \frac{1}{Z}\,\psi(a, l)\,P(a)\,P(l), \qquad (7)$

where P(a) and P(l) are the individual probabilities of activity a and location l, obtained from the softmax outputs of the CNN-GRU models for activity and location classification. These probabilities are combined using the MRF framework to compute the joint probabilities P(a,l). The most probable activity-location pair is inferred using

$(\hat{a}, \hat{l}) = \arg\max_{a,\, l} P(a, l). \qquad (8)$

In (7), $\psi(a, l)$ is the edge potential matrix representing the compatibility between a and l, and Z is the normalization constant ensuring the probabilities sum to one, given by

$Z = \sum_{a}\sum_{l} \psi(a, l)\,P(a)\,P(l). \qquad (9)$
The edge potential matrix is used to encode the compatibility between activities and locations. It is empirically estimated from the training data by analyzing the co-occurrence frequencies of activities and locations. More specifically, a 2D matrix C is built in which each cell C(i,j) denotes the number of occurrences of activity i with location j. The matrix is then normalized to compute the probabilities as given by

$\psi(i, j) = \frac{C(i, j)}{\sum_{j} C(i, j)}. \qquad (10)$

Here, $\sum_{j} C(i, j)$ denotes the total number of occurrences of activity i in all locations. This normalization ensures that the edge potentials are proper probabilities, summing to one for each activity.
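The row-normalization of the co-occurrence matrix C into edge potentials can be sketched as follows; the counts below are invented for illustration, not taken from the dataset:

```python
import numpy as np

# Hypothetical co-occurrence counts C(i, j): rows are activities
# (sitting, lying, walking, standing) and columns are locations
# (kitchen, bedroom, living room, stairs), tallied from training labels.
C = np.array([
    [40.0, 10.0, 40.0, 10.0],   # sitting
    [ 2.0, 90.0,  6.0,  2.0],   # lying
    [10.0, 10.0, 10.0, 70.0],   # walking
    [30.0, 20.0, 30.0, 20.0],   # standing
])

# Row-normalize: psi(i, j) = C(i, j) / sum_j C(i, j)
psi = C / C.sum(axis=1, keepdims=True)

# Each activity row is now a proper distribution over locations
assert np.allclose(psi.sum(axis=1), 1.0)
```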
To improve the description of the semantic and empirical correspondences between human activities and their respective indoor spaces, we construct a compatibility matrix $\psi \in \mathbb{R}^{4 \times 4}$, where each element $\psi(a, l)$ represents the conditional likelihood of a specific activity occurring at a particular location. It is empirically learned from frequency statistics in the training set describing how often each activity-location pair is observed.
The matrix is row-normalized to yield valid probability distributions over locations, so the values for each activity sum to one. For example, the model learns that lying is most strongly affiliated with the bedroom (compatibility score: 0.90), while walking is most compatible with the stairs (0.70). Sitting displays an even split between the kitchen and the living room (0.40 each), as both are used for sitting activities. The compatibility matrix acts as the edge potential function of the Markov Random Field (MRF) model and is key to improving the joint activity-location inference. It steers the model toward contextually plausible predictions and mitigates errors under ambiguity or noise by promoting high-probability relations based on empirical patterns. We display this matrix as a heatmap in Fig 3 to enhance interpretability, illustrating the learned relationships between activities and locations. This visualization conveys the learned context-awareness of the model and aids comprehension of its inferential mechanisms.
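As a sketch of how the edge potentials refine the joint prediction, the following snippet combines hypothetical softmax outputs with an illustrative compatibility matrix. The probability values are invented; only the lying-bedroom (0.90) and walking-stairs (0.70) entries echo the scores quoted above:

```python
import numpy as np

def mrf_joint(p_act, p_loc, psi):
    """Compute P(a, l) = psi(a, l) P(a) P(l) / Z and its argmax pair."""
    joint = psi * np.outer(p_act, p_loc)   # unnormalized joint potentials
    joint /= joint.sum()                   # divide by Z so entries sum to one
    a_hat, l_hat = np.unravel_index(joint.argmax(), joint.shape)
    return joint, a_hat, l_hat

# Hypothetical per-model softmax outputs for one sample:
p_act = np.array([0.30, 0.40, 0.20, 0.10])   # sitting, lying, walking, standing
p_loc = np.array([0.35, 0.30, 0.25, 0.10])   # kitchen, bedroom, living room, stairs
psi = np.array([                              # illustrative compatibility matrix
    [0.40, 0.05, 0.40, 0.15],
    [0.05, 0.90, 0.04, 0.01],
    [0.10, 0.10, 0.10, 0.70],
    [0.30, 0.20, 0.30, 0.20],
])

joint, a_hat, l_hat = mrf_joint(p_act, p_loc, psi)
assert np.isclose(joint.sum(), 1.0)
assert (a_hat, l_hat) == (1, 1)  # the context prior favors (lying, bedroom)
```

Note how the most probable location from the softmax alone (kitchen, 0.35) is overridden: the strong lying-bedroom potential pulls the joint maximum to a contextually consistent pair.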
The MRF model in the proposed methodology operates by evaluating the trained activity and location classifiers in evaluation mode; the procedure is given in Algorithm 2.
Algorithm 2. Activity and localization model inference.
Input: Activity and location test data loaders, trained activity and location models, and the MRF model.
Output: Joint confusion matrix for activity-location pairs.
Model Initialization:
Generate Probabilities:
for each batch in the activity and location test data loaders do
Apply softmax to model outputs for activity and location.
end for
Generate Joint Probabilities using MRF:
for each combined data batch do
Compute activity and location outputs using softmax.
Compute joint probabilities with MRF.
Infer most probable activity and location.
Store true and predicted labels.
end for
Encode joint labels and compute confusion matrix.
Plot and Save Confusion Matrix:
Plot the confusion matrix and classification report.
5 Result and discussion
The proposed approach and benchmark methodologies are evaluated using confusion matrices with thorough explanations for each strategy. A complete classification report for the three scenarios is provided, and sensitivity and specificity are discussed in the subsequent sections.
5.1 Scenario 1
The confusion matrix for activity recognition, given in Fig 4, shows that the model identifies activities through powerful temporal and spatial pattern learning. Lying is the most accurate at 96.71%, with few misclassifications. Sitting and standing perform well, with 93.61% and 92.12% accuracy, respectively, though minor confusion occurs with adjacent activities. Walking is 91.02% accurate and is most often misclassified as standing or sitting.
The classification report for activity recognition is given in Table 1. Lying is detected with few false negatives, achieving 0.94 precision, 0.97 recall, and a 0.95 F1 score. Sitting's F1 score of 0.94 suggests balanced performance owing to consistent precision and recall. Standing has 0.95 precision, 0.92 recall, and a 0.93 F1 score, indicating somewhat more false negatives, while walking exhibits equal precision and recall with an F1 score of 0.91, indicating reliable identification.
5.2 Scenario 2
The confusion matrix evaluating indoor localization in the bedroom, kitchen, living room, and stairwell is given in Fig 5. The elevated diagonal accuracy for the bedroom and living room indicates robust model predictions for those regions. The few off-diagonal errors, such as the kitchen being misclassified as the bedroom, indicate commendable accuracy with negligible overlaps. The matrix shows that the proposed strategy is effective and suggests methods to enhance inter-class uniqueness for improved localization accuracy. These findings are crucial for the development of indoor intelligent systems.
The indoor localization classification report shows strong performance across the four locations, as given in Table 2. A precision of 0.96, recall of 0.95, and F1 score of 0.95 signify high accuracy with few false positives and negatives for the bedroom. The kitchen achieved a precision of 0.92, a recall of 0.90, and an F1 score of 0.91, demonstrating dependable detection with potential for recall improvement. The living room achieved a precision of 0.93, a recall of 0.95, and an F1 score of 0.94, indicating proficient identification with slightly stronger recall. The stairs exhibited steady performance with a precision of 0.94, a recall of 0.93, and an F1 score of 0.93.
5.3 Scenario 3
Joint activity recognition and indoor localization results are presented in this scenario. Fig 6 gives the confusion matrix of joint activity and localization. A hybrid CNN-GRU is used to evaluate activities and locations separately, thereafter integrating the outputs using an MRF to ascertain the joint probability. Insights from the joint activity-location confusion matrix about model performance and safety scenarios are essential. Recognizing hazardous events such as lying in the kitchen (0.89 accuracy) is essential, since they may signify a fall and require immediate emergency intervention. The accuracy of standing on the stairs (1.00) demonstrates steadiness, whereas sitting on the stairs (0.65) may suggest fatigue or pain. Walking in the bedroom (0.83) is often acceptable; however, errors involving lying or sitting may suggest an anomaly. The lower accuracy for walking on the stairs (0.59) indicates difficulty perceiving dynamic motion, increasing the danger of unexpected falls. Nonetheless, sitting in the living room (0.96) is readily identifiable. The consistency of activity and location predictions using the MRF with the hybrid CNN-GRU enhances detection in high-risk situations and helps ensure indoor user safety.
The combined activity and location recognition results reveal the model's intricacies, as given in Table 3. Lying_Bedroom had a high F1 score of 0.83 with strong recall, while Lying_Kitchen attained 0.94 precision. However, Lying_Stairs exhibited a decreased F1 score (0.68) while maintaining excellent precision (1.00). Sitting_Kitchen and Standing_Livingroom had commendable F1 scores of 0.81 and 0.86, respectively, indicating precise detection. The data indicate that the model can classify activities and locations concurrently, with variability across specific scenarios and conditions.
The high precision for Lying_Kitchen and Standing_Livingroom indicates robust classification accuracy. Reduced recall for Lying_Stairs and Standing_Kitchen indicates difficulty in capturing all pertinent events, potentially signifying missed emergency risks.
The F1 scores effectively balance precision and recall, demonstrating strong performance for Lying_Bedroom and Sitting_Livingroom. In critical situations such as Walking_Stairs and Standing_Stairs, low scores indicate room for improvement in high-risk, transitional activities. Our results indicate that further refinement is necessary for successful activity-location identification, especially in safety-related scenarios.
5.4 Scenario 4
To demonstrate the contribution of the Markov Random Field (MRF) in the proposed framework, we first perform activity and location classification without the MRF, as shown in Table 4. The results indicate that although the multi-task CNN-GRU model achieves reasonable effectiveness at isolated activity classification (e.g., F1-scores of 0.91 for Sitting and 0.90 for Lying), its location classification is much poorer, especially in settings with semantic ambiguity or weak signal strength such as Kitchen (F1-score: 0.097) and Bedroom (F1-score: 0.274), compared with the results with the MRF (shown in Table 3). This indicates that jointly learning activities and locations without explicitly representing their relationships can cause misaligned or inconsistent predictions. In contrast, the proposed CNN-GRU+MRF model uses a learned edge potential matrix to capture contextual compatibility, which yields much better joint classification performance (e.g., Lying_Bedroom: F1 = 0.83, Sitting_Kitchen: F1 = 0.81, Standing_Livingroom: F1 = 0.86).
Combining Convolutional Neural Networks and Gated Recurrent Units (CNN-GRU) with an MRF remedies a fundamental shortcoming of joint activity and localization models: the absence of contextual information. Although the CNN-GRU successfully extracts spatial-temporal dynamics, the MRF module adds probabilistic inference over activity-location correspondences through the learned edge potentials. This produces more semantically consistent and noise-resistant predictions, particularly in noisy real-world scenarios or with overlapping behavior patterns. Overall, CNN-GRU+MRF offers a more contextual and interpretable joint inference technique with efficient application in smart healthcare, ambient assisted living, and intelligent indoor environments.
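One simple way to obtain such an edge potential matrix is from activity-location co-occurrence statistics in the training labels. The counts below are hypothetical, and Laplace smoothing is an assumed design choice for the sketch rather than the paper's exact procedure:

```python
import numpy as np

# Hypothetical co-occurrence counts of (activity, location) pairs in the
# training labels; rows = activities, columns = locations.
counts = np.array([
    [120,  40, 300,  15],   # sitting
    [ 10, 250,  80,   2],   # lying
    [ 90,  30, 110,  60],   # walking
    [140,  20, 100,  50],   # standing
], dtype=float)

# Laplace-smoothed joint frequencies turned into log edge potentials, so
# rare pairs (e.g. lying on stairs) are penalized but never impossible.
alpha = 1.0
probs = (counts + alpha) / (counts + alpha).sum()
log_phi = np.log(probs)
```

Working in log space lets the edge potentials be added to the log-probabilities of the unary CNN-GRU heads during inference, and the smoothing term `alpha` keeps semantically rare but physically possible pairs from being ruled out entirely.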
6 Limitations
The following text highlights some limitations of the proposed work. Scalability to Additional Classes: Although our system demonstrates acceptable results with the present set of four activities and four indoor locations, scaling to additional classes may introduce more inter-class overlap and may require retraining or fine-tuning the architecture. We recognize that as the number of activity-location combinations grows, model complexity and data demands increase accordingly.
Sensitivity to Sensor Dropout: The system depends on continuous data from the gateway and wearable sensors. Sensor dropout or communication latency can impact model performance, especially for real-time deployment. As a countermeasure, we plan to incorporate dropout-robust learning mechanisms and data imputation techniques in future research.
Ambiguous Activity-Location Pairs: Some activity-location pairs—e.g., sitting in the kitchen or lying on stairs—are semantically ambiguous or naturally rare. Such instances can cause classification ambiguity. Our MRF model has mitigated this somewhat by capturing co-occurrences. Still, we acknowledge that context-dependent priors or multimodal fusion (e.g., vision or ambient sensors) are needed to disambiguate such instances further.
Generalization Across Environments: The proposed model has been tested with data from a particular smart home setup. Performance is promising in this restricted setting, but direct extrapolation to other layouts, building materials, and sensor positions in actual homes or hospitals may reduce accuracy. Transfer learning and domain adaptation methods should be investigated in future research to achieve cross-environment robustness.
7 Conclusion
This paper proposed a hybrid CNN-GRU and MRF-based framework for the joint classification of indoor locations and human activities. The system leverages the spatial-temporal strengths of the CNN-GRU for individual classifications and enhances prediction consistency through the MRF. Experimental evaluations showed substantial improvement in classification accuracy for both tasks, with precision and recall metrics suitable for practical applications in healthcare and smart home systems. The proposed framework is especially applicable to real-world smart homes, healthcare systems, and IoT-based environments where precise activity-location prediction is essential. In particular, the combined activity-location classification demonstrated the potential of integrating probabilistic dependencies to enhance overall model reliability. These results verify the effectiveness of our approach and its applicability in environments requiring accurate and dynamic user context understanding. Future work includes augmenting datasets for variety, adding modalities such as audio and ambient sensors, improving computational efficiency for edge devices, and investigating semi-supervised learning to reduce dependence on labeled data. These extensions aim to improve the flexibility, scalability, and efficiency of the CNN-GRU-MRF framework for wider real-world applications.
References
- 1. Huang X, Wang F, Zhang J, Hu Z, Jin J. A posture recognition method based on indoor positioning technology. Sensors (Basel). 2019;19(6):1464. pmid:30917494
- 2. Cheng L, You C, Guan Y, Yu Y. Body activity recognition using wearable sensors. In: 2017 Computing Conference. 2017. p. 756–65. https://doi.org/10.1109/sai.2017.8252181
- 3. Golestani N, Moghaddam M. Human activity recognition using magnetic induction-based motion signals and deep recurrent neural networks. Nat Commun. 2020;11(1):1551. pmid:32214095
- 4. Xu W, Miao Z, Yu J, Ji Q. Action recognition and localization with spatial and temporal contexts. Neurocomputing. 2019;333:351–63.
- 5. Wang F, Feng J, Zhao Y, Zhang X, Zhang S, Han J. Joint activity recognition and indoor localization with WiFi fingerprints. IEEE Access. 2019;7:80058–68.
- 6. Cheng L, Wu Z, Lai B, Yang Q, Zhao A, Wang Y. Ultra wideband indoor positioning system based on artificial intelligence techniques. In: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI). 2020. p. 438–44.
- 7. Cheng L, Chang H, Wang K, Wu Z. Real time indoor positioning system for smart grid based on UWB and artificial intelligence techniques. In: 2020 IEEE Conference on Technologies for Sustainability (SusTech). 2020. p. 1–7. https://doi.org/10.1109/sustech47890.2020.9150486
- 8. Wang Y, Liu J, Chen Y, Gruteser M, Yang J, Liu H. E-eyes: device-free location-oriented activity identification using fine-grained WiFi signatures. In: Proceedings of the 20th Annual International Conference on Mobile Computing and Networking. 2014. p. 617–28.
- 9. Xie Y, Li Z, Li M. Precise power delay profiling with commodity WiFi. In: Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. 2015. p. 53–64. https://doi.org/10.1145/2789168.2790124
- 10. Kotaru M, Joshi K, Bharadia D, Katti S. SpotFi: decimeter level localization using WiFi. In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. 2015. p. 269–82.
- 11. Gubbi J, Buyya R, Marusic S, Palaniswami M. Internet of Things (IoT): a vision, architectural elements, and future directions. Future Gen Comput Syst. 2013;29(7):1645–60.
- 12. Thakur N, Han CY. Multimodal approaches for indoor localization for ambient assisted living in smart homes. Information. 2021;12(3):114.
- 13. Wu J, Feng Y, Chang CK. LiLo: ADL localization with conventional luminaries and ambient light sensor. Electronics. 2022;11(16):2503.
- 14. Bock M, Moeller M, Van Laerhoven K. Temporal action localization for inertial-based human activity recognition. arXiv preprint 2023.
- 15. Zandi R, Behzad K, Motamedi E, Salehinejad H, Siami M. Robofisense: attention-based robotic arm activity recognition with wifi sensing. IEEE J Select Topics Signal Process. 2024.
- 16. Pagan J, Fallahzadeh R, Pedram M, Risco-Martin JL, Moya JM, Ayala JL, et al. Toward ultra-low-power remote health monitoring: an optimal and adaptive compressed sensing framework for activity recognition. IEEE Trans Mobile Comput. 2019;18(3):658–73.
- 17. Huang Z, Shen X, Tian X, Li H, Huang J, Hua X-S. Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020. p. 2122–30.
- 18. Haresamudram H, Essa I, Plötz T. Assessing the state of self-supervised human activity recognition using wearables. Proc ACM Interact Mob Wearable Ubiquitous Technol. 2022;6(3):1–47.
- 19. Konak O, Wischmann A, van De Water R, Arnrich B. A real-time human pose estimation approach for optimal sensor placement in sensor-based human activity recognition. In: Proceedings of the 8th International Workshop on Sensor-Based Activity Recognition and AI. 2023. p. 1–6.
- 20. Sun Z, Jiang H, Chen H, Cao Y, Qiu X, Wu Z, et al. ForgerySleuth: empowering multimodal large language models for image manipulation detection. arXiv preprint 2024. https://arxiv.org/abs/2411.19466
- 21. Guo X, Liu X, Masi I, Liu X. Language-guided hierarchical fine-grained image forgery detection and localization. Int J Comput Vis. 2024;1–22.
- 22. Li Y, Cheng F, Yu W, Wang G, Luo G, Zhu Y. AdaIFL: adaptive image forgery localization via a dynamic and importance-aware transformer network. In: Proceedings of the European Conference on Computer Vision. 2025.
- 23. Alazeb A, Azmat U, Al Mudawi N, Alshahrani A, Alotaibi SS, Almujally NA, et al. Intelligent localization and deep human activity recognition through IoT devices. Sensors. 2023;23(17):7363.
- 24. Ahmad T, Jin L, Zhang X, Lai S, Tang G, Lin L. Graph convolutional neural network for human action recognition: a comprehensive survey. IEEE Trans Artif Intell. 2021;2(2):128–45.
- 25. Wang J, Zheng VW, Chen Y, Huang M. Deep transfer learning for cross-domain activity recognition. In: Proceedings of the 3rd International Conference on Crowd Science and Engineering. 2018. p. 1–8.
- 26. Ding J, Wang Y. WiFi CSI-based human activity recognition using deep recurrent neural network. IEEE Access. 2019;7:174257–69.
- 27. Ghosh A, Raha A, Mukherjee A. Energy-efficient IoT-health monitoring system using approximate computing. Internet of Things. 2020;9:100166.
- 28. Feng M, Meunier J. Skeleton graph-neural-network-based human action recognition: a survey. Sensors (Basel). 2022;22(6):2091. pmid:35336262
- 29. McConville R, Byrne D, Craddock I, Piechocki R, Pope J, Santos-Rodriguez R. A dataset for room level indoor localization using a smart home in a box. Data Brief. 2019;22:1044–51. pmid:30740491
- 30. Pope J, McConville R, Kozlowski M, Fafoutis X, Santos-Rodriguez R, Piechocki RJ, et al. SPHERE in a box: practical and scalable EurValve activity monitoring smart home kit. In: 2017 IEEE 42nd Conference on Local Computer Networks Workshops (LCN Workshops). 2017. p. 128–35. https://doi.org/10.1109/lcn.workshops.2017.74
- 31. Reiss A, Stricker D. Introducing a new benchmarked dataset for activity monitoring. In: 16th International Symposium on Wearable Computers. 2012. p. 108–9.
- 32. Roggen D, Calatroni A, Rossi M, Holleczek T, Tröster G, Lukowicz P, et al. Collecting complex activity datasets in highly rich networked sensor environments. In: International Conference on Networked Sensing Systems. 2010. p. 233–40.
- 33. Anguita D, Ghio A, Oneto L, Parra X, Reyes-Ortiz JL. A public domain dataset for human activity recognition using smartphones. In: ESANN. 2013. p. 437–42.
- 34. Torres-Sospedra J, Trilles S, Montoliu R, Martinez-Usó A, Avariento J, Ramírez M, et al. UJIIndoorLoc: a new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In: Proceedings of IPIN. 2014. p. 261–70.
- 35. Montoliu R, Torres-Sospedra J, Trilles S, Huerta J. An evaluation of the IndoorLoc platform for indoor positioning. Sensors. 2017;17(4):846.
- 36. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. p. 1026–34.