
Human swimming posture recognition combining improved 3D convolutional network and attention residual network

Abstract

Human swimming posture recognition is a key technology for improving training effect and reducing sports injury by analyzing and recognizing a swimmer's movement posture. However, existing technical means cannot accurately recognize human swimming posture in underwater environments to a high standard. For this reason, the study takes the 3D convolutional neural network as the model basis and introduces global average pooling and batch normalization to optimize its network structure and data processing, respectively. Meanwhile, a full pre-activation residual network and a three-branch convolutional attention mechanism are added to improve feature extraction and recognition. Finally, a novel human swimming posture recognition model is proposed. The outcomes revealed that this model had the highest recognition accuracy of 95%, the highest recall of 93.26% and the highest F1 value of 92.87%. Its pose recognition errors were as low as 4.7%, 4.9%, 2.1% and 6.6% for freestyle, breaststroke, butterfly and backstroke, respectively. The shortest recognition time was 6.78 s for the freestyle item; compared with recognition models of the same type, the proposed model minimized recognition time and reduced recognition error. The new model shows significant advantages in recognition accuracy and computational efficiency. It can provide more effective support for recognizing athletes' swimming posture in future swimming endeavors.

1. Introduction

In the field of sports biomechanics and artificial intelligence, human swimming posture recognition has emerged as a key area of research in recent years. For professional swimmers, even small posture adjustments can have a major impact on performance [1]. Accurately identifying and analyzing swimming posture can help athletes correct subtle errors in their movements, improve efficiency and reduce resistance. Moreover, incorrect swimming posture may lead to long-term athletic injuries; by continuously monitoring the swimmer's posture, timely detection and correction of improper movements can help to reduce the risk of injuries to the shoulder, lower back and other parts of the body [2]. Modern technology, computer vision and deep learning (DL) techniques in particular, has led to the creation of new tools and methods for swimming posture recognition. Giulietti N et al. found that existing video analysis techniques for swimming posture struggled to resist the effects of bubbles, splashes and light reflections. For this reason, they proposed a novel markerless 2D swimmer posture estimation method combining wearable sensors and the SwimmerNET network. The experimental results demonstrated that the method had an average error as low as 1 mm in recognizing the posture of athletes with different physical characteristics and swimming techniques [3]. To improve the efficiency of pose recognition for aerobic sports such as swimming, Liu Q combined convolutional neural networks (CNNs) and long short-term memory (LSTM) and proposed a CNN-LSTM recognition model. Experimental results indicated that this model provided higher recognition accuracy and robustness than the traditional model [4]. To overcome the limitations of existing underwater swim posture recognition technology, which were imposed by the wavelength of visible light, Wang et al. proposed a new recognition method that combined radius outlier removal and a PointNet network. This method was developed using data collected by a light detection and ranging system. Experimental results indicated that the highest index rate of this new method was 97.51% [5]. Changes in lighting conditions, image quality degradation and occlusion can limit the accuracy and stability of CNN-based swimming posture keypoint detection. For this reason, Xu B proposed a novel swimming posture recognition model after preprocessing and enhancing the input image. Experimental results indicated that the model performed well under different lighting conditions and image qualities [6]. According to Chen L et al., there was still room for improvement in the efficiency of machine learning and DL methods for activity recognition in sports like swimming. They therefore proposed a novel human swimming posture recognition model combining reinforcement learning and inertial measurement units. The outcomes indicated that the balance accuracy of this new model for human back, waist, and upper and lower limb posture recognition was 96.27% [7]. Wang Z et al. proposed a Transformer-based dynamic fish detection method for underwater target detection by combining FishNet and Transformer models. The experimental results showed that the average accuracy of the method reached 83.2%, but its underwater detection robustness still needed to be improved [8].

Traditional swimming posture analysis relies on the coach's experience and frame-by-frame analysis of video footage, which is not only time-consuming and laborious but also limited in accuracy. Multidimensional CNNs, especially the two-dimensional CNN (2D-CNN) and three-dimensional CNN (3D-CNN), have a natural advantage in processing video data, because they are able to capture both spatial and temporal information to better understand the dynamics of the human body during swimming [9]. Antillon D W O et al. proposed a diving gesture communication recognition model combining 2D-CNN and support vector machine algorithms in an attempt to improve human swimming posture recognition. After ten tests, the experimental findings indicated that the model's accuracy and F1 value averaged between 0.95 and 0.98, which was better than conventional recognition techniques [10]. Cao X et al. found that the robustness of swimmer pose estimation methods utilizing graph structures is poor. They therefore proposed a human swimming key point detection model combining a multi-dimensional convolutional network and a high-resolution network. The outcomes indicated that the model achieved desirable results in swimmer pose estimation with high key point detection accuracy [11]. To increase the detection accuracy of intelligent underwater gesture recognition sensors, Fan L et al. used 3D-CNN with capacitive stretch sensors to create a novel swimming gesture recognition model. The outcomes revealed that the gesture recognition was accurate and efficient [12]. To help divers with underwater jobs, Liu T et al. developed a swimming posture identification technique utilizing 3D-CNN among DL algorithms. The outcomes displayed that the approach could effectively improve posture recognition accuracy by 40% using an underwater dataset and target tracking [13]. To enhance human posture recognition for underwater snorkeling for timely monitoring and emergency rescue, Rahman F Y A et al. proposed a real-time monitoring model incorporating 3D-CNN. According to the trial findings, the model could identify snorkelers' poses with up to 87.9% accuracy and a 0.4% loss rate [14]. Wu Y et al. introduced the Transformer model and vision Transformer (ViT) to improve the visual detection level of fish smart feeding, and proposed a visual detection model with an improved Transformer. The experimental results showed that this model achieved better visual detection results with an F1 value of 94.13%. However, its effectiveness in complex environments needed improvement compared to the model proposed in this study [15].

In summary, despite significant advances in existing research on convolutional neural networks, spatio-temporal feature modeling, and attention mechanisms, most literature remains confined to descriptive summaries of algorithmic performance or single-scenario testing. There is a lack of systematic comparisons addressing model generalization, spatio-temporal robustness, and complex underwater interference factors. The significance of this research gap lies in the fact that models lacking the ability to distinguish complex environments and dynamic features across multiple swimming strokes will struggle to support movement optimization and injury prevention in sports science. This limitation also restricts the application expansion of artificial intelligence in real-world training monitoring scenarios. For instance, in practical swimming training, models unable to adapt to varying pool lighting conditions, bubble interference, or individual movement variations will struggle to accurately identify movement postures in real time. Consequently, they cannot provide reliable technical feedback to coaches or personalized corrective suggestions to athletes. Therefore, it is essential to conduct in-depth investigations into the performance differences among various models regarding spatio-temporal feature extraction accuracy, computational complexity (CC), and resilience to underwater lighting interference. This will reveal the limitations and shortcomings of existing methods, provide interpretable recognition foundations for sports science, and lay the groundwork for highly reliable AI applications in sports analysis. Based on this, the study proposes a novel model that combines an improved 3D convolutional network with an attention residual network. By introducing global average pooling (GAP) and optimizing the structure with asymmetric convolutions, this model effectively reduces CC while enhancing overall efficiency. To further enhance spatio-temporal feature extraction capabilities, a full pre-activation residual network (Pre-ResNet) and a convolutional attention mechanism with a three-branch structure (CAMTS) are adopted.

Unlike previous studies that primarily focused on single convolutional structures or single attention mechanisms, this research does not merely stack GAP, Pre-ResNet, and CAMTS as simple technical overlays. Instead, it achieves feature synergy and information flow integration among modules within the C3D framework, forming a novel model system with structural complementarity and functional coupling. Specifically, GAP extracts key features at the global level while reducing parameter complexity. Pre-ResNet enhances gradient propagation and deep feature learning stability through a normalization-then-activation approach. CAMTS enables cross-dimensional attention interactions between channel and spatial dimensions, thereby strengthening the model’s spatio-temporal feature extraction capabilities in complex underwater environments. The integration of these three components significantly enhances computational efficiency and recognition robustness while maintaining high accuracy.

The primary contributions of this research are threefold:

  (1) Establishing an improved C3D framework integrating GAP, Pre-ResNet, and CAMTS to achieve synergistic improvements in feature extraction accuracy and computational efficiency;
  (2) The introduction of cross-dimensional convolutional attention mechanisms effectively enhances the model's stability and generalization capabilities in complex underwater environments, such as varying lighting conditions and bubble interference;
  (3) Systematic experiments across multiple datasets demonstrate that the model significantly outperforms comparable methods in recognition accuracy, runtime, and error control, providing a novel technical pathway for intelligent swimming posture recognition and sports injury prevention.

2. Methods and materials

Aiming at the needs and technical difficulties of human swimming posture recognition, the study introduces batch normalization (BN) and GAP on the basis of C3D to reduce the model complexity and computation, and also adopts asymmetric convolution to improve computational efficiency. In addition, to enhance the processing capability for spatio-temporal feature data, the study introduces Pre-ResNet, which optimizes the residual blocks (RBs) through a structure that normalizes and activates before convolution. Meanwhile, CAMTS is integrated to further enhance feature extraction and classification by extracting channel and spatial attention features.

2.1. Construction of the improved convolutional 3D network

The development of computer vision technology has made it possible to extract human body postures from videos. By capturing images during swimming with a camera, the key points of the human body can be extracted using pose estimation algorithms for pose analysis [16–18]. However, traditional 2D-CNN often fails to adequately capture the information in the time dimension when processing video data, resulting in a limited recognition effect. To address this problem, researchers have gradually turned to 3D-CNN with a view to improving recognition accuracy by capturing both spatial and temporal features [19]. However, the standard 3D-CNN still has challenges in terms of CC and model optimization, and needs further improvement and optimization. C3D is a DL model specifically designed for video data processing. It is able to capture both spatial and temporal features in video frames by performing 3D convolutional operations in both spatial and temporal dimensions [20,21]. Compared with the traditional 2D-CNN, C3D is able to effectively capture the time series information in the video through 3D convolutional operations. This significantly improves the understanding and analysis of dynamic processes. The structure of C3D is shown in Fig 1.

As Fig 1 depicts, the structure of C3D is rather straightforward: eight 3D convolutional layers (CLs), five 3D pooling layers, two fully connected layers (FCLs), and one Softmax classifier. Starting from the input 3-channel 16-frame video clip, the data undergo multi-layer 3D convolution and 3D pooling operations, then feature integration through the FCLs, and finally the classification output is produced by the Softmax layer. The size of the convolution kernel (CK) of each layer is 3 × 3 × 3, and the size of the pooling kernel is 2 × 2 × 2. Although the FCLs of C3D are able to integrate the useful feature information in the data in a sophisticated way and output the best features, the number of parameters consumed during their operation is huge, causing the convergence of the whole network model to take a long time [22]. For this reason, the study introduces GAP to replace the FCL. GAP requires no training parameters and at the same time transforms the result of the convolutional operation in a feature-purifying way. It replaces the FCL by averaging all the values of each purified feature map (FM), thus significantly reducing the computation and parameters. Fig 2 displays the schematic diagrams (SD) of the FCL before and after GAP replacement.

Fig 2. Schematic diagram of the FCL before and after GAP replacement.

https://doi.org/10.1371/journal.pone.0337577.g002

Fig 2(a) displays the SD of the FCL before GAP replacement, and Fig 2(b) displays the SD of the FCL after GAP replacement. By reducing each FM to a single value through averaging, GAP decreases the parameters and CC significantly. In the typical FCL, by contrast, a significant number of parameters is needed to connect all of the FMs, which increases the computation of the model and makes overfitting a common occurrence. GAP simplifies the feature dimension by averaging all the values of each FM, replacing the traditional FCL, and unlike the FCL it requires no additional training parameters. This reduces the CC of the model and prevents the overfitting problem. In addition, although the data feature extraction ability of C3D is enhanced compared to 2D-CNN, the amount of convolutional parameter computation grows with the extra time dimension [23]. For this reason, to make C3D more lightweight, the study uses asymmetric split CKs to adjust its convolution form. The schematic of the CK before and after the adjustment is shown in Fig 3.
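To make the parameter saving concrete, the following is a minimal PyTorch sketch (an illustrative comparison, not the authors' implementation; the feature map shape, channel width and class count are assumptions):

```python
# Hypothetical shapes: a C3D-style feature volume of 512 channels, 2 frames, 7x7 maps.
import torch
import torch.nn as nn

num_classes = 4                       # e.g., freestyle/breaststroke/butterfly/backstroke
feat = torch.randn(2, 512, 2, 7, 7)   # (batch, channels, T, H, W) feature maps

# FCL head: flatten then a dense layer -> a large trainable weight matrix
fcl_head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 2 * 7 * 7, num_classes))

# GAP head: average each feature map to one value -> almost no extra weights
gap_head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(512, num_classes))

print(sum(p.numel() for p in fcl_head.parameters()))  # 200,708 parameters
print(sum(p.numel() for p in gap_head.parameters()))  # 2,052 parameters
print(gap_head(feat).shape)                           # torch.Size([2, 4])
```

The pooling itself is parameter-free; only the small classification layer retained here for illustration carries weights.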

Fig 3. Schematic before and after convolution kernel tuning.

https://doi.org/10.1371/journal.pone.0337577.g003

Fig 3(a) displays the SD of the merged convolution before adjustment, and Fig 3(b) displays the SD of the adjusted asymmetric split convolution. The adjusted asymmetric split convolution significantly reduces the CC and the parameters by splitting the conventional CK into multiple small CKs. For example, splitting a large convolutional operation into multiple smaller ones means that each computation requires fewer resources. In network training, because the structure of the FCL and the CK of C3D has been changed, changes in the data of one layer are less compatible with the layers that follow. This may reduce the processing speed of the subsequent Softmax classifier. For this reason, the study introduces a BN approach to data normalization to reduce this compatibility problem among the improved modules in C3D networks (C3DNs). In the underwater pose recognition task, the input data distribution can change at any time due to variations in lighting and unstable video quality, a phenomenon known as "internal covariate shift". BN stabilizes the data distribution and reduces gradient fluctuations during training by standardizing the input features (IFs) of each layer. This improves both the convergence speed and the robustness of the model in complex underwater environments, allowing it to maintain stable recognition performance under light changes and noise interference [24,25]. First, for each layer of IF x, the mean and variance of its small batch data are calculated as shown in Equation (1).

(1) $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$

In Equation (1), $\mu_B$ and $\sigma_B^2$ denote the mean and variance of the small batch data, respectively. $m$ denotes the number of samples in the small batch. $x_i$ denotes the $i$th sample. By normalizing the IF $x$ using the calculated mean and variance, the normalized feature is obtained as shown in Equation (2).

(2) $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

In Equation (2), $\hat{x}_i$ denotes the normalized feature, and $\epsilon$ denotes a small constant that prevents division by zero. A linear transformation is performed on the normalized feature to obtain the final output. The computational formula for this process is shown in Equation (3).

(3) $y_i = \gamma \hat{x}_i + \beta$

In Equation (3), both $\gamma$ and $\beta$ denote learnable parameters. In summary, an improved convolutional 3D network (IC3D) is proposed in the study. Fig 4 depicts the structure of the IC3D model.

In Fig 4, IC3D first preprocesses and segments each type of video into video frame images. The frames are then input into the first 3 × 3 × 3 3D CL to extract integrated features. After feature extraction, the data pass through the BN layer to make the data distribution consistent, and then through a 3D pooling layer to remove redundant information while retaining the timing information. After the initial feature extraction and pooling, the data enter the asymmetric 3D CLs and 3D point CLs. The spatio-temporal information is first extracted by the 3 × 1 × 7 and 3 × 7 × 1 CLs, respectively. After normalization by the BN layer, the data are input into the 3D point convolution layer to fuse spatio-temporal information across channels, and the result is finally output through Softmax. At this point, the IC3D convolution calculation is shown in Equation (4).

(4) $v_l^{x,y,z} = \sum_{i}\sum_{j}\sum_{k} w_l^{i,j,k}\, v_{l-1}^{x+i,\, y+j,\, z+k} + b_l$

In Equation (4), $v_l^{x,y,z}$ is the value at position $(x, y, z)$ in the output FM of the $l$th layer. $v_{l-1}^{x+i,\, y+j,\, z+k}$ is the value at the corresponding position in the input FM of the $(l-1)$th layer. $w_l^{i,j,k}$ is the weight of the CK of the $l$th layer at position $(i, j, k)$. $b_l$ denotes the bias term of the $l$th layer. GAP averages all the values across the FM and converts each FM to a single value, as shown in Equation (5).

(5) $g_c = \frac{1}{H \times W \times D} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{d=1}^{D} f_c(h, w, d)$

In Equation (5), $H$, $W$, and $D$ represent the height, width and depth of the FM, respectively. $g_c$ denotes the GAP result of the $c$th channel. $f_c(h, w, d)$ denotes the value of the FM at position $(h, w, d)$ in channel $c$.

2.2. Model construction of human swimming posture recognition by fusing attentional residual networks

Spatio-temporal data contain two key features in human swimming posture recognition: spatial features and temporal features. The spatial feature primarily describes the static characteristics of the human body in each image frame, including the shape, position, and distribution of the key points of the human posture. For example, the relative positions and action postures of a swimmer's arms, legs and head can reflect the uniqueness of different swimming postures. Temporal features, on the other hand, capture the dynamic changes between consecutive frames, describing the temporal sequence and transition relationships of the action. For example, the temporal features of an action include the process from beginning to end and the changes in a swimmer's stroke frequency and rhythm. Model accuracy is closely related to the extraction of these features. If the model can accurately capture and effectively differentiate these spatio-temporal features, it can recognize the subtle differences between different swimming strokes and improve recognition accuracy. For example, the spatial features help the model recognize a swimmer's pose at a given moment, while the temporal features help the model understand how the pose changes over time, ensuring continuity and coherence of the action. If the quality of either extraction is insufficient, pose recognition errors may increase, affecting the overall accuracy of the model. However, after the structural improvement of the C3DN model, it is found that there are certain deficiencies in the processing of spatio-temporal feature data in human swimming posture recognition. In particular, network degradation occurs easily in deep networks. To enhance the processing capability of IC3D for spatio-temporal feature data, the study introduces Pre-ResNet. Dynamic feature extraction between video frames is difficult in underwater environments due to light refraction and changes in motion speed. For this reason, the Pre-ResNet architecture improves gradient flow and alleviates the gradient vanishing problem in deep networks by first normalizing and activating, and then performing the convolution operation. Meanwhile, Pre-ResNet can better capture spatio-temporal features, especially under rapid action changes and occlusion, and shows high recognition accuracy [26,27]. Fig 5 depicts the original residual network of IC3D and Pre-ResNet [28].

Fig 5. Schematic structure of the original residual network and pre-ResNet.

https://doi.org/10.1371/journal.pone.0337577.g005

Fig 5(a) shows the original residual network structure of IC3D, and Fig 5(b) shows the network structure of Pre-ResNet. Pre-ResNet adopts a BN-ReLU-Conv-BN-ReLU-Conv structure in the RB, whereas the original residual structure of IC3D is Conv-BN-ReLU-Conv-BN. In contrast to the original structure, Pre-ResNet ensures that the input data of each layer are normalized and activated before entering the convolution operation by placing the BN and ReLU activation functions (AFs) before the convolution. This alleviates the gradient vanishing problem during network training while enhancing feature extraction and network stability [29,30]. In this case, the BN and ReLU AFs are handled computationally as shown in Equation (6).

(6) $F(x) = \mathrm{ReLU}(\mathrm{BN}(x))$

In Equation (6), $\mathrm{BN}(x)$ denotes BN processing of the IFs. $\mathrm{ReLU}(\cdot)$ denotes the ReLU AF. $F(x)$ denotes the output features (OFs) after BN and ReLU processing. At this time, the convolution calculation and residual connection calculation formulas for the $l$th convolution are shown in Equation (7).

(7) $y_l = W_l * F(x_l) + b_l, \qquad x_{l+1} = x_l + y_l$

In Equation (7), $x_l$ and $y_l$ are the IFs and OFs of the $l$th convolution. $W_l$ denotes the CK weight matrix of the $l$th layer. $b_l$ denotes the bias of the $l$th convolution. $x_{l+1}$ denotes the OFs of the residual connection. Considering the continuity of video frames of human actions during swimming, the key frames of a segment of continuous action often contain redundant frames when input into the IC3D [31,32]. To improve recognition and extraction accuracy, the study introduces a lightweight convolutional attention mechanism. The mechanism consists of two main components, channel attention and spatial attention, which adaptively assign higher weights to important features [33]. The channel attention mechanism focuses on extracting the key features in a video frame that are relevant to human posture while ignoring distractions such as air bubbles and reflections in the water. The spatial attention mechanism localizes and highlights the key action parts, allowing accurate recognition of human posture even when the video quality is poor or the viewing angle changes significantly. This approach performs well in several tasks; e.g., the CBAM studied by Agac S et al. [34] achieves significant improvement in image classification and target detection, Jiang M et al. [35] incorporate the CBAM in video action recognition, which dramatically improves spatio-temporal feature extraction, and the SENet used by Song et al. [36] improves the model's expressive ability through channel attention. This demonstrates the convolutional block attention mechanism's wide applicability in complex scenarios. However, existing convolutional attention mechanisms are limited when confronted with complex underwater environments; for example, they lack robustness in the face of rapid motion changes and dynamic backgrounds. To achieve association between the channel and spatial dimensions, the study introduces the idea of cross-dimensional interaction into this mechanism and proposes CAMTS. The structure of CAMTS is shown in Fig 6.

In Fig 6, CAMTS is divided into three branches. The left branch first inputs the C × H × W input tensor into the Z-pool layer (Z-PL). The Z-PL decreases the channels of the input tensor to 2, which reduces the amount of computation. Next, the 2 × H × W tensor is input to the CLs and the BN layer to obtain a 1 × H × W tensor, which then passes through the AF to produce the corresponding attention weights. In the middle branch, the three components of the input tensor are first adjusted to H × C × W and input into the Z-PL. After the number of channels is changed to 2, a 1 × W × C tensor is obtained through the convolutional and BN layers, and the attention weights are then obtained after the AF. For the right branch, the order of the three components is first adjusted to W × H × C, and the subsequent operations are consistent with the middle branch. Finally, the results obtained from the three branches are summed and averaged. The Z-PL expression is shown in Equation (8).

(8) $\mathrm{ZPool}(x) = \left[\mathrm{MaxPool}(x),\ \mathrm{AvgPool}(x)\right]$

In Equation (8), $\mathrm{MaxPool}(\cdot)$ and $\mathrm{AvgPool}(\cdot)$ are maximum pooling and average pooling, respectively, and $[\cdot\,,\cdot]$ denotes concatenation along the channel dimension. The convolutional attention mechanism formula is expressed in Equation (9).

(9) $F' = M \otimes F$

In Equation (9), $F$ and $M$ denote the output matrix of the previous layer and the one-dimensional channel weight matrix, respectively. $\otimes$ denotes the matrix multiplication. The mathematical expression of the weight matrix is shown in Equation (10).

(10) $M = \sigma\left(C1D_k\left(C1D_{16}\left(\tilde{F}\right)\right)\right), \qquad \tilde{F} = \mathrm{GAP}(F)$

In Equation (10), $\sigma$ and $F$ denote the Sigmoid AF and the input matrix, respectively. $\mathrm{GAP}(\cdot)$ denotes the GAP layer. $C1D_{16}$ and $C1D_k$ denote one-dimensional convolution with 16 CKs of size 1 and local cross-channel interactive convolution, respectively. $\tilde{F}$ denotes the FM formed from the input matrix after GAP and global attention. The cross-entropy loss function is used to judge the superiority of IC3D after CAMTS optimization, and its formula is shown in Equation (11).

(11) $L = -\sum_{i=1}^{N} p(x_i) \log q(x_i)$

In Equation (11), $q$ and $q(x_i)$ denote the predicted probability distribution and the predicted probability of the sample in category $i$, respectively. $p(x_i)$ denotes the actual distribution of the sample labels. In summary, the study combines IC3D and the optimization of its residual network with the extraction of action keyframes, and proposes a novel human swimming posture recognition model, i.e., C3D-GAP-Pre-ResNet-CAMTS. The operation flow of this model is shown in Fig 7.
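To illustrate the Z-pool of Equation (8) and a single CAMTS attention branch, the sketch below follows the Fig 6 description. It is an approximation under stated assumptions: the 7 × 7 kernel, the shared branch module across the three views and the 2D (per-frame) tensors are illustrative simplifications, not the authors' exact configuration:

```python
# Z-pool plus one attention branch; the three permuted views mimic CAMTS's branches.
import torch
import torch.nn as nn

def z_pool(x):
    # Eq. (8): concatenate channel-wise max pooling and average pooling -> 2 channels
    return torch.cat([x.max(dim=1, keepdim=True).values,
                      x.mean(dim=1, keepdim=True)], dim=1)

class AttentionBranch(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):                                    # x: (B, C, H, W)
        return torch.sigmoid(self.bn(self.conv(z_pool(x))))  # attention weights

branch = AttentionBranch()
x = torch.randn(2, 32, 14, 14)                # a C x H x W input tensor per sample
w_hw = branch(x)                              # left branch: attends over H x W
w_cw = branch(x.permute(0, 2, 1, 3))          # middle branch: H x C x W view
w_hc = branch(x.permute(0, 3, 2, 1))          # right branch: W x H x C view
# In the full model the three weighted views are permuted back, summed and averaged.
print(w_hw.shape, w_cw.shape, w_hc.shape)
```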

Fig 7. C3D-GAP-Pre-ResNet-CAMTS human swimming posture recognition process.

https://doi.org/10.1371/journal.pone.0337577.g007

In Fig 7, each type of video is first preprocessed to segment the swimming footage into consecutive frame images. Second, the frame images are fed into the first 3 × 3 × 3 3D CL for initial feature extraction, and spatio-temporal features are captured by multi-layer 3D convolution and 3D pooling operations. Subsequently, the BN layer normalizes the input, and the GAP layer is employed to reduce the parameters and CC. Next, the data enter the asymmetric 3D CLs and 3D point CLs to extract spatio-temporal information and perform cross-channel fusion, respectively. After that, Pre-ResNet is introduced in the RB; the learning ability and recognition accuracy of the model are improved by the normalization-and-activation-then-convolution operations. Finally, CAMTS is introduced to extract channel and spatial attention features, and the results are output through Softmax.

3. Results

The research builds an appropriate testing setup to verify the performance of this novel human swimming posture recognition model. First, the final model is validated by an ablation test. Second, models of the same type are introduced to test accuracy, error rate and other relevant metrics. In addition, inter-model comparison tests are conducted with four types of real swimming posture video data to verify the practical application effect and reliability of the new model.

3.1. Performance testing of the human swimming posture recognition model

Standard experimental equipment and parameters are selected, and the Swim-Pose Dataset (SPD) and Human Swim Dataset (HSD) are used as the data sources for testing. SPD contains video clips of multiple swim strokes and corresponding pose annotations. These videos include swimmers of different ages, genders and skill levels, providing rich pose information for algorithm training and testing. HSD collects a large number of underwater and water-surface swimming videos covering a wide range of strokes such as freestyle, backstroke, breaststroke and butterfly. Each video is accompanied by detailed pose annotations, which facilitates pose recognition and analysis. To ensure the model's generalization and fairness, the study further analyzes the sample distribution across both datasets. The SPD dataset comprises 480 swimmers, with males accounting for 53% and females 47%, spanning ages 14–38; by training level, elite athletes constitute 35% and amateur swimmers 65%. The HSD dataset comprises 520 video samples corresponding to 512 swimmers, with males accounting for 55% and females 45%, aged between 16 and 40 years; the ratio of elite athletes to recreational swimmers is approximately 4:6. Both datasets exhibit relatively balanced gender and age distributions, effectively mitigating potential group bias in model training. Comprehensive data preprocessing, i.e., data cleansing, data transformation, data integration, and data reduction, has been performed on both datasets. The data are divided into a training set (TrS) and a test set (TeS) in the ratio of 8:2 for the integrated training of the initial model. Table 1 displays the experimental setup and parameter configuration.

The hardware and software configurations as well as the network parameter settings for this experiment are given in Table 1. The dataset information and experimental setup above form the basis of the investigation. The final human swimming posture recognition model is subjected to an ablation test on the TrS to confirm the functionality of its modules. Changes in the validation set loss are also monitored during training: training is stopped when the validation loss no longer decreases for 10 consecutive iterations, to avoid overfitting the model to the training data. The test results are shown in Fig 8.
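The early-stopping criterion described above can be written as a short loop. This is a minimal sketch in which simulated losses stand in for the real validation losses:

```python
# Stop training once the validation loss has not improved for 10 consecutive checks.
import random

best_loss, patience, wait = float("inf"), 10, 0
for epoch in range(600):
    val_loss = 1.0 / (epoch + 1) + random.uniform(0, 0.05)   # simulated validation loss
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0        # improvement: reset the stagnation counter
    else:
        wait += 1
        if wait >= patience:                 # 10 stagnant checks in a row -> stop
            print(f"early stop at epoch {epoch}, best loss {best_loss:.4f}")
            break
```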

Fig 8. Ablation test results of a novel human swim stroke recognition model.

https://doi.org/10.1371/journal.pone.0337577.g008

Fig 8(a) illustrates the ablation test results of the novel human swimming posture recognition model on the TrS, and Fig 8(b) illustrates the results on the TeS. In Fig 8(a), on the TrS, all modules of the new model contribute to excellent recognition accuracy. The C3D module alone reaches 73% recognition accuracy in the late stage of training. However, after the sequential introduction of the GAP, Pre-ResNet and CAMTS modules, C3D-GAP-Pre-ResNet-CAMTS achieves up to 96% recognition accuracy for swim strokes when the number of model iterations reaches 600. In Fig 8(b), on the TeS, the overall test performance is consistent with the TrS, and both show that pose recognition accuracy increases with the iterations. When the number of iterations is 250, the swimming pose recognition accuracy of C3D-GAP-Pre-ResNet-CAMTS is up to 95%. It can be concluded that all the modules of this novel model have positive effects on its recognition operation. The study introduces advanced models of the same type as C3D for comparison using mean average precision (MAP) as a metric, e.g., two-stream 3D-CNN (TS-3D-CNN), residual 3D convolutional network (Res3D) and 3D group CNN (3D-GCN). The test results are shown in Fig 9.

Fig 9(a) displays the MAP test results of each model type under the SPD, and Fig 9(b) displays the MAP test results under the HSD. In Fig 9(a), the MAP values of all four models exhibit a declining trend as the number of samples rises. Compared to TS-3D-CNN, Res3D and 3D-GCN, the proposed model reaches its stable MAP value the fastest, 0.63, with the number of test samples close to 130, while the stable MAP values of TS-3D-CNN, Res3D and 3D-GCN on the TrS are 0.53, 0.54 and 0.56, respectively. A declining tendency on the TeS is also evident in the MAP values of the four models in Fig 9(b). The stable MAP values of TS-3D-CNN, Res3D, 3D-GCN and the proposed model on the TeS are 0.52, 0.53, 0.55 and 0.61, respectively. These figures show that the proposed model outperforms the more advanced models of the same C3D type in identification and detection. Precision (P), recall (R), F1 value, mean squared error (MSE) and mean absolute error (MAE) are then used as reference indexes, and TS-3D-CNN, Res3D, 3D-GCN and the proposed model are compared on the SPD and HSD datasets. The test results are shown in Table 2.

In Table 2, the performance of the proposed model is significantly better than the other models on both datasets. On the SPD, the new model has the highest P of 93.28%, the highest R of 92.47%, the highest F1 of 92.87%, and the lowest MSE and MAE, both 0.01, while the other models perform below this. On the HSD, the new model continues to lead with a P of 92.23%, an R of 93.26%, an F1 of up to 92.74%, and MSE and MAE of 0.01 and 0.02, respectively. In contrast, 3D-GCN has an F1 value of 89.89%, with MSE and MAE of 0.02 and 0.03, respectively. It can be concluded that the proposed model outperforms the other models across the accuracy indexes and performs well on the error indexes, proving its validity and reliability in the human swimming posture recognition task.

3.2. Simulation testing of a human swimming posture recognition model

To validate the practical application of the proposed human swimming posture recognition model, the study randomly selects 4 types of classical swimming posture video data from the SPD and HSD, namely freestyle, breaststroke, butterfly and backstroke. Each type of swimming posture contains at least 4 different video clips, and a single video is 25 frames per second. These 4 types of swimming posture video clips are used as the dataset for the subsequent comparative simulation tests. The 4 types of swimming postures are shown in Fig 10.

Fig 10(a), (b), (c) and (d) display the freestyle, breaststroke, butterfly and backstroke video actions, respectively. Combining these four types of swimming actions and their poses, the study introduces more advanced models in the field of action pose recognition for comparison, namely LSTM, the spatial temporal graph convolutional network (ST-GCN), and the multi-scale temporal convolutional network (MS-TCN). Fig 11 presents the test findings.

Fig 11. Swimming posture recognition error measurement with different models.

https://doi.org/10.1371/journal.pone.0337577.g011

Fig 11 shows the recognition test results of the four models for the various types of poses under the SPD and HSD. In Fig 11(a), all four models show good recognition results on both test datasets, and the proposed model performs best. The quantitative data show that the LSTM model has a recognition error as low as 2.5% for the butterfly pose, while its highest pose recognition error, for backstroke, is close to 8%. The ST-GCN model has its lowest recognition error of 2.7% for butterfly and its highest of 7.1% for backstroke. The MS-TCN has its lowest recognition error of 2.3% for butterfly and its highest of 8.5% for backstroke. In Fig 11(b), the pose recognition errors of the proposed model on the HSD are as low as 4.7%, 4.9%, 2.1%, and 6.6% for freestyle, breaststroke, butterfly, and backstroke, respectively. This indicates that the proposed model offers significant recognition performance and robustness among the many compared models. The study then selects butterfly, which has a high recognition rate, and conducts 8 tests for each of the above models. Fig 12 displays the timing results.

Fig 12. Comparison of recognition time of different models for butterfly swimming.

https://doi.org/10.1371/journal.pone.0337577.g012

Fig 12(a) and 12(b) show the computation time comparisons of LSTM and ST-GCN, respectively, with the proposed model, and Fig 12(c) shows the comparison between MS-TCN and the proposed model. In Fig 12, the gray circled line indicates the standard time, the blue circled line indicates the time used by the compared model, and the green circled line indicates the proposed model. In Fig 12(a), during butterfly pose recognition, the trend of each recognition time curve of the LSTM model is consistent with the standard time, but its difference from the standard time is at most 3 s. In Fig 12(b), the recognition time trend of ST-GCN shows some gap with the standard time trend, with a maximum time difference of 4.1 s. In Fig 12(c), the MS-TCN time difference can be as small as 2.1 s, but there is still some gap compared to the detection time of the proposed model. The shortest detection time of the proposed model is 4.5 s, which is significantly shorter than the 13 s of LSTM, 12.2 s of ST-GCN, and 11.1 s of MS-TCN. The study is then tested with Top-K accuracy (Top-K), the effective number of recognitions and the average running time as metrics. Table 3 displays the test results.

Table 3. Test of pose recognition metrics for different models.

https://doi.org/10.1371/journal.pone.0337577.t003

Table 3 presents the recognition test results of the various models across the different swimming strokes: freestyle, breaststroke, butterfly, and backstroke. The evaluation metrics include Top-K, the effective number of recognitions, and the average recognition time. In Table 3, the proposed model's Top-K is higher than that of the other compared models for all four swimming postures, especially in the butterfly and backstroke tests, where the Top-K values reach 93.46% and 93.28%, respectively, representing the best performance. Additionally, the average recognition time of the proposed model is significantly shorter, at around 7 s for all poses, which is at least 4 s faster than the LSTM, ST-GCN, and MS-TCN models. In terms of effective recognitions, the proposed model requires as few as seven recognitions for each pose, indicating that it is more efficient and makes fewer redundant judgments. Overall, the proposed model demonstrates significant advantages in Top-K, recognition efficiency, and recognition time, proving its effectiveness and robustness in swimming pose recognition tasks. The study is then cross-validated on an external dataset against four models under different water clarity conditions. The results are shown in Table 4.

Table 4 shows that the proposed model exhibits high accuracy and F1 values under all water clarity conditions, especially under high-clarity conditions, where it achieves an accuracy of 90.85% and an F1 value of 93.44%. In contrast, the Transformer-based enhanced model and the ViT model show reduced performance under low-clarity conditions, with accuracy rates of 88.42% and 88.13%, and F1 values of 87.63% and 86.41%, respectively, reflecting their weaker robustness in recognizing complex underwater environments. In addition, the LLM-enhanced model has slightly higher accuracy, reaching 92.46% under some conditions, but its F1 value is only 88.22%, lower than that of the proposed model, indicating deficiencies in its feature extraction and classification effects.

4. Discussion

From the perspective of sports science, the C3D-GAP-Pre-ResNet-CAMTS model proposed in this study not only achieves significant improvements in algorithmic performance but also demonstrates potential application value in practical training and rehabilitation scenarios. First, by integrating multi-dimensional convolutions with attention mechanisms, the model achieves precise quantification of swimming stroke characteristics. This transforms posture assessment from subjective observation into objective computation based on key points and spatio-temporal features, thereby providing data support for developing personalized training plans. Coaches can leverage the model’s outputs—including motion trajectory curves, arm-stroke frequency, body tilt angle, and symmetry metrics—to quantitatively analyze athletes’ technical stability and movement continuity. This enables targeted adjustments to training load and pacing across different training phases.

Second, the model’s advantage in capturing temporal features enables real-time identification of technical deviations and signs of fatigue through dynamic changes in consecutive frame postures. For instance, when the model detects gradual decreases in shoulder entry angle or kick amplitude, or imbalances in breathing rhythm, it can automatically flag potential fatigue trends or technical deterioration. This assists coaches in early intervention to prevent the solidification of incorrect form.

Furthermore, research demonstrates the model’s stable recognition of movement postures in complex underwater environments, enabling proactive prevention of sports injury risks. By precisely identifying shoulder rotation range, lumbar twist angles, and body balance postures, the model can detect high-risk movement patterns during exercise and provide real-time feedback to training systems, thereby reducing the incidence of common injuries like shoulder impingement syndrome and lumbar muscle strain. Integrated with wearable sensors or video analysis platforms, this model holds future potential for embedding into intelligent training assistance systems, enabling automated monitoring and risk alerts during workouts.

In summary, the proposed model achieves breakthroughs in both recognition accuracy and computational efficiency. More significantly, it advances scientific and intelligent approaches to swimming training through quantifiable, feedback-driven methods, offering new practical support for the convergence of sports science and artificial intelligence.

5. Conclusion

Aiming at the problems of high CC and insufficient capture of key information in the recognition process, the study attempted to further enhance the effectiveness of human swimming posture recognition, to help athletes improve their skills and reduce sports injuries. Firstly, GAP and BN were introduced to restructure the classical C3D convolutional model for action recognition. Secondly, Pre-ResNet and CAMTS were added to optimize the feature extraction and data processing of the model. Finally, a novel human swimming posture recognition model was proposed. Experimental results indicated that the proposed model achieved a maximum swimming style recognition accuracy of 95% after 250 iterations. Compared to TS-3D-CNN, Res3D, and 3D-GCN, the proposed model converged to a stable MAP of 0.63 most rapidly, with the number of test samples approaching 130 at that point. On the SPD and HSD datasets, the new model achieved a maximum P of 93.28%, a maximum R of 93.26%, and a maximum F1 value of 92.87%. Furthermore, testing on four classic swimming stroke video datasets revealed that the proposed model achieved the lowest recognition errors for freestyle (4.7%), breaststroke (4.9%), butterfly (2.1%), and backstroke (6.6%). The shortest recognition time for the four strokes was 4.5 s, significantly shorter than the 13 s for LSTM, 12.2 s for ST-GCN, and 11.1 s for MS-TCN. The model achieved a maximum Top-K value of 93.46% in butterfly stroke testing. It required the minimum number of effective recognitions in breaststroke, butterfly, and backstroke (7 each), with the shortest average runtime of 6.78 s for freestyle events. In summary, the proposed model demonstrated superior performance compared to existing models across multiple evaluation metrics, validating its effectiveness and reliability for human swimming posture recognition tasks.

6. Limitations and future work

Despite these achievements, the research still has several areas requiring further refinement. First, the relatively limited scale of the SPD and HSD datasets, with insufficient sample size and diversity, may impact the model's generalization performance on larger datasets. Second, the model training and recognition processes rely heavily on high-performance computing equipment, limiting real-time deployment in standard training venues and on portable devices. Third, the model performs well under standard underwater lighting conditions, but its effectiveness under multi-source illumination, dynamic reflections, and complex background interference requires further validation. Additionally, there is room for optimization in the model's processing speed and stability when handling real-time video streams. This study also did not consider how underwater ambient lighting and water conditions affect the test data; to enhance the study's comprehensiveness, future research could explore how different lighting angles, intensities, and water quality affect model performance.

Supporting information

References

  1. Dong Z, Wang X. An improved deep neural network method for an athlete's human motion posture recognition. IJICT. 2023;22(1):45.
  2. Xia H, Khan MA, Li Z, Zhou M. Wearable Robots for Human Underwater Movement Ability Enhancement: A Survey. IEEE/CAA J Autom Sinica. 2022;9(6):967–77.
  3. Giulietti N, Caputo A, Chiariotti P, Castellini P. SwimmerNET: Underwater 2D Swimmer Pose Estimation Exploiting Fully Convolutional Neural Networks. Sensors (Basel). 2023;23(4):2364. pmid:36850962
  4. Liu Q. Aerobics posture recognition based on neural network and sensors. Neural Comput & Applic. 2021;34(5):3337–48.
  5. Wang H, Wu Z, Zhao X. Surface and underwater human pose recognition based on temporal 3D point cloud deep learning. Sci Rep. 2024;14(1):55. pmid:38167475
  6. Xu B. RETRACTED ARTICLE: Optical image enhancement based on convolutional neural networks for key point detection in swimming posture analysis. Opt Quant Electron. 2023;56(2).
  7. Chen L, Hu D. An effective swimming stroke recognition system utilizing deep learning based on inertial measurement units. Advanced Robotics. 2022;37(7):467–79.
  8. Wang Z, Ruan Z, Chen C. DyFish-DETR: Underwater Fish Image Recognition Based on Detection Transformer. JMSE. 2024;12(6):864.
  9. Chen L, Yan X, Hu D. A Deep Learning Control Strategy of IMU-Based Joint Angle Estimation for Hip Power-Assisted Swimming Exoskeleton. IEEE Sensors J. 2023;23(13):15058–70.
  10. Antillon DWO, Walker CR, Rosset S, Anderson IA. Glove-Based Hand Gesture Recognition for Diver Communication. IEEE Trans Neural Netw Learn Syst. 2023;34(12):9874–86. pmid:35439141
  11. Cao X, Yan WQ. Pose estimation for swimmers in video surveillance. Multimed Tools Appl. 2023;83(9):26565–80.
  12. Fan L, Zhang Z, Zhu B, Zuo D, Yu X, Wang Y. Smart-Data-Glove-Based Gesture Recognition for Amphibious Communication. Micromachines (Basel). 2023;14(11):2050. pmid:38004907
  13. Liu T, Zhu Y, Wu K, Yuan F. Underwater accompanying robot based on SSDLite gesture recognition. Appl Sci. 2022;12(18):9131.
  14. Abdul Rahman FY, Kamaruzzaman AA, Shahbudin S, Mohamad R, Suriani NS, Suliman SI. Translating hand gestures using 3d convolutional neural network. IJARBSS. 2022;12(6).
  15. Wu Y, Xu H, Wu X, Wang H, Zhai Z. Identification of fish hunger degree with deformable attention transformer. JMSE. 2024;12(5):726.
  16. Akila K. RETRACTED: Recognition of inter-class variation of human actions in sports video. IFS. 2022;43(4):5251–62.
  17. Morshed MG, Sultana T, Alam A, Lee Y-K. Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities. Sensors (Basel). 2023;23(4):2182. pmid:36850778
  18. Huang X, Xue Y, Ren S, Wang F. Sensor-Based Wearable Systems for Monitoring Human Motion and Posture: A Review. Sensors (Basel). 2023;23(22):9047. pmid:38005436
  19. Wang L, Su B, Liu Q, Gao R, Zhang J, Wang G. Human Action Recognition Based on Skeleton Information and Multi-Feature Fusion. Electronics. 2023;12(17):3702.
  20. Xiao H, Li Y, Xiu Y, Xia Q. Development of outdoor swimmers detection system with small object detection method based on deep learning. Multimedia Syst. 2022;29(1):323–32.
  21. Zhang J, Xu K, Zhao S, Wang R, Gu B. Automatic recognition of the neck–shoulder shape based on 2D photos. Textile Research J. 2022;92(23–24):5095–105.
  22. Yang R, Wang K, Yang L. An Improved YOLOv5 Algorithm for Drowning Detection in the Indoor Swimming Pool. Applied Sci. 2023;14(1):200.
  23. Cao Y, Ma S, Cao Y, Pan G, Huang Q, Cao Y. Similarity evaluation rule and motion posture optimization for a manta ray robot. J Mar Sci Eng. 2022;10(7):908–9.
  24. Tseng S-P, Hsu S-E, Wang J-F, Jen I-F. An Integrated Framework with ADD-LSTM and DeepLabCut for Dolphin Behavior Classification. JMSE. 2024;12(4):540.
  25. Chen L, Hu D, Han X. Study on forearm swing recognition algorithms to drive the underwater power-assisted device of frogman. J Field Robotics. 2021;39(1):14–27.
  26. Comas-González Z, Mardini J, Butt SA, Sanchez-Comas A, Synnes K, Joliet A, et al. Sensors and Machine Learning Algorithms for Location and POSTURE Activity Recognition in Smart Environments. Aut Control Comp Sci. 2024;58(1):33–42.
  27. Hameed Siddiqi M, Alshammari H, Ali A, Alruwaili M, Alhwaiti Y, Alanazi S, et al. A Template Matching Based Feature Extraction for Activity Recognition. Computers, Materials & Continua. 2022;72(1):611–34.
  28. Nogales RE, Benalcázar ME. Hand Gesture Recognition Using Automatic Feature Extraction and Deep Learning Algorithms with Memory. BDCC. 2023;7(2):102.
  29. Vásconez JP, Barona López LI, Valdivieso Caraguay ÁL, Benalcázar ME. Hand Gesture Recognition Using EMG-IMU Signals and Deep Q-Networks. Sensors (Basel). 2022;22(24):9613. pmid:36559983
  30. Ramalingam B, Angappan G. A deep hybrid model for human-computer interaction using dynamic hand gesture recognition. Comput Assist Methods Eng Sci. 2023;30(3):263–76.
  31. Jain R, Karsh RK, Barbhuiya AA. Literature review of vision-based dynamic gesture recognition using deep learning techniques. Concurrency and Computation. 2022;34(22).
  32. Gionfrida L, Rusli WMR, Kedgley AE, Bharath AA. A 3DCNN-LSTM Multi-Class Temporal Segmentation for Hand Gesture Recognition. Electronics. 2022;11(15):2427.
  33. Abba Haruna A, Muhammad LJ, Abubakar M. An Expert Green Scheduling System for Saving Energy Consumption. AIA. 2022.
  34. Agac S, Durmaz Incel O. On the use of a convolutional block attention module in deep learning-based human activity recognition with motion sensors. Diagnostics (Basel). 2023;13(11):1861. pmid:37296713
  35. Jiang M, Yin S. Facial expression recognition based on convolutional block attention module and multi-feature fusion. IJCVR. 2023;13(1):21.
  36. Song S, Zhang S, Dong W, Li G, Pan C. Multi-source information fusion meta-learning network with convolutional block attention module for bearing fault diagnosis under limited dataset. Structural Health Monitoring. 2023;23(2):818–35.