
LLM-FMS: A fine-grained dataset for functional movement screen action quality assessment

  • Qingjun Xing,

    Roles Conceptualization, Formal analysis, Investigation, Writing – original draft

    Affiliation School of Sport Science, Beijing Sport University, Beijing, China

  • Xuyang Xing,

    Roles Data curation, Formal analysis, Methodology, Visualization

    Affiliation Department of Automation, Nanjing University of Science and Technology, Nanjing, Jiangsu, China

  • Ping Guo,

    Roles Conceptualization, Data curation, Methodology, Project administration, Writing – review & editing

    Affiliation Intel Labs China, Beijing, China

  • Zhenhui Tang,

    Roles Investigation

    Affiliation Department of Automation, Shanghai Jiao Tong University, Shanghai, China

  • Yanfei Shen

    Roles Conceptualization, Funding acquisition, Writing – review & editing

    syf@bsu.edu.cn

    Affiliation School of Sport Engineering, Beijing Sport University, Beijing, China

Abstract

The Functional Movement Screen (FMS) is a critical tool for assessing an individual’s basic motor abilities, with the aim of preventing sports injuries. However, current automated FMS evaluation relies on deep learning methods whose output is limited to rank scores, which lack fine-grained feedback and offer poor interpretability. This limitation prevents the effective application of automated FMS evaluation to injury prevention and rehabilitation. We develop a fine-grained, hierarchical FMS dataset, LLM-FMS, derived from FMS videos and enriched with detailed, hierarchical action annotations. The dataset comprises 1,812 action keyframe images from 45 subjects, covering the 15 test movements of the seven FMS actions. Each action includes a score, scoring criteria, and weight data for body parts. To the best of our knowledge, LLM-FMS is the first fine-grained fitness action dataset for the action evaluation task. Additionally, we propose a novel framework for action quality assessment based on large language models (LLMs), designed to enhance the interpretability of FMS evaluations. Our method integrates expert rules, uses RTMPose to extract key skeletal-level action features from keyframes, and feeds prompts to the LLM, enabling it to infer scores and provide detailed rationales. Experimental results demonstrate that our approach outperforms existing methods in both accuracy and interpretability, with a substantial increase in the clarity and detail of the rationales provided. These findings highlight the potential of our framework for fine-grained action quality assessment with the aid of LLMs.

Introduction

Poor behavioral and lifestyle habits in contemporary life negatively affect individual physical function. This impact is particularly evident during physical activity, manifesting primarily as reduced exercise capacity [1,2]. The Functional Movement Screen (FMS) is a widely used, efficient tool in sports medicine for screening fundamental movement patterns. It quickly identifies deficiencies in these patterns, effectively reducing the risk of sports injuries [3,4]. FMS consists of seven fundamental movement patterns: deep squat, hurdle step, inline lunge, shoulder mobility, active straight leg raise, trunk stability push-up, and rotational stability. These movements assess flexibility, stability, and motor control. To assess symmetry and balance, movements are performed on both sides, totaling 15 test movements, as illustrated in Fig 1. Together, these movements provide a comprehensive perspective for assessing an individual’s physical functionality.

Fig 1. The 15 movements of FMS.

From left to right, from top to bottom.

https://doi.org/10.1371/journal.pone.0313707.g001

The effectiveness of the FMS lies in its scoring system, which quantifies the quality and stability of individual movements [3]. The scoring system consists of four levels: 0, 1, 2, and 3 points. Higher scores indicate better movement proficiency and stability. It is important to note that if the subject experiences pain during the assessment, the movement is scored as 0. The remaining three levels reflect the degree of movement execution and stability, provided there is no discomfort or pain. FMS has been shown to aid in injury prevention and has broad applications in sports training, rehabilitation, and public fitness [4–6].

Traditional FMS assessment methods rely on experts’ on-site visual inspection and palpation. However, these methods have several limitations. First, the assessment process is time-consuming, and the scoring results may be influenced by the subjective experience of the evaluators. Second, there is a shortage of experts with extensive experience, and their distribution is geographically imbalanced, particularly in economically underdeveloped regions where access to professional assessment resources is limited [5,7,8]. Therefore, utilizing computer technology and artificial intelligence to automate and enhance functional movement screening is crucial for improving assessment efficiency, minimizing subjective bias, and broadening service coverage [9].

Recent studies have confirmed the feasibility of automating objective FMS evaluation. In 2014, Whiteside et al. [9] compared FMS scores assigned by human testers to those generated by a motion capture system. Their results demonstrated that, compared to manual scoring, the automated FMS scoring system provided more objective assessments, significantly reducing the influence of subjectivity. Hong et al. [10] proposed an automated FMS assessment method based on an improved Gaussian mixture model. This method extracted action features corresponding to different scores and trained a Gaussian mixture model, utilizing maximum likelihood estimation for scoring. The results showed high consistency with expert evaluations. Unlike the aforementioned methods that rely on manual feature extraction, Lin et al. [11] introduced an automatic FMS assessment method using deep learning. This approach employs the I3D network for video feature extraction, incorporating an attention mechanism and a multi-layer perceptron to capture spatio-temporal video features at multiple scales and levels, thereby improving the accuracy and reliability of the evaluation. Shen et al. [12] developed an FMS assessment framework based on a multi-view deep neural network (MVDNN). This framework combines automatic skeleton feature extraction with manual feature selection, extracting three-dimensional trajectory features of movements from two perspectives. By integrating automatic video analysis with deep learning, this framework can assess the quality of an individual’s FMS performance without the need for physical markers. Also targeting automatic FMS evaluation, Lin et al. [13] proposed an automatic assessment method for functional movement screening based on a two-stream network and feature fusion. Their method uses the RAFT algorithm to estimate video optical flow, which captures spatio-temporal characteristics better than single-stream methods, and applies attention fusion to combine optical flow features with the original frames, improving prediction accuracy. The resulting framework outperforms the method in [11].

While the automated assessment methods ensure objective action scoring and effectively reduce the influence of expert subjectivity, there are still several problems. The datasets used in these studies contained only expert scores and lacked fine-grained annotations, limiting the ability of assessment methods to provide detailed scoring feedback. In addition, most deep learning-based models are black-box systems, rendering the decision-making process opaque. Consequently, users are unable to receive actionable feedback or recommendations, hindering the potential for subsequent training and improvement.

Large language models (LLMs) are a class of artificial intelligence models with powerful contextual understanding and reasoning capabilities. Recent studies have demonstrated the potential of LLMs in action analysis and motor control. Zhao et al. [14] leverage the contextual modeling and reasoning capabilities of LLMs, as well as the potential of multi-modal fusion, to propose a two-stage framework called AntGPT. It first identifies actions that have been performed in the observed video text data, and then asks the LLM to predict future actions through conditional generation, or infer goals and plan the entire process through thought prompts. This framework achieves the best performance on benchmarks such as Ego4D LTA and EPIC-Kitchens-55. Joublin et al. [15] proposed a hierarchical replanning architecture, which exploits the commonsense knowledge and implicit reasoning capabilities of LLMs to implement corrective replanning strategies in robot task planning, and verified the effectiveness of the architecture in simulated and real world environments. The study demonstrates the potential of LLMs to deal with physically grounded, logical, and semantic errors, and how feedback can be used to reevaluate and adjust plans in a timely manner.

We believe that LLMs can facilitate fine-grained action quality assessment (AQA), and we therefore constructed a new FMS evaluation dataset, LLM-FMS, the first fine-grained dataset designed for assessing action quality with LLMs. LLM-FMS has two key features: (1) A three-level semantic structure. All keyframes of FMS are annotated with three levels of semantic tags: scores, scoring details, and body part information. The scoring details provide specific explanations of the action performance, enabling the model to better interpret action quality; (2) The dataset contains action keyframes and corresponding fine-grained annotation files, and its format is optimized for generating prompts for large language models.

This paper also proposes a framework that leverages the reasoning capabilities of LLMs to evaluate action quality on the LLM-FMS dataset (Fig 4). The framework first utilizes the open-source pose estimation tool, RTMPose [16], to extract skeletal data from FMS keyframes, followed by the extraction of action feature evaluation metrics based on predefined scoring rules. These evaluation results are then embedded into a prompt and fed into the LLM, which assigns scores to the FMS actions and provides detailed scoring feedback according to expert rules embedded in the prompt. To the best of our knowledge, this is the first method to apply LLMs to the fine-grained automated assessment of FMS. In summary, the main contributions of this paper are as follows:

  • We constructed the first fine-grained dataset for FMS. Building on the publicly available FMS dataset by Xing et al. [17], we extracted keyframes from each action segment and performed a three-level semantic annotation, including action scores, scoring details, and body parts associated with scoring points, as illustrated in Table 1. This dataset represents the first fine-grained scoring resource for FMS action assessment.
  • We developed a model with strong interpretability based on LLMs. Unlike traditional black-box models, our method offers greater transparency, as the entire assessment process is grounded in well-defined expert rules and knowledge bases. This allows users to easily comprehend the model’s decision-making process. The enhanced transparency not only improves the model’s credibility but also increases its acceptability and potential for adoption in practical applications.
  • Fine-grained action assessment and improvement feedback. Our method not only provides a score but also accurately identifies specific action errors and offers targeted improvement suggestions based on expert knowledge. This approach delivers practical, actionable feedback to testers and athletes, enabling them to refine specific techniques and enhance overall sports performance.

Methods

LLM-FMS fine-grained dataset

In this section, we introduce a novel fine-grained FMS action keyframe dataset, LLM-FMS. We will present details on its construction and statistical properties.

Dataset construction.

We extract RGB image data from the FMS multimodal dataset made publicly available by Xing et al. [17]. Utilizing RTMPose, we extracted the skeletal data for each frame, as illustrated in Fig 2, and calculated the cosine similarity according to rules to identify keyframe images from each action sequence. Finally, we enlisted experienced FMS experts to re-score the keyframe images and perform fine-grained annotations.

Dataset dictionary.

We constructed a fine-grained keyframe dataset characterized by a three-level semantic hierarchy, which includes action scores, scoring details, and body part information, as illustrated in Table 1. To facilitate this process, we engaged an experienced, FMS-certified expert to assist in establishing the rules and performing the three-level semantic annotation of the action keyframe images.

Regarding the semantic structure illustrated in Table 1, the scoring label denotes the specific score assigned to an individual’s action, while the scoring detail label provides specific information regarding the rationale behind that score. The scoring details vary across different actions. Additionally, the body part label identifies the primary body parts involved in the action, with the quality of limb movement in these areas directly influencing the score.

This study performs a fine-grained assessment of FMS actions utilizing the above three-level semantic hierarchy.
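As an illustration, a single keyframe record under this three-level hierarchy could be represented as follows. The field names and values are hypothetical, shown only to make the hierarchy concrete; the released annotation files may use a different schema.

```python
# Hypothetical example of one LLM-FMS annotation record, illustrating the
# three-level semantic hierarchy (score -> scoring details -> body parts).
# Field names and values are illustrative, not the dataset's actual schema.
annotation = {
    "subject_id": 7,
    "movement": "deep_squat",
    "keyframe": "subject07_deep_squat.jpg",
    "score": 2,                                  # level 1: expert score (1-3)
    "scoring_details": [                         # level 2: rationale for the score
        "trunk not parallel to the tibia",
        "hips descend below the knees",
    ],
    "body_parts": ["trunk", "hip", "knee"],      # level 3: body parts involved
}

def summarize(record):
    """Render one record as a short human-readable line."""
    details = "; ".join(record["scoring_details"])
    return f"{record['movement']} -> score {record['score']} ({details})"
```

A structure of this shape maps directly onto prompt generation: each level of the hierarchy becomes one section of the LLM prompt.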

Dataset annotation.

We have developed a desktop annotation tool, illustrated in Fig 3, to enable FMS experts to conduct fine-grained semantic annotations of action keyframes more efficiently. Given a keyframe image, experts annotate each action using a predetermined dictionary. The annotation process consists of two stages, progressing from coarse to fine granularity. The coarse-grained stage involves marking the score for each action instance, while the fine-grained stage entails detailing the scoring aspects that contribute to the overall action score and recording the key body parts involved.

The entire annotation process is conducted by a single FMS expert to ensure consistency across all keyframe annotations. The total duration of the annotation process is approximately 40 hours.

Basic information of the dataset.

The LLM-FMS dataset comprises 1,812 keyframe images from 45 subjects, covering the 15 FMS test movements of varying difficulty, as illustrated in Fig 1. The demographic information of the subjects is provided in S1 Table. All subjects signed informed consent forms and agreed to share their experimental data for scientific research. This study was also reviewed and approved by the Ethics Committee of Sports Science Experiment of Beijing Sport University (Approval number: 2021156H). Each action is associated with a score, several scoring details, and body part information. These data facilitate a comprehensive assessment of the flexibility of the individual’s left and right sides and allow for the examination of performance under varying difficulty levels. Table 2 presents detailed information about our dataset and compares it with existing AQA datasets for sports fitness and rehabilitation. Our dataset differs from existing AQA datasets in terms of granularity. For instance, datasets such as Fitness-28, UI-PRMD, and 3D-Yoga provide only action scores or ratings, whereas our dataset offers not only action scores but also fine-grained semantic annotations, including scoring details and body part information. Consequently, the absence of fine-grained semantic annotations in other datasets limits them to merely scoring or grading actions, precluding a comprehensive semantic-level action quality evaluation. To our knowledge, LLM-FMS is the first fine-grained dataset enabling LLMs to evaluate the quality of fitness actions.

Table 2. Comparison of existing rehabilitation and fitness datasets with LLM-FMS.

https://doi.org/10.1371/journal.pone.0313707.t002

LLM-based assessment framework

In this section, we systematically introduce the AQA framework based on LLMs, which facilitates fine-grained evaluations of users’ actions and offers targeted improvement suggestions. The overall architecture of our method is illustrated in Fig 4.

Problem definition.

Given an action image sequence S and the corresponding scoring rules R_S and R_k, the proposed framework is formulated as a classification problem, assessing actions with an LLM to generate fine-grained evaluations of action quality. This can be represented as follows:

k = C(S, R_S)        (1)

ŝ = L(F(k, R_k))        (2)

Here, S represents the action image sequence and R_S the threshold discriminant conditions required for action keyframe extraction. C represents the keypoint cosine-similarity matching function, which is used to extract the action keyframe k from the image sequence. R_k represents the scoring rules for keyframe actions, and F represents the scoring-indicator extraction function, which computes the specific scoring-indicator information of an action according to R_k. L represents a large language model: by embedding the scoring-indicator information extracted in the previous step into a prompt, and with the help of the LLM’s logical reasoning ability, the predicted score ŝ of the action is obtained. The ground-truth score of the action keyframe is denoted s.
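The two-stage formulation above can be sketched as a thin pipeline. The three callables stand in for keyframe matching (C), scoring-indicator extraction (F), and the LLM inference step (L); all of the stand-in implementations below are toy placeholders, not the paper's actual components.

```python
# Minimal sketch of the two-stage formulation: k = C(S, R_S), score = L(F(k, R_k)).
# C, F, and L are passed in as callables; the stand-ins below are toys used only
# to exercise the shape of the pipeline.

def assess(S, R_S, R_k, C, F, L):
    k = C(S, R_S)           # Eq. (1): select the keyframe from the image sequence
    indicators = F(k, R_k)  # compute angle/position indicators under the rules
    return L(indicators)    # Eq. (2): LLM infers the score from the indicators

# Toy stand-ins (placeholders, not the real components):
frames = [0.9, 0.2, 0.5]  # pretend per-frame distance to the standard pose
pick = lambda S, R: min(range(len(S)), key=lambda i: S[i])
measure = lambda k, R: {"frame": k, "angle_ok": True}
score = lambda ind: 3 if ind["angle_ok"] else 1
```

With these stand-ins, `assess(frames, None, None, pick, measure, score)` selects the frame closest to the standard pose and returns a score of 3.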

Construction of knowledge rule base.

Based on the FMS scoring rules [3] and consultations with FMS experts, the keyframes of the seven FMS movements and their corresponding scoring criteria have been incorporated into the FMS movement assessment knowledge base. The specific process is as follows:

(1) Action keyframe selection.

Typically, FMS assessment evaluates the whole action time series. However, in this study, we focus on action keyframes. Therefore, FMS experts were consulted to establish threshold conditions for the angles and distances of 15 movements, in line with traditional FMS scoring criteria, see S1 Text. Based on these threshold conditions, the selected keyframes for the 15 movements are illustrated in Fig 1. It is important to note that, to facilitate the calculation of angle and position information for subsequent visual tasks, side views were chosen for the deep squat, hurdle step, and inline lunge, while front views were used for the remaining movements.

(2) Action rule definition.

This study employs manually extracted features from FMS keyframes of skeletal key points to conduct a fine-grained assessment of movement quality. These features are designed based on domain-specific knowledge and expert experience. Despite advancements in automatic feature extraction methods using deep learning, manually extracted features remain valuable for analyzing human skeletal behavior.

FMS experts standardized the scoring criteria for each movement keyframe in accordance with FMS scoring guidelines. These criteria include joint position, angle, and distance features, which are used to grade movement quality. The definitions of these features are outlined as follows.

  • Position features represent the relative positional relationships between joints.
  • Joint angle features denote the angles formed either between adjacent joints or between a joint and a reference coordinate axis.
  • Distance features describe the spatial distances between joints.

Additionally, FMS experts refined the identification of body parts critical to the execution of different movements. They then re-evaluated the scoring (ranging from 1 to 3 points) for each movement keyframe based on the dataset by Xing et al. [17], as illustrated in Fig 3. For the deep squat, the experts clearly defined the standard angle range between the trunk and lower leg, as well as the required hip height, along with relevant movement criteria. Detailed rules for other movements are provided in S2 Text.

Keyframe extraction.

This study focuses on extracting skeletal features from action keyframes for fine-grained movement assessment. Accurate identification of keyframes from image sequences is essential. First, RTMPose is applied to extract the coordinates of 17 skeletal key points from each image in the action sequence, as shown in Fig 2. Next, the skeletal data are normalized, and the right hip joint point is selected as the origin of the skeleton coordinate system. The sum of the cosine angles between the vectors formed by the other key points and the origin is calculated to represent the human posture. Finally, by comparing the Euclidean distance between the standard action frame and the sum of the cosine angles for each frame in the sequence, the action keyframe is identified. The detailed calculation process is as follows:

The right hip joint is selected from the human skeleton coordinates of 17 key points as the origin of the skeletal coordinate system. Subsequently, the sum of the cosine angles between the vectors formed by the other key points and the origin is calculated as follows:

Θ = ∑_{i=1}^{16} cos θ_i        (3)

Similarly, by calculating the sum of the cosine angles for the skeletal points in the standard action keyframe, denoted as ΘN, the Euclidean distance between an image and the standard action keyframe is represented as follows:

D_j = |Θ_j − Θ_N|        (4)

Finally, the image with the smallest value is selected as the action keyframe, as follows:

k = argmin_j D_j        (5)

For all 15 movements, standard action keyframe images have been established. Using the cosine similarity calculation method described above, we identified 1,812 keyframes across the action sequences for these movements.
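A sketch of this selection step is given below, assuming 2D keypoints in COCO order with index 12 as the right hip and measuring each origin-to-keypoint vector's angle against the x-axis. Both the index convention and the reference axis are assumptions; RTMPose inference and normalization are omitted.

```python
import math

def cosine_angle_sum(keypoints, origin_idx=12):
    """Sum of cosines of the angles each origin->keypoint vector makes with the
    x-axis, after translating the skeleton so the right hip is the origin.
    (Index 12 as the right hip follows the COCO convention; the x-axis as the
    reference direction is an assumption.)"""
    ox, oy = keypoints[origin_idx]
    total = 0.0
    for i, (x, y) in enumerate(keypoints):
        if i == origin_idx:
            continue
        vx, vy = x - ox, y - oy
        norm = math.hypot(vx, vy)
        if norm > 0:
            total += vx / norm  # cos of the angle with the x-axis
    return total

def select_keyframe(sequence, theta_standard):
    """Pick the frame whose cosine-angle sum is closest to the standard pose
    (Eqs. (4)-(5): absolute difference, then argmin)."""
    dists = [abs(cosine_angle_sum(kps) - theta_standard) for kps in sequence]
    return min(range(len(dists)), key=dists.__getitem__)
```

Given a list of per-frame keypoint arrays and the precomputed Θ_N of a standard action keyframe, `select_keyframe` returns the index of the extracted keyframe.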

Action scoring indicator calculation.

Based on the scoring indicators for each movement established in Section 4.2, we calculate the scoring indicators for the user’s movements to facilitate subsequent assessments. RTMPose is employed to extract the key points from the keyframes and compute the corresponding evaluation metrics based on these indicators and the extracted key point coordinates, as illustrated in Fig 5.

Fig 4. LLM-based fine-grained FMS action evaluation process framework.

https://doi.org/10.1371/journal.pone.0313707.g004

Fig 5. Calculation of action score index.

Taking deep squat as an example.

https://doi.org/10.1371/journal.pone.0313707.g005

For example, for the deep squat, it is essential to calculate the angle (θ5-11, 13-15) between the trunk and the lower leg, which is defined by the vectors formed by key points 5 and 11 and key points 13 and 15. Additionally, the positional relationship between the hip joint and the knee joint must be assessed. This relationship is determined by comparing the y-coordinates of key points 11 and 13; specifically, if the y-coordinate of key point 11 is less than that of key point 13, the hip joint is positioned lower than the knee joint; otherwise, the hip joint is positioned higher. Similar calculations apply to other movements, where angles are computed directly. Positional relationships are evaluated in code from the key points, and the resulting positional information is output to support the LLM's subsequent reasoning.
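The deep squat indicators described above can be sketched as follows. The keypoint indices match the text (5: shoulder, 11: hip, 13: knee, 15: ankle, consistent with the COCO layout), but the y-up coordinate convention implied by the hip/knee comparison is an assumption here.

```python
import math

def angle_between(p1, p2, q1, q2):
    """Angle in degrees between vector p1->p2 and vector q1->q2."""
    v = (p2[0] - p1[0], p2[1] - p1[1])
    w = (q2[0] - q1[0], q2[1] - q1[1])
    dot = v[0] * w[0] + v[1] * w[1]
    cos_a = dot / (math.hypot(*v) * math.hypot(*w))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def deep_squat_indicators(kps):
    """Compute the two deep-squat indicators from the text: the trunk/lower-leg
    angle (key points 5-11 vs. 13-15) and whether the hip sits below the knee.
    Assumes a y-up coordinate system, per the text's stated convention."""
    trunk_shank_angle = angle_between(kps[5], kps[11], kps[13], kps[15])
    hip_below_knee = kps[11][1] < kps[13][1]
    return {"trunk_shank_angle": trunk_shank_angle,
            "hip_below_knee": hip_below_knee}
```

The returned dictionary is exactly the kind of indicator set that is later embedded into the LLM prompt.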

LLM prompt generation and AQA.

Following the processing steps outlined in Sections 4.2 to 4.4, achieving fine-grained FMS assessment with LLMs also requires generating an appropriate prompt for each movement. The design of prompts plays a critical role in the use of LLMs, as it directly influences the quality and relevance of the model’s output.

In this study, for each movement, prompts are generated based on expert-defined scoring rules and the output of angle or positional information from the user’s keyframes. The specific content of these prompts is provided in S3 Text. Once the prompt is generated, it is input into the LLM, which then produces a fine-grained score and offers improvement suggestions for the movement.
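A minimal sketch of assembling such a prompt from the expert rules and the measured indicators is shown below. The template wording and rule text here are illustrative only; the study's actual prompts are provided in S3 Text.

```python
def build_prompt(movement, rules, indicators):
    """Assemble an evaluation prompt from expert scoring rules and the measured
    keyframe indicators. The template wording is illustrative; the study's
    actual prompts appear in S3 Text."""
    lines = [
        f"You are an FMS expert. Score the following {movement} keyframe (1-3).",
        "Scoring rules:",
    ]
    lines += [f"- {r}" for r in rules]
    lines.append("Measured indicators:")
    lines += [f"- {name}: {value}" for name, value in indicators.items()]
    lines.append("Give the score and a detailed rationale with improvement advice.")
    return "\n".join(lines)

# Illustrative rule and indicator values (not taken from the paper):
rules = ["Score 3 if the trunk is parallel to the tibia and the hips drop below the knees."]
indicators = {"trunk_shank_angle": 12.5, "hip_below_knee": True}
prompt = build_prompt("deep squat", rules, indicators)
```

The resulting string is passed to the LLM, whose response contains both the score and the rationale.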

Results

Evaluation metrics

Building on previous work [10–13], we evaluate the model’s performance on FMS movements using predicted score accuracy, macro-averaged F1 (maF1), and the Kappa index. Additionally, we assess the interpretability of the proposed framework for fine-grained movement quality assessment.

Scoring accuracy.

Scoring accuracy quantifies the alignment between the predicted movement scores generated by the framework and the ground truth labels. In this study, scoring accuracy is calculated using the following formula:

Accuracy = N_correct / N_total        (6)

where N_correct is the number of samples for which the model’s prediction is correct, that is, the number of samples for which the model’s predicted action score is consistent with the expert rating, and N_total is the total number of FMS samples participating in the evaluation.

Macro-averaged F1 (maF1).

maF1 is employed to assess the accuracy of multi-class classification problems. In such scenarios, maF1 first computes the F1 score for each category and subsequently averages these scores. Specifically, if there are C categories, with each category’s F1 score denoted as F1_i, the maF1 score is calculated as follows:

maF1 = (1/C) ∑_{i=1}^{C} F1_i        (7)

Kappa coefficient (Cohen’s Kappa).

The Kappa coefficient measures consistency and serves as an indicator of accuracy. Unlike scoring accuracy, the Kappa coefficient accounts for model bias; specifically, when the sample sizes across categories are unbalanced, the model may disproportionately favor larger categories while neglecting smaller ones. The formula for calculating the Kappa coefficient is as follows:

κ = (p_o − p_e) / (1 − p_e)        (8)

where p_o represents the observed consistency, defined as the proportion of movement scores predicted by the model that match the true scores; this corresponds to the overall classification accuracy. Conversely, p_e denotes the probability of random consistency, i.e., the expected proportion of agreement between predicted and true scores under chance. The formula for calculating p_e is as follows:

p_e = (1/N²) ∑_i n_i^true · n_i^pred        (9)

where n_i^true represents the number of true samples in category i, n_i^pred denotes the number of predicted samples in category i, and N signifies the total sample size.
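All three metrics can be computed directly from the label counts; a self-contained sketch (standard definitions, no external dependencies) follows.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Eq. (6): fraction of predictions matching the expert rating."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Eq. (7): per-class F1 scores averaged over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def cohens_kappa(y_true, y_pred):
    """Eqs. (8)-(9): observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    ct, cp = Counter(y_true), Counter(y_pred)
    p_e = sum(ct[c] * cp[c] for c in set(ct) | set(cp)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For a perfect predictor all three metrics equal 1.0; class imbalance lowers κ relative to accuracy, which is why both are reported.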

Implementation and result analysis

Experimental setup.

We initially employed RTMPose to extract human skeletons from FMS image sequences. Subsequently, we calculated the cosine similarity distance of each image against the standard action keyframe, identifying the image with the minimum distance as the keyframe. Following this, we computed the action scoring indicators for the keyframe based on the scoring rules established by experts. Finally, these indicators were integrated into the prompt to provide fine-grained scoring and explanatory feedback for the FMS movement. The overall framework of this study is illustrated in Fig 4.

Method comparison.

To measure the performance of the framework, we extracted the action scores output by the framework (S1 Code). Table 3 reports the performance of the proposed framework on each action. Table 4 then compares the framework against FMS evaluation methods from other studies. Overall, our proposed framework outperforms these methods. Moreover, this study not only provides action ratings but also offers a fine-grained interpretation of those ratings.

Table 3. Assessment performance of the framework on each movement.

https://doi.org/10.1371/journal.pone.0313707.t003

Table 4. Comparison of the framework with existing FMS assessment methods.

https://doi.org/10.1371/journal.pone.0313707.t004

Table 4 presents a performance comparison between the automated FMS evaluation method utilizing LLMs and other FMS evaluation methods based on machine learning and deep learning from extant studies, focusing on coarse-grained scoring. Our framework achieves an accuracy of 0.91, the highest among all compared methods, signifying a clear advantage in accurately scoring FMS actions. The maF1 score and Kappa coefficient of our framework are 0.87 and 0.82, respectively, slightly lower than those of the method by Lin et al. [11] and the dual-stream network [13], suggesting that those methods may handle imbalanced data or multi-class scoring tasks somewhat better. The findings indicate that, in comparison to deep learning-based action quality assessment methods, the LLM-based method, which integrates domain knowledge and expert rules, demonstrates comparable performance.

Discussion

Dataset

The granularity of a dataset, that is, the level of detail in its annotation, significantly impacts research and application in the domain of AQA. Fine-grained datasets offer significant benefits in AQA, as they provide more comprehensive annotation details, thereby enhancing the precision of evaluations and the interpretability of the models. Constrained by the granularity of existing datasets, the majority of current investigations into AQA remain focused on estimating action scores or action levels [23], and few studies carry out fine-grained AQA.

Coarse-grained datasets are relatively straightforward to collect and label. Currently, coarse-grained datasets can be categorized into three distinct groups based on their ground truth: standard action similarity evaluation, grade evaluation, and regression-based score evaluation. Within the datasets presented in Table 2, the ground truth for Fitness-28 [18] consists of canonical action samples performed by professional coaches. This dataset comprises 28 types of fitness actions, with a sample size totaling 7,530 action videos, and a depth camera is utilized to acquire depth data of the actions. Employing such datasets, actions are assessed in a coarse-grained manner through action similarity evaluation. The ground truth for UI-PRMD [21] is a grade evaluation: this dataset contains 10 healthy subjects, each repeatedly executing 10 distinct physical therapy actions 10 times, accumulating 1,000 action samples. These action samples are divided into two classifications: standard actions and non-standard actions. The ground truth of 3D-Yoga [22] is a regression score: this dataset contains 117 types of yoga actions, with over 3,792 action samples and 16,668 keyframes in total. This dataset provides hierarchical category labels and quality score labels for each action sample, adjudicated by three experienced yoga instructors in accordance with standardized yoga pose difficulty coefficients and completion scoring criteria developed by the instructors.

LLM-FMS falls within the category of graded evaluation. In accordance with the FMS scoring guidelines, experts assessed the quality of action execution in keyframes with scores ranging from 1 to 3, encompassing three distinct levels; the higher the score, the higher the quality of the subject’s action completion in the keyframe. Furthermore, experts were engaged to provide detailed annotations of action scoring features in adherence to FMS action evaluation protocols, as well as annotations indicating the body parts associated with the scoring features. Leveraging these two types of detailed semantic annotations, this study conducted an automated, detailed FMS evaluation with the aid of the contextual semantic reasoning capabilities of LLMs, addressing the limitation that prior research could only assign action scores.

Constructing fine-grained datasets is an intricate process encompassing numerous challenges associated with data acquisition, processing, labeling, and quality assurance. The initial phase in building a dataset entails gathering action data. Currently, there are principally two avenues for acquiring raw data. One involves harvesting data from online media platforms, which is common in the field of sports competitions, as with FineGym [24] and FineDiving [23]. The alternative involves enlisting subjects to execute actions as required, which is common in sports fitness, physical rehabilitation, and similar fields; our dataset belongs to this latter category. This approach is not without its challenges, including the substantial expense of participant recruitment, the presence of participants who may lack the necessary qualifications, the complexity of managing the experimental environment, and the stringent requirements of ethical scrutiny. To safeguard the integrity of data acquisition, we engaged FMS specialists to preselect eligible candidates and established a dedicated testing site equipped specifically for data gathering. The second step is defining fine-grained semantic labels; our study proposes a two-tier set of fine-grained semantic labels with reference to the FMS evaluation rules. The third step is annotating the data. Annotating data for AQA requires individuals with an adequate level of expertise rather than crowdsourcing platforms, so we engaged FMS experts to conduct the annotation, ensuring the rigor of the annotation process. Finally, because fine-grained annotations are more error-prone, we employed double checking to ensure annotation quality.

Comparison of methods

Deep learning-based methods, with their powerful feature extraction and end-to-end learning capabilities, can automatically learn complex action features from data without manual feature design. However, the intricate and opaque internal decision-making mechanisms of these methods hinder their interpretability. The methodologies presented in [11] and [13] utilize I3D networks to extract action features from videos, facilitating subsequent feature learning and scoring. The MVDNN model introduced in [12] integrates deep learning features with manually crafted features to extract FMS action characteristics from multi-view and multi-modal action skeleton data.

Conversely, our framework, which leverages domain knowledge and the contextual learning and logical reasoning capabilities of LLMs, demonstrates robust performance in the coarse-grained automated FMS scoring task. Furthermore, our framework provides fine-grained semantic interpretation, enabling the reasoning of scoring nuances and the identification of body parts that significantly contribute to the assessment, in accordance with established rules. It is important to note that this paper’s analysis is based on keyframe image datasets, unlike other studies that have employed video datasets, potentially leading to discrepancies in comparative results.
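The core of this pipeline, converting extracted skeletal keypoints into rule-grounded prompt text for the LLM, can be sketched as follows. The joint-angle feature, the rule wording, and the prompt template are simplified illustrations and do not reproduce the exact thresholds or prompts given in S1-S3 Text:

```python
import math

def joint_angle(a, b, c):
    """Angle ABC in degrees from three 2D keypoints (e.g., hip-knee-ankle)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def build_prompt(action, features, rules):
    """Assemble an LLM prompt from expert rules and measured features."""
    lines = [f"Task: score the FMS action '{action}' from 1 to 3.",
             "Scoring rules:"]
    lines += [f"- {r}" for r in rules]
    lines.append("Measured skeletal features:")
    lines += [f"- {name}: {value:.1f} deg" for name, value in features.items()]
    lines.append("Give the score and explain which body parts drove it.")
    return "\n".join(lines)

# Illustrative normalized (x, y) keypoints and a simplified deep-squat rule.
hip, knee, ankle = (0.52, 0.48), (0.55, 0.62), (0.54, 0.80)
features = {"knee_flexion": joint_angle(hip, knee, ankle)}
rules = ["Score 3 requires the femur to drop below horizontal."]
prompt = build_prompt("deep squat", features, rules)
print(prompt)
```

Because the measured features and the rules they are judged against both appear verbatim in the prompt, the LLM's rationale can cite them directly, which is what gives the framework its body-part-level interpretability.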

Limitations

To align with the logical reasoning capabilities of existing LLMs, the dataset proposed in this paper consists of action keyframe images. Future research can leverage the video processing capabilities of LLMs for fine-grained assessments based on full FMS videos. It is important to note that fine-grained annotations necessitate manual decomposition and professional marking. Another limitation is that the prompt design relies on expert experience, and different prompting strategies and LLMs can lead to significant performance differences.

Future work

In the future, we will leverage the contextual learning and logical reasoning abilities of LLMs to extend fine-grained action quality assessment to a more diverse range of sports. In addition, multi-modal action evaluation can be explored by exploiting LLMs' ability to understand action behavior in images and videos.

Conclusion

In this paper, we propose the first fine-grained FMS action keyframe dataset, LLM-FMS, designed for the assessment of FMS movement quality using LLMs. To enhance the interpretability of the assessment framework, the dataset includes scoring detail annotation labels and body part information labels. Furthermore, we propose a fine-grained FMS quality assessment framework based on LLMs, leveraging their logical reasoning capabilities to improve the semantic interpretability of movement quality assessments. This approach renders the reasoning process more transparent and achieves significant advancements over existing AQA methods, moving beyond simple movement scoring or categorical grading.

Publicly accessible data

This work presents a fine-grained FMS dataset, which can be accessed via the figshare repository: https://doi.org/10.6084/m9.figshare.c.7601630.v1.

Supporting information

S1 Table. Demographic information of the subjects.

Age, sex, height, weight, and BMI of the subjects.

https://doi.org/10.1371/journal.pone.0313707.s001

(XLSX)

S1 Text. Threshold conditions for angle and distance of FMS action keyframes.

https://doi.org/10.1371/journal.pone.0313707.s002

(PDF)

S2 Text. Detailed rules for FMS movements.

https://doi.org/10.1371/journal.pone.0313707.s003

(PDF)

S3 Text. LLMs’ Prompt for FMS movements.

https://doi.org/10.1371/journal.pone.0313707.s004

(PDF)

S1 Code. Framework performance evaluation scripts.

https://doi.org/10.1371/journal.pone.0313707.s005

(PY)

Acknowledgments

The authors extend their gratitude to the numerous colleagues, students, and library and faculty staff who contributed to the FMS action sample collection in the service of this project. Additionally, appreciation is expressed to the Intelligent Sports Engineering Laboratory for furnishing the experimental facilities essential for this study. In particular, the authors extend their gratitude to Dr. Xuemei Li, Dr. Dapeng Bao, and Mr. Peng Zhang for their substantial contributions to the annotation of the dataset.

References

  1. Hopkins WG, Marshall SW, Quarrie KL, Hume PA. Risk factors and risk statistics for sports injuries. Clin J Sport Med. 2007;17(3):208–10. pmid:17513914
  2. Kallinen M, Markku A. Aging, physical activity and sports injuries. An overview of common sports injuries in the elderly. Sports Med. 1995;20(1):41–52. pmid:7481278
  3. Cook G, Burton L, Hoogenboom BJ, Voight M. Functional movement screening: the use of fundamental movements as an assessment of function, part 1. Int J Sports Phys Ther. 2014;9(3):396.
  4. Letafatkar A, Hadadnezhad M, Shojaedin S, Mohamadi E. Relationship between functional movement screening score and history of injury. Int J Sports Phys Ther. 2014;9(1):21–7. pmid:24567852
  5. Bennett H, Davison K, Arnold J, Martin M, Wood S, Norton K. Reliability of a movement quality assessment tool to guide exercise prescription (MovementScreen). Int J Sports Phys Ther. 2019;14(3):424–35. pmid:31681501
  6. Łyp M, Rosiński M, Chmielewski JP, Czarny-Działak MA, Osuch M, Urbańska D, et al. Effectiveness of the functional movement screen for assessment of injury risk occurrence in football players. Biol Sport. 2022;39(4):889–94. pmid:36247940
  7. Venek V, Kranzinger S, Schwameder H, Stöggl T. Human movement quality assessment using sensor technologies in recreational and professional sports: a scoping review. Sensors (Basel). 2022;22(13):4786. pmid:35808282
  8. Ressman J, Grooten WJA, Rasmussen-Barr E. Visual assessment of movement quality: a study on intra- and interrater reliability of a multi-segmental single leg squat test. BMC Sports Sci Med Rehabil. 2021;13(1):66. pmid:34099021
  9. Whiteside D, Deneweth JM, Pohorence MA, Sandoval B, Russell JR, McLean SG, et al. Grading the functional movement screen: a comparison of manual (real-time) and objective methods. J Strength Cond Res. 2016;30(4):924–33. pmid:25162646
  10. Hong R, Xing Q, Shen Y, Shen Y. Effective quantization evaluation method of functional movement screening with improved Gaussian mixture model. Appl Sci. 2023;13(13):7487.
  11. Lin X, Huang T, Ruan Z, Yang X, Chen Z, Zheng G. Automatic evaluation of functional movement screening based on attention mechanism and score distribution prediction. Mathematics. 2023;11(24):4936.
  12. Shen Y-Y, Xing Q-J, Shen Y-F. Markerless vision-based functional movement screening movements evaluation with deep neural networks. iScience. 2023;27(1):108705. pmid:38222112
  13. Lin X, Chen R, Feng C, Chen Z, Yang X, Cui H. Automatic evaluation method for functional movement screening based on a dual-stream network and feature fusion. Mathematics. 2024;12(8):1162.
  14. Zhao Q, Wang S, Zhang C, Fu C, Do M, Agarwal N. AntGPT: can large language models help long-term action anticipation from videos? 2023.
  15. Joublin F, Ceravola A, Smirnov P, Ocker F, Deigmoeller J, Belardinelli A, et al. CoPAL: corrective planning of robot actions with large language models. In: 2024 IEEE International Conference on Robotics and Automation (ICRA); 2024.
  16. MMPose Contributors. OpenMMLab pose estimation toolbox and benchmark. 2020.
  17. Xing Q-J, Shen Y-Y, Cao R, Zong S-X, Zhao S-X, Shen Y-F. Functional movement screen dataset collected with two Azure Kinect depth sensors. Sci Data. 2022;9(1):104.
  18. Li J, Hu Q, Guo T, Wang S, Shen Y. What and how well you exercised? An efficient analysis framework for fitness actions. J Vis Commun Image Represent. 2021;80:103304.
  19. Aguilar-Ortega R, Berral-Soler R, Jiménez-Velasco I, Romero-Ramírez FJ, García-Marín M, Zafra-Palma J, et al. UCO physical rehabilitation: new dataset and study of human pose estimation methods on physical rehabilitation exercises. Sensors. 2023;23(21):8862.
  20. Capecci M, Ceravolo MG, Ferracuti F, Iarlori S, Monteriù A, Romeo L, et al. The KIMORE dataset: kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation. IEEE Trans Neural Syst Rehabil Eng. 2019;27(7):1436–48.
  21. Vakanski A, Jun H-P, Paul D, Baker R. A data set of human body movements for physical rehabilitation exercises. Data. 2018;3(1):2.
  22. Li J, Hu H, Li J, Zhao X. 3D-Yoga: a 3D yoga dataset for visual-based hierarchical sports action analysis. In: Proceedings of the Asian Conference on Computer Vision; 2022.
  23. Xu J, Rao Y, Yu X, Chen G, Zhou J, Lu J. FineDiving: a fine-grained dataset for procedure-aware action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
  24. Shao D, Zhao Y, Dai B, Lin D. FineGym: a hierarchical video dataset for fine-grained action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.