Automatic assessment of laparoscopic surgical skill competence based on motion metrics

The purpose of this study was to characterize the motion features of surgical devices associated with laparoscopic surgical competency and build an automatic skill-credential system in porcine cadaver organ simulation training. Participants performed tissue dissection around the aorta, dividing vascular pedicles after applying Hem-o-lok (tissue dissection task) and parenchymal closure of the kidney (suturing task). Movements of surgical devices were tracked by a motion capture (Mocap) system, and Mocap-metrics were compared according to the level of surgical experience (experts: ≥50 laparoscopic surgeries, intermediates: 10–49, novices: 0–9), using the Kruskal-Wallis test and principal component analysis (PCA). Three machine-learning algorithms: support vector machine (SVM), PCA-SVM, and gradient boosting decision tree (GBDT), were utilized for discrimination of the surgical experience level. The accuracy of each model was evaluated by nested and repeated k-fold cross-validation. A total of 32 experts, 18 intermediates, and 20 novices participated in the present study. PCA revealed that efficiency-related metrics (e.g., path length) significantly contributed to PC 1 in both tasks. Regarding PC 2, speed-related metrics (e.g., velocity, acceleration, jerk) of right-hand devices largely contributed to the tissue dissection task, while those of left-hand devices did in the suturing task. Regarding the three-group discrimination, in the tissue dissection task, the GBDT method was superior to the other methods (median accuracy: 68.6%). In the suturing task, SVM and PCA-SVM methods were superior to the GBDT method (57.4 and 58.4%, respectively). Regarding the two-group discrimination (experts vs. intermediates/novices), the GBDT method resulted in a median accuracy of 72.9% in the tissue dissection task, and, in the suturing task, the PCA-SVM method resulted in a median accuracy of 69.2%. 
Overall, the Mocap-based credential system using machine-learning classifiers provides a correct judgment rate of around 70% (two-group discrimination). Together with motion analysis and wet-lab training, simulation training could be a practical method for objectively assessing the surgical competence of trainees.


Introduction
The traditional apprenticeship model of surgical education, "see one, do one, teach one", has become less acceptable. With (1) the widespread dissemination of laparoscopic and robotic surgeries that necessitate specific surgical skills, (2) regulation of working hours, and (3) social demand for safer surgery, laboratory-based skill training has been utilized in a wide range of surgical disciplines. In the authors' previous study, a low-cost wet-lab model using cadaveric swine organs, including tissue dissection around the aorta and renal parenchymal closure, was developed, and the training drills showed good construct validity [1]. Furthermore, a novel motion capture (Mocap)-based measurement system consisting of 6 infrared cameras was developed. This system simultaneously tracked the movements of multiple surgical instruments and identified motion characteristics according to the level of laparoscopic surgical experience in wet-lab training [2]. For example, in a tissue dissection task, a shorter path length and faster velocity/acceleration/jerk were observed for scissors and a Hem-o-lok applier in experts (≥50 laparoscopic surgeries), and experts with ≥100 cases moved the scissors more frequently in the close zone (0 to <2 cm from the aorta) than those with 50–99 cases [3].
To ensure that trainees are ready to perform surgery, skills assessments are becoming more important. They are traditionally performed manually by observing training tasks on site or on video footage according to global skill assessment tools, such as the "Objective Structured Assessment of Technical Skills (OSATS)" or "Global Operative Assessment of Laparoscopic Skills (GOALS)", which usually markedly increases the workload of mentors [4,5]. Regarding automated assessment, several studies reported promising results. For example, Allen B et al. reported a study in which 30 participants (4 experts and 26 novices) performed three drills (peg transfer, pass rope, and cap needle), instrument movements were captured by two electromagnetic sensors, and support vector machines (SVMs) yielded competency-prediction rates of >90% based on the motion metrics [6]. However, prior studies utilized very simple training drills such as "peg transfer", "pattern cutting", or "suturing" that involved artificial materials. Ideally, more complex drills should be included in credential processes before trainees perform actual surgery.
In the present study, in order to gain further insight into movement features of experts, data collection was expanded to include laparoscopic surgeons other than urologic surgeons. Using motion metrics of surgical instruments and several machine-learning techniques, we aimed to automatically assess surgical competence in porcine cadaver organ simulation training.

Materials and methods
This study was approved by the Ethical Review Board for Life Science and Medical Research, Hokkaido University Hospital (No. 018-0257). The initial results based on the current Mocap-based measurement system among urologic surgeons, a junior trainee, and medical students (1st data collection: n = 45, between December 2018 and February 2019) have been published [3]. In order to gain further insights into the Mocap characteristics of experts and develop an efficient training model, the measurement experiments were extended to include general and gynecologic surgeons (2nd data collection: between the end of May 2019 and September 2019).
In the 2nd data collection, participants performed tissue dissection around a porcine aorta (Task 1) and renal parenchymal closure (Task 3). Needle driving in the renal parenchyma (Task 2) was not performed because, in the authors' previous study, the Mocap metrics stratified by the level of surgical experience showed similar outcomes between Tasks 2 and 3 [3]. Overall, a total of 70 participants voluntarily took part in 89 training sessions of Tasks 1 and 3 during the total study period (19 participants overlapped between the 1st and 2nd data collections). Written informed consent was obtained regarding the use of their data for research.
The details of the present training tasks were previously reported [3]. In brief, porcine cadaveric organs were placed in a training box (Endowork ProII®, Kyoto Kagaku, Japan). During the training, one of the 4 authors (TA, MH, JF, and NI) assumed the role of a scopist, using an endoscopic camera system (VISERA Pro Video System Center OTV-S7Pro, Olympus, Japan). In Task 1, participants were asked to complete tissue dissection around the aorta, dividing encountered mesenteric vessels after applying Hem-o-lok. In Task 3, using a 15-cm 2-0 CT-1 VICRYL® thread, participants were asked to make three square single-throw knots at 2 different sites on a kidney. All training was video-recorded for later analyses. Demographic data and prior experience of laparoscopic surgeries were also collected after the training.

Motion capture analysis
The details of the present Mocap-based measurement system were previously published [2]. In brief, the measurement system simultaneously tracked multiple surgical instruments by 6 infrared cameras (OptiTrack Prime 41, NaturalPoint Inc., USA). Infrared reflective marker sets, each with a different pattern, were attached to the handle of each instrument so that the instruments could be traced individually regardless of instrument exchanges. The tip trajectory was calculated based on the position of the tip and handle. In order to reduce noise, the trajectory of the instrument tip ($x_i$, $y_i$, and $z_i$) was smoothed with a Savitzky-Golay filter [7], and its derivatives ($d^j x_i/dt^j$, $d^j y_i/dt^j$, and $d^j z_i/dt^j$; $j = 1$ to 3) were also calculated by the filter. In the 2nd data collection, in order to measure the grasping force and position of grasping forceps, grasping forceps with strain gauges were utilized in Task 1, although this was not a focus of the current study.
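The smoothing and differentiation step described above can be sketched with SciPy's `savgol_filter`. This is only an illustration under stated assumptions: the window length (11 samples) and polynomial order (3) are hypothetical choices not given in the text, the 30 Hz sampling rate is taken from the Discussion, and the trajectory is synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter

FS_HZ = 30.0    # sampling rate of the tracking data mentioned in the Discussion
WINDOW = 11     # ASSUMPTION: window length in samples (must be odd)
POLYORDER = 3   # ASSUMPTION: polynomial order (must be >= highest derivative, 3)


def smooth_and_differentiate(tip_xyz, fs_hz=FS_HZ, window=WINDOW, polyorder=POLYORDER):
    """Smooth an (N, 3) tip trajectory and return its 1st to 3rd time derivatives."""
    dt = 1.0 / fs_hz
    names = ["position", "velocity", "acceleration", "jerk"]
    # deriv=j asks the filter for the j-th derivative of the fitted polynomial;
    # delta=dt converts "per sample" into "per second".
    return {
        name: savgol_filter(tip_xyz, window, polyorder, deriv=j, delta=dt, axis=0)
        for j, name in enumerate(names)
    }


# Synthetic trajectory (not study data): x = t^2, y = sin(t), z = 0.5 t
t = np.arange(0.0, 2.0, 1.0 / FS_HZ)
tip = np.column_stack([t**2, np.sin(t), 0.5 * t])
kin = smooth_and_differentiate(tip)

# Example summary metric: mean speed (norm of the velocity vector) over the task
mean_speed = float(np.linalg.norm(kin["velocity"], axis=1).mean())
```

Because the filter fits a local polynomial, derivatives come from the same fit as the smoothed position, which avoids amplifying noise through repeated finite differencing.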

Analysis and statistics
In order to characterize the motion features of surgical devices associated with laparoscopic surgical competency, motion metrics that represent kinematic features of the surgical instruments were calculated. S1 Table summarizes the definitions of the Mocap metrics. In addition to the metrics already reported in the authors' previous study [3], 10 metrics were newly calculated, based on hypotheses generated during the authors' video review process and previous papers: bimanual dexterity (BD), ratio of frequency of opening/closing both forceps (ROB), ratio of path length for both hands (RPLB), average distance between both forceps when opening/closing (ADBO), average distance between both forceps (ADB), depth path length (DPL), depth velocity (DV), average gripper rotation angle (AGRA), average attitude angle (Roll, Pitch, Yaw), angular length (AL, Roll or Pitch/Yaw), and working area (WA). For example, regarding ADB, the authors hypothesized that the distance would be smaller in experts in the suturing/knotting task because of their efficient movements. Because of the good depth perception of experts, it was hypothesized that DPL would be shorter and DV faster in experts in both tasks. Regarding WA, it was hypothesized that the WA of experts would be smaller than that of novices because experts manipulate surgical devices in the area close to the targets in both tasks. Regarding AL-Roll, the hypothesis was that the sum of changes in the attitude angle of an instrument around the sheath axis would be smaller in experts due to better visual spatial ability.
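As a rough illustration of how efficiency-related metrics of this kind can be computed from a sampled tip trajectory, the sketch below uses only the Python standard library. The exact definitions are given in S1 Table, which is not reproduced here, so the formulas are assumptions: path length as the summed distance between consecutive samples, DPL as the summed absolute displacement along a depth axis (assumed to be the third coordinate), and DV as DPL divided by the task duration. The sample points are hypothetical.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive (x, y, z) samples."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def depth_path_length(points, depth_axis=2):
    """ASSUMED definition of DPL: summed absolute displacement along the depth axis."""
    return sum(abs(q[depth_axis] - p[depth_axis]) for p, q in zip(points, points[1:]))


def mean_depth_velocity(points, fs_hz=30.0, depth_axis=2):
    """ASSUMED definition of DV: depth path length divided by task duration."""
    duration_s = (len(points) - 1) / fs_hz
    return depth_path_length(points, depth_axis) / duration_s


# Hypothetical tip positions in cm (not study data)
tip = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 2.0)]
pl = path_length(tip)           # 5 cm in-plane move + 2 cm depth move
dpl = depth_path_length(tip)    # only the second move changes depth
dv = mean_depth_velocity(tip)   # 2 cm over 2 sample intervals at 30 Hz
```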
The following is an outline of the present analyses: 1. Mocap metrics were compared according to the number of previous laparoscopic surgical cases (experts: ≥50 surgeries, intermediates: 10–49, novices: 0–9). The Kruskal-Wallis test was utilized to evaluate differences among the three groups. If differences among the three groups were significant, the Mann-Whitney U test was also utilized for pairwise comparisons.
2. Using the Mocap metrics with significant differences among the 3 abovementioned groups, principal component analysis (PCA), a data reduction technique, was performed in order to intuitively identify the motion characteristics associated with surgical competency.
3. Finally, three machine-learning algorithms, support vector machine (SVM), principal component analysis-SVM (PCA-SVM), and gradient boosting decision tree (GBDT), were utilized for discrimination of the surgical experience level based on Mocap metrics. The details of these algorithms are described in S2 Table. Before inputting all Mocap indices to these algorithms, robust Z-score normalization was conducted to scale the data while reducing the effects of outliers. The robust Z-score, $z_i$, for data $x_i$ is calculated as follows:
$$z_i = \frac{x_i - x_m}{\mathrm{NIQR}}$$
Here, $x_m$ is the median of the data $x$, and NIQR is the normalized interquartile range, calculated as NIQR = 0.7414 × IQR (IQR = interquartile range). The model was validated using nested and repeated k-fold cross-validation, which combines nested k-fold cross-validation and repeated cross-validation. This method enables robust verification that is less affected by the randomness of data splitting. S1 Fig shows the data flow of the validation process. All procedures related to machine learning were done using scikit-learn, a machine-learning library for Python [8]. The machine-learning library "LightGBM" was also used to build the GBDT model [9]. The accuracies of the machine-learning models were compared by Friedman's test. The Wilcoxon signed rank sum test was also utilized to assess differences in pairwise comparisons. Friedman's test and the Wilcoxon signed rank sum test were performed using JMP 14 (SAS, Japan), and PCA was performed using R (Ver. 3.6.0).
Table 1 shows a summary of participants' backgrounds. Urologic surgeons were dominant (n = 45), followed by gastroenterological surgeons (n = 9), medical students (n = 9), junior residents (n = 4), and gynecologic surgeons (n = 3). The numbers of participants by experience of laparoscopic surgery were: 0–9 cases: n = 20, 10–49: n = 18, 50–99: n = 7, 100–499: n = 18, and ≥500: n = 7. As described above, 19 participants joined the training multiple times, which resulted in a total of 89 training sessions.
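The robust Z-score normalization described in the analysis outline (median-centered, scaled by NIQR = 0.7414 × IQR) can be sketched with the Python standard library as follows. The quartile interpolation method (`method="inclusive"`) is an assumption, since the text does not say how the IQR was computed, and the input values are hypothetical.

```python
import statistics

NIQR_FACTOR = 0.7414  # constant quoted in the text


def robust_z_scores(values):
    """Return (x - median) / (0.7414 * IQR) for every x in values."""
    # ASSUMPTION: quartiles by linear interpolation including both end points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    niqr = NIQR_FACTOR * (q3 - q1)
    if niqr == 0:
        raise ValueError("IQR is zero; the robust Z-score is undefined")
    return [(x - median) / niqr for x in values]


# Hypothetical metric values with one outlier (not study data)
metric = [1, 2, 3, 4, 5, 6, 7, 8, 100]
z = robust_z_scores(metric)
```

Unlike the ordinary Z-score, the outlier (100) does not inflate the scale: the median and IQR are computed from the bulk of the data, so the remaining values keep a sensible spread while the outlier simply receives a large score.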

Results
S3 Table summarizes Mocap metrics according to previous surgical experience. Overall, there were significant differences among the three groups in speed-related metrics, including velocity, acceleration, and jerk, for the scissors, Hem-o-lok clip applier, and bilateral needle holders, and in efficiency-related metrics, including the operative time and path length, for all devices. These observations were in line with our previous study. Regarding the new metrics, for example, BD (in both tasks), DPL of the grasping forceps, scissors, Hem-o-lok, and right/left needle holders, and AL-Pitch/Yaw of the grasping forceps, scissors, clip applier, and right/left needle holders showed significant differences among the 3 groups.
In Task 1, efficiency-related metrics (e.g., ROB, G_DPL, G_AL-Roll, and S_DPL) significantly contributed to PC1, and speed-related parameters of surgical devices manipulated by the right hand (e.g., S_v, C_v, S_a, C_a, S_j, C_j, and S_DV) contributed to PC2.

PLOS ONE
For PC3, PC4, and PC5, ROB (loading coefficient = 1.99), BD (loading coefficient = 2.08), and S_Working area (loading coefficient = -0.92) largely contributed to each PC, respectively. In Task 3 (Fig 2B), the operative time (loading coefficient = 1.57), R_PL (loading coefficient = 1.32), L_PL (loading coefficient = 1.49), R_DPL (loading coefficient = 1.39), and L_DPL (loading coefficient = 1.58) largely contributed to PC1, which showed a significant contribution of efficiency-related parameters of both needle holders. Regarding PC2, L_j (loading coefficient = 0.93), L_a (loading coefficient = 0.84), L_High (loading coefficient = 0.81), L_v (loading coefficient = 0.78), and R_j (loading coefficient = 0.72) largely contributed, which showed a significant contribution of speed-related parameters, especially those of the left needle holder.
Table 2 shows the performance of each classifier under repeated and nested cross-validation, and comparative results for the three methods. Regarding the three-group discrimination (experts vs. intermediates vs. novices), in Task 1, the GBDT method was superior to the other methods, with a median accuracy of 68.6% (Fig 3A and Table 2A). In Task 3, the SVM and PCA-SVM methods were superior to the GBDT method (median accuracy of 57.4 and 58.4%, respectively; Fig 3B and Table 2A). There was no significant difference between SVM and PCA-SVM. Regarding the two-group discrimination (experts vs. intermediates/novices), the GBDT method resulted in a median accuracy of 72.9% in Task 1 (Fig 4A and Table 2B), and the PCA-SVM method resulted in a median accuracy of 69.2% in Task 3 (Fig 4B and Table 2B).

Discussion
In order to gain further insight into the movement features of expert surgeons and automated skill assessment, data collection was extended to include urologic, gastroenterological, and gynecological surgeons who regularly performed laparoscopic surgery. As previously reported, the strength of this Mocap-based measurement system compared with previous studies is that all surgical devices can be tracked because of the arrangement of infrared reflective markers, and, therefore, the proposed model can be utilized in complex training tasks that require a range of surgical skills. In the present study, additional motion metrics not included in the previous study were newly calculated. According to past reports [10,11], BD was newly calculated, and it showed significant differences among the three groups in both tasks. DPL (grasping forceps, scissors, right/left needle holders) and DV (scissors, Hem-o-lok, right/left needle holders) also showed significant differences, in line with the hypothesis that the good depth perception of experts results in shorter DPL and faster DV. In terms of AL-Roll and AL-Pitch/Yaw, which represent the sum of changes in the attitude angle of an instrument expressed as Euler angles, AL-Roll showed significant differences in the grasping forceps and scissors, and AL-Pitch/Yaw in the grasping forceps, scissors, Hem-o-lok, and right/left needle holders. Because AL-Pitch/Yaw is the sum of the angle changes of a surgical device in the vertical plane, a large AL-Pitch/Yaw is considered to reflect both frequent angle adjustments and inefficient movements by less-experienced surgeons. Regarding the smaller AL-Roll (sum of angle changes around the surgical device axis) in experts, it is considered that experts have good spatial ability that results in fewer angle adjustments (rotating the surgical device itself), and/or that they efficiently use their index fingers to rotate the shaft of a surgical device, a movement that is not reflected in AL-Roll.
Because this measurement system records the detailed positions (30 Hz) of multiple surgical devices simultaneously, it enables subsequent analyses based on "surgical expertise". In terms of PCA, in a more generalized cohort including laparoscopic surgeons (urology, general surgery, and gynecology) and medical students, efficiency-related metrics (e.g., ROB, G_DPL, G_AL-Roll, and S_DPL in Task 1, and operative time, R_PL, L_PL, R_DPL, and L_DPL in Task 3) significantly contributed to PC1, and speed-related metrics (e.g., S_v, C_v, S_a, C_a, S_j, C_j, and S_DV in Task 1, and L_j, L_a, L_High, L_v, and R_j in Task 3) to PC2, in line with a previous study.
Surgical skill evaluation facilitates surgical training and credentialing of competent surgeons. In this study, three classification methods, SVM, PCA-SVM, and GBDT, were evaluated for automatic skill assessment. SVM has been frequently used for computer-based discrimination of surgeons' expertise [6,12,13]. Because the many feature values utilized in the present SVM model might lead to a risk of overfitting to the original data, the PCA-SVM method was also utilized in this study. In this method, SVM classification is conducted after reducing the data dimension by PCA, which is expected to prevent overfitting. GBDT, which uses an ensemble of decision trees for target label prediction, has also been frequently utilized in machine-learning research [14–16]. As summarized in Table 2, GBDT showed the best accuracy in Task 1 (3-group discrimination: median accuracy of 68.6%, 2-group discrimination: 72.9%), and PCA-SVM in Task 3 (3-group discrimination: 58.4%, 2-group discrimination: 69.2%). In the 3-group discrimination of Task 3, SVM performed comparably to PCA-SVM, although PCA-SVM appears to be the more suitable method for Task 3 because it showed the best accuracy in both discriminations.
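The difference between the SVM and PCA-SVM classifiers described above can be sketched with scikit-learn pipelines. This is not the authors' configuration (their details are in S2 Table): the RBF kernel, the 90% explained-variance threshold for PCA, and the use of `RobustScaler` (which divides by the IQR without the 0.7414 factor) are assumptions, and the feature matrix is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the Mocap metrics: 89 sessions x 40 features,
# three experience groups (0 = novice, 1 = intermediate, 2 = expert).
n_features = 40
y = np.repeat([0, 1, 2], [30, 30, 29])
X = rng.normal(size=(y.size, n_features)) + 0.8 * y[:, None]

# Plain SVM: scaling followed directly by classification on all features.
svm = make_pipeline(RobustScaler(), SVC(kernel="rbf"))

# PCA-SVM: scaling, dimension reduction, then the same classifier.
# ASSUMPTION: keep enough components to explain 90% of the variance.
pca_svm = make_pipeline(
    RobustScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    SVC(kernel="rbf"),
)

svm.fit(X, y)
pca_svm.fit(X, y)

n_kept = pca_svm.named_steps["pca"].n_components_  # dimensions seen by the SVM
predicted = pca_svm.predict(X)
```

Placing PCA inside the pipeline means the components are re-estimated from the training portion only whenever the pipeline is cross-validated, so no information from the test fold leaks into the dimension reduction.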
Task 1 requires a wider range of skills (tissue traction/dissection, Hem-o-lok use, and division of vascular pedicles) than Task 3 (suturing/knotting), which probably explains the better discrimination results in Task 1. Regarding the outcomes of two-group discrimination (experts vs. intermediates/novices), the accuracy of around 70% in both tasks was similar to that in the study by Oropesa et al., wherein 42 participants performed 3 box trainer tasks (a peg grasping task, a task that requires placing three elastic bands through their corresponding posts, and a coordinated peg transfer task), kinematic data were captured by the TrEndo tracking system, and linear discriminant analysis (LDA), SVM, and an adaptive neuro-fuzzy inference system (ANFIS) were utilized to classify the participants according to prior surgical experience (>10 laparoscopic surgeries performed vs. <10) by leave-one-out cross-validation. The mean accuracy observed was 71% with LDA, 78.2% with SVM, and 71.7% with ANFIS. The misclassification in the current study may reflect the limited correlation between the previous caseload and actual performance. For example, active medical students may regularly perform dry box training of suturing/knotting, which would result in better kinematic outcomes, especially in Task 3. In addition, limitations inherent to each machine-learning algorithm, such as its configuration, training, and validation process, might have contributed to the misclassification. An external cohort is also necessary to validate this model, and larger training data and refinement of the algorithm are necessary in order to improve computer-based skill credentialing. Nevertheless, this study showed promising results in terms of automated skill credentialing based on kinematic tracking data of surgical devices in wet-lab training.
As another direction, in order to provide more user-friendly feedback, we developed a machine-learning-based GOALS scoring system based on Mocap metrics, using the current dataset and recorded movies [17]. GOALS is an already validated and widely-used assessment tool for grading laparoscopic surgical skills, and consists of five items: depth perception, bimanual dexterity, efficiency, tissue handling, and autonomy [18]. It was reported that this system could evaluate surgeons' skill with high accuracy (an error of approximately 1-2 points for a total score of 5-25 points). Taken together with the skill credential usage described in this paper, the authors believe that the motion data of instruments has promising value for surgical evaluation, which could provide valuable feedback to trainees, and mitigate the educators' workload.
Limitations of this study include the small sample size, lack of an external validation cohort, and limited correlation between previous case volumes and surgical skills, as mentioned above. Furthermore, in order to assess the educational benefit of Mocap analyses and computer-based objective skill assessment in simulation training, a computer program for on-site feedback to trainees needs to be developed.

Conclusions
A Mocap-based credential system using machine-learning classifiers provides a correct judgment rate of around 70% (two-group discrimination). Together with motion analysis and wet-lab training, simulation training could be a practical method for objectively assessing the surgical competence of trainees. The next challenge is to give objective feedback based on Mocap metrics to trainees immediately on-site, which could become an educational means together with mentors' feedback.
S1 Fig. Data flow of the validation process. Nested k-fold cross-validation consists of two validation processes: outer cross-validation (Outer CV) and inner cross-validation (Inner CV). Each cross-validation was conducted using 10-fold cross-validation. In each validation, the dataset was divided into 10 groups; 9 groups were used to train the model, and 1 group was used for testing. The accuracy of the model was evaluated by repeating this process 10 times so that all groups were evaluated. Note that Inner CV was conducted using the training data of Outer CV, whereas Outer CV was conducted using the entire dataset. The grid search for hyperparameter tuning was done in Inner CV. The best parameter set, which showed the highest accuracy among all candidate parameters, was used to build the model for Outer CV. In this study, nested k-fold cross-validation was repeated 100 times with different dataset divisions. (EPS)
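The validation scheme described above (grid search in an inner CV, accuracy estimated in an outer CV, the whole procedure repeated with different dataset divisions) can be sketched with scikit-learn as follows. Several details are assumptions rather than the authors' settings: stratified folds, the hyperparameter grid, averaging the outer-fold accuracies within each repetition, and `RobustScaler` as a stand-in for the robust Z-score. The data are synthetic, and the demo call uses 5 folds and 2 repetitions only to keep the run short; the text specifies 10 folds and 100 repetitions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC


def nested_repeated_cv(X, y, param_grid, n_splits=10, n_repeats=100):
    """Return one mean outer-CV accuracy per repetition."""
    accuracies = []
    for repeat in range(n_repeats):
        # A different seed per repetition yields a different dataset division.
        inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat)
        outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat)
        model = make_pipeline(RobustScaler(), SVC())
        # Inner CV: grid search runs only on the training part of each outer fold.
        search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="accuracy")
        # Outer CV: the tuned model is scored on the held-out outer fold.
        scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
        accuracies.append(scores.mean())
    return np.array(accuracies)


# Synthetic stand-in for 89 training sessions with three experience groups
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [30, 30, 29])
X = rng.normal(size=(y.size, 10)) + 0.8 * y[:, None]

# ASSUMPTION: illustrative hyperparameter candidates
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

acc = nested_repeated_cv(X, y, grid, n_splits=5, n_repeats=2)
median_accuracy = float(np.median(acc))
```

Keeping the grid search inside the outer loop is what makes the estimate "nested": the test fold of the outer CV never influences hyperparameter selection.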