Semi-automatic tracking, smoothing and segmentation of hyoid bone motion from videofluoroscopic swallowing study

Motion analysis of the hyoid bone via videofluoroscopic study has been used in clinical research, but the classical manual tracking method is generally labor intensive and time consuming. Although some automatic tracking methods have been developed, masked points could not be tracked and smoothing and segmentation, which are necessary for functional motion analysis prior to registration, were not provided by the previous software. We developed software to track the hyoid bone motion semi-automatically. It works even in the situation where the hyoid bone is masked by the mandible and has been validated in dysphagia patients with stroke. In addition, we added the function of semi-automatic smoothing and segmentation. A total of 30 patients’ data were used to develop the software, and data collected from 17 patients were used for validation, of which the trajectories of 8 patients were partly masked. Pearson correlation coefficients between the manual and automatic tracking are high and statistically significant (0.942 to 0.991, P-value<0.0001). Relative errors between automatic tracking and manual tracking in terms of the x-axis, y-axis and 2D range of hyoid bone excursion range from 3.3% to 9.2%. We also developed an automatic method to segment each hyoid bone trajectory into four phases (elevation phase, anterior movement phase, descending phase and returning phase). The semi-automatic hyoid bone tracking from VFSS data by our software is valid compared to the conventional manual tracking method. In addition, the ability of automatic indication to switch the automatic mode to manual mode in extreme cases and calibration without attaching the radiopaque object is convenient and useful for users. Semi-automatic smoothing and segmentation provide further information for functional motion analysis which is beneficial to further statistical analysis such as functional classification and prognostication for dysphagia. Therefore, this software could provide the researchers in the field of dysphagia with a convenient, useful, and all-in-one platform for analyzing the hyoid bone motion. Further development of our method to track the other swallowing related structures or objects such as epiglottis and bolus and to carry out the 2D curve registration may be needed for a more comprehensive functional data analysis for dysphagia with big data.


Introduction
Oropharyngeal dysphagia is caused by various diseases such as stroke, Parkinson's disease, neuromuscular diseases, and head and neck cancer [1]. The prevalence of dysphagia is expected to increase taking into consideration an ageing population and the increase of incidences of diseases related with dysphagia [2,3].
Videofluoroscopic swallow study (VFSS) is considered to be the gold standard tool in the assessment of patients with dysphagia [4]. Most of the interpretation of VFSS in the clinical setting is qualitative or semi-quantitative and depends on subjective decision by an interpreter. Some clinicians or researchers are using temporal parameters (e.g. oral transit time, pharyngeal transit time) or kinematic parameters from motion analysis to overcome the qualitative nature of VFSS and to gain more data to classify the dysphagia, to predict the prognosis or to assess the treatment effect [5][6][7][8]. The hyoid bone is most commonly selected in kinematic analysis [9][10][11][12][13]. Both displacement and velocity of the hyoid bone excursion are associated with swallowing function and dysphagia; the maximum excursion and peak velocity of the hyoid bone motion are associated with bolus volume [14], the hyoid bone anterior displacement is reduced in patients with myopathy and irradiated nasopharyngeal carcinoma [5,15] and laryngeal elevation velocity was an independent predictor of aspiration in patients with acute ischemic stroke [16]. Therefore, the parameters from hyoid bone motion analysis provide some meaningful solutions in research or clinical practices. However, the classical manual tracking methods are labour intensive and impractical in real clinical practices [17,18].
To overcome this limitation, researchers have tried to develop a software to track the hyoid bone and to get the trajectory automatically. Recently, Patrick et al. [19] have reported their computer-assisted assessment of hyoid bone motion and found a high correlation between automatic tracking and manual tracking. This software can reduce the burdens for VFSS motion analysis and make further quantitative analysis practically possible. We developed a new algorithm and tried to resolve the limitations of the existing software. One limitation is to track the masked points. We have developed an algorithm in our software to track the masked point. Automatic smoothing and calibration have also been developed. Finally automatic segmentation of hyoid bone motion has been developed and added to the software, which is necessary for further quantitative motion analysis. The efficiency and accuracy of the automatic or semi-automatic process from hyoid bone tracking and smoothing to segmentation enables the motion analysis of VFSS to have a potential wide use in clinical practice and research. The software is validated in patients with stroke.

Materials and methods
Subjects and experimental design VFSS data and medical information for stroke patients were retrospectively reviewed from the database of VFSS movie files and medical records in Seoul National University Bundang Hospital. A total of 30 patients' data (mean age: 62.0 ± 11.4 yrs, 23 men and 7 women) were used to develop software (tracking, smoothing and calibration, and segmentation) and 20 loops from 17 patients' trajectories (10 loops from 8 unmasked trajectories and 10 loops from 9 masked trajectories) were used to validate the semi-automatic tracking method. One loop for each subject's trajectory of their hyoid bone was usually detected and extracted while two loops were obtained in subjects 5, 11 and 16. VFSS was tested in subjects with dysphagia after stroke with foods in various forms, including fluid, thickened fluid, a semi-blended food and boiled rice, which was the modified protocol of Logemann [20]. Each food was fed by spoon. The lateral projection of the VFSS taken during the 2-ml thin-fluid swallowing was used for software development and validation. VFSS were recorded at frequency of 30 frames per second.
One researcher performed the manual tracking and automatic tracking of hyoid bone from VFSS clips. When one type of tracking was performed, the tester did not consult the result of another type of tracking. Another researcher, who was blind to the results obtained by the first person, performed the manual tracking again and the data were used to assess the interrater reliability. The research protocol was approved by the Seoul National University Bundang Hospital institutional review board and was conducted in accordance with the regulatory standards of Good Clinical Practice and the Declaration of Helsinki (World Medical Association Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects, 2000).
Overview of the all-in-one platform for the motion analysis of hyoid bone Inspired by the method by Kellen et al. [19], in our study we specify a target point on the hyoid bone on one frame and then track the target automatically for the whole video sequence. The ROI window size by default should be large enough so that it is not very sensitive to the smaller movements of the hyoid bone. Each frame of the sequence is then processed to track the ROI centered at the target point across frames. The tracking of the ROI, partly masked by other objects such as the mandible, in some frames has been considered in our methodology. Furthermore, we consider the situation where the tracking process might collapse due to the existence of unidentifiable and invisible ROI in some frames. An automatic monitoring and indication mechanism has been added, enabling us to re-specify the target point and reset the window size of ROI and then resume the tracking process. In order to correct for the subject's head motion during process, a new coordinate system is defined via the anterior-interior border of the second and fourth cervical vertebrae across the entire procedure. Semi-automatic smoothing via a cubic smoothing are added for those target points in the hyoid bone and in the cervical vertebra for the sake of reducing tracking errors. Our platform also emphasizes the segmentation. After selecting one desired loop from the data, the automatic segmentation is carried out. By analyzing the first and second derivatives, a definition of splitting score is introduced and used for an automatic segmentation. This provides necessary and useful information for clinical assessment and further statistical analysis such as functional classification. The code for data tracking and for data pre-processing like semi-automatic smoothing, calibration, validation and segmentation is based on MATALAB (R2014a) and RStudio Version 0.99.484, respectively.
In summary, the main contributions of the present work include, (a) the development of a new algorithm based on the existing method by Kellen et al. [19] to track the masked part of the hyoid bone and a dynamic monitoring mechanic to fix the wrong-tracking problems in time; (b) the development of semi-automatic smoothing and calibration for reducing tracking errors; and (c) the development of a new method of automatic segmentation of hyoid bone motion, which could provide the researchers in the field of dysphagia a convenient, useful, and all-in-one platform for the analysis of hyoid bone motion.

Procedures of tracking
To define the template ROI, the user uses the mouse to identify any target point on the hyoid bone and then one square centered at it with default side lengths can be created automatically. The target point, together with this square (called ROI or template) will be tracked automatically frame by frame, utilizing the information from horizontal and vertical edge images calculated by Sobel edge operators [21]. The key point is to minimize the sum of the squared differences between the local edge characteristics in the templates and that in the images via rotation and shift over a suitable neighborhood. The best match to the template in the next frame can then be found. The tracking process can be iterated by updating the positions of both ROI and the target point.
When it comes to extreme situations, for example the ROI or the target point on the hyoid bone being covered by other objects like the lower part of the mandible, the tracking method [19] no longer works because the local edge characteristics of the ROI are heavily interrupted. So far there has been little research on addressing this problem.
Technically, the masked ROI refers to the ROI totally or partly overlapping with other objects, which often happens over just a few frames of the whole video sequences. In this case, it is hard to locate the target point even by eyes. The idea is to cut off the masked part in the current ROI. Besides the target point, it is necessary to track another two points along the edge of the lower part of the mandible. The locations of these two points should be flexible and the distance between them wide enough to guarantee that the line segment connected by them can approximate the lower edge of the mandible properly. We then check whether the current ROI crosses this line. The masked part above the line will be cut off if it crosses; otherwise, it will be treated as a normal case. A new ROI 0 can therefore be obtained using the normal tracking procedure (Fig 1). Another key point is the requirement of averaging the template matching error over the number of pixels within the new ROI 0 .
Assuming i 0 is the number of the reference frame, in which the template ROI(i 0 ) is produced, and i represents the current frame number, the template in the current frame is supposed to move with a rotation angleỹ and shift ðx;ỹÞ T . After this transformation, the removal of the hidden part of the template (Fig 1) is required. As a result, a new ROI 0 (i + 1) in the (i+1) th frame, which is not a square any more, is obtained. All the x coordinates and the corresponding y coordinates of the pixels within ROI 0 (i + 1) are saved in the vectors ROI 0 x (i + 1) and ROI 0 y (i + 1) respectively. The ROI(i 0 ) needs to be changed to ROI 0 (i 0 ) in the same way to make their local edge characteristics E x and E y comparable. The template matching error Dðx;ỹ;ỹÞ is then given by: where N(i + 1) represents the total number of the pixels within ROI 0 (i + 1). This equation is minimized by a global optimization method GA with a constrained search space [22]. In our implementation, the search constraints are: -2.5π/180 ỹ 2.5π/180, and À 5 x; ;ỹ 5. The tracking for the next position of the template and target point in this special case can then progress. However, the tracking process may become unstable if the ROI is totally masked by the mandible. In this rare case, we may estimate the underlying point by the points tracked in nearby frames or to track it manually.
In some rare situations, such as the hyoid bone moving suddenly or too fast, the target point is hardly recognizable (Fig 2). The optimal search is not applicable to those circumstances; therefore, a sensitive monitoring mechanic should be used to avoid possible wrong tracking. Kellen et al. [19] used a prediction model to initialize the new point position for the sake of improving tracking accuracy. We used a similar idea but for monitoring purpose in our package. The prediction model and displacement error for the next position of the target point are given by: where Δt is the inter frame period, i is the current frame number, H(i) and H(i + 1) are the hyoid bone's current and next position, and _ HðiÞ, H _ _ ðiÞ and H _ _ _ ðiÞ are respectively the first, second and third derivative of H(i). Given the previously acquired H(i), the Δ (absolute difference betweenHði þ 1Þ and the current H(i + 1)) should be less than a threshold (8 pixel units in our implementation). Otherwise, the tracking will be regarded as a failure. The software will automatically review the possible wrong-tracking frames backwards and forwards to identify the accurate sequence number of the first failure frame. After confirming it, we can manually adjust the window size of ROI(i), and go back to the tracking procedure by re-specifying the same target point in the ith frame by mouse click and continue the tracking process.

Smoothing and calibration
The tracking described in the previous subsections is based on the coordinate system of which the origin is located at the bottom-left corner of the image (image-based coordinate system). The problem is that there is usually a huge body or head motion which will blur the movement of hyoid bone during the swallowing process. Removal of this kind of error can be done by using a new patient-centric coordinate system [19,23]. Practically, it seems to be much more convenient and more efficient to define the new coordinate system that is just based on two special points defined in the following way. As described by the previous study [24], the y-axis of the patient-centric coordinate system is defined as a straight line connecting the anteriorinterior border of the fourth cervical vertebra to that of the second cervical vertebra. The xaxis is defined as a line vertical to the y-axis crossing the origin, C4, as seen in Fig 3B. The points C4ðx c 4 ; y c 4 Þ and C2ðx c 2 ; y c 2 Þ can be tracked at the same time using the methods illustrated in the previous subsections over the entire video sequences (Fig 3A). To reduce the tracking errors, smoothing is carried out for both the target point in the hyoid bone and the two tracking points in the cervical vertebra using a cubic smoothing spline. The degree of smoothing is controlled by the smoothing parameter, which can be adjusted by the operator to avoid over-fitting (Fig 3D). Then all the data is calibrated by the vertical distance from C4 to C2. Given those two points, the target point H(x,y) in the image-based coordinate system can be transformed to H 0 (x 0 ,y 0 ) in the patient-centric coordinate system by a simple rotation and shift. The formula is given by: The effect of smoothing on reducing tracking errors is demonstrated in Fig 3C. The calibration procedure aims to reduce the errors caused by head motion and make the data collected from different subjects comparable. Our later data analysis will be based on the trajectory after smoothing and calibration, of which the coordinate is based on cervical units (CU) (One cervical unit is defined as the distance (in pixel) between C2 and C4, i.e. |C2−C4|). Dividing one loop of the trajectory into certain phases is necessary for the assessment of the hyoid bone movement and useful for further statistical analysis. However, little research has been carried out in this area, particularly on automatic segmentation. In terms of the concept of phases, Isao Kaneko [22] performs a quantitative study dividing manually the movement into 5 phases:1 st elevation phase, 2 nd elevation phase, static phase and 1 st and 2 nd descending phase. Koichi Yabunaka et al. [25] have also done similar work on sonographic assessment of the hyoid bone movement during swallowing by segmenting the movement into 4 phases: Elevation, Anterior, Remain and Return. We developed a semi-automatic segmentation method. After manually selecting one complete desired loop from the entire raw trajectory, an automatic segmentation is conducted.
For simplicity, we use points (x(t),y(t)) to represent x and y coordinates after calibration and smoothing in the tth frame. Two ends of the manually identified complete loop are denoted by t A and t B . The technique is to acquire one desired time interval including one peak in y(t) and one valley in x(t) at the same time. These two points can be easily chosen by human eyes. For instance, the left panel of Fig 4A shows that it is easy to identify the peaks and valleys in these two marginal curves. The end points t A and t B are chosen such that the interval (t A ,t B ) contains both one peak in the upper curve (y(t)) and one valley in the lower curve (x(t)). Furthermore, the distance between (x(t A ),y(t A )) and (x(t B ),y(t B )) should be as small as possible (see left panel of Fig 4B). The ideal situation is that the distance equals to zero, i.e., the starting point A and ending point B overlap.
Once the desired loop is obtained, automatically dividing the movement into different phases over (t A ,t B ) can be carried out via analyzing the corresponding velocity amplitude . Specifically we can find the splitting points t from the following equations: Conditions (A) and (B) can guarantee in finding all the local minima in the velocity curve (see the curve in blue in the left panel of Fig 4C). Those local minima are interesting splitting points, which can be directly used in segmentation in most cases. The corresponding segmentation result is shown in the right panel of Fig 4C. In this case the three splitting points, as well as the start point and end point, are used to split the trajectory into four phases: elevation phase, anterior movement phase, descending phase and returning phase.
It is not uncommon that more than three candidate splitting points could be obtained based on the conditions (A) and (B) (Fig 5C), particularly for stroke patients. Therefore, we propose to identify the three best splitting points based on a new measurement, namely, splitting score. Assume there exist n−2 candidate splitting points: t 2 ,t 3 ,Á Á Át n−1 (in addition of the start point t 1 and end point t n ). For the ith splitting point t i (i 2 [2,n−1]), using the velocity amplitude v(t i ), we can define the following measures.
(a) Forward Splitting Score: FSSðt i Þ ¼ maxðvðtÞÞ À vðt i Þ; t 2 ½t iÀ 1 ; t i (b) Backward Splitting Score: (c) Splitting Score: Actually, the SS value can be regarded as the turning intensity for the candidate points, the larger, the better. Those t i 's with the top 3 Splitting Scores are chosen as the desired splitting points. Fig 5C shows that there are 6 candidate splitting points satisfying the conditions of (A)

Validation and statistical analyses
We tracked 20 loops from 17 subjects using our semi-automatic tracking methodology. Next, each swallow was also tracked manually by two independent trained observers, who were instructed to track one recognizable and fixed target point on the hyoid bone across all frames by clicking mouse. For the same swallow, we compare the two different trajectories tracked by automatic computer-assisted method and manually by one of those two observers. Similar to the previous study [19], we used Pearson correlation coefficients, as well as the range of motion (ROM) in the x-axis direction, y-axis direction and the whole 2D situation, to measure the difference. Relative errors (%) were used to measure the degree of agreement: jROM in automatic trackingÀ ROM in manual trackingj ROM in manual tracking × 100%. Apart from the raw trajectories (raw data without smoothing and calibration), five more comparisons were considered in our validation: RawNC (Raw data transforming to a new coordinate system without being scaled), RawNCC (Raw data with both coordinate system alternation and scaling), Smo (Raw data with smoothing), SmoNC (RawNC data with smoothing), SmoNCC (RawNCC data with smoothing). Continuous variables are presented as mean ± 1 SD. Parameters generated were compared between aspiration and non-aspiration groups using an independent t-test. Furthermore, in order to justify the current validation methodology, the Intraclass correlation and Pearson correlation are utilized to measure the inter-rater reliability between those two observers. We calculate those two measurements for both x coordinates and y coordinates of three points' locations in each frame by two raters' manual tracking. As mentioned before, those points are respectively located at the bottom left of the hyoid bone, anterior-interior border of the second cervical vertebra and that of the fourth cervical vertebra. Statistical analysis was performed using RStudio Version 0.99.484. Raw data used for the analysis are available from S2, S3 and S4 Files. Fig 7 show both computer defined and manual defined trajectories corresponding to six cases for two typical data sets, one from the unmasked group and the other from the masked group. Overall, two trajectories in each case match well (high Pearson correlation coefficients and low relative errors between manual tracking and automatic tracking in Table 1). The slight difference may be caused by different target points on the hyoid bone identified by the computer and the trained observer. Table 1 shows Pearson correlation coefficients and relative errors between manual tracking and automatic tracking, and range of motion from each tracking method. We can see that all of coefficients are in the interval between 0.942 and 0.991 (P-value<0.0001) and the relative errors in terms of x-axis, y-axis and 2D range of hyoid bone excursion range from 3.3% to 9.2%. Overall, the case of proper smoothing usually outperforms the other. Table 2 shows the average of Intraclass correlation and Pearson correlation coefficients for both x coordinates and y coordinates range from 0.995 to 0.999 (P-value<0.0001). It indicates a high consistency of quantitative measurements made by those two independent trained observers, which provides a justification of the methodological errors in our study.

Fig 6 and
According to our automatic segmentation method, most of the hyoid bone motion (14 out of 30 subjects) can be segmented automatically into four phases (Fig 8A and S1 File). For subjects 5, 7, 8, 22, 23 and 24, there is an error in automatic segmentation but the four phases can be segmented manually (Fig 8B and S1 File). For subjects 3, 9, 14 and 18, only 3 phases can be  Automatic tracking of VFSS motion segmented (no returning phase) (Fig 8C and S1 File). The trajectories of the remaining six subjects (number 4,15,20,26,29,31) are very abnormal and cannot be segmented to any typical phases (Fig 8D and S1 File).

Discussion
In this study with VFSS data from stroke patients, the trajectories tracked by our semiautomatic tracking method show high agreements (high Pearson correlation coefficient and low relative errors) with those obtained by the manual tracking (Table 1). In addition, the new automatic algorithm used to track the masked target points by the mandible also shows high agreement with the manual tracking (Fig 7 and Table 1). Our software has several potential advantages in tracking the hyoid bone, compared to the previous ones. First, the marker to track the hyoid bone [26] is not applied in this method. The marker attachment is uncomfortable for a patient, and different marker positions and detachments during swallowing might result in biased measurements. Second, we use the cervical spine height as the reference distance unit for calibration and demonstrate that this method works reliably in practice. The number of pixels used as the distance in the previous study [19], cannot represent the real distance because the real distance per one pixel can vary depending on the angle and distance from the x-ray source to the subject. Some studies use radiopaque markers such as coins [5] and it can more precisely calculate the exact distance. However, the marker attachment is not comfortable and perpendicular x-ray projection has to be warranted. In addition, it is recently recommended to use the normalized units (e.g. percentage of the distance between C2 and C4 vertebrae), in order to reduce the magnification artifacts or sex-based differences [10,27]. Therefore, the calibration method used in our software may be more suitable for further statistical analysis with large data. Third, the ability to track the masked point can save a lot of since no manual interfering is required during the automatic tracking. Fourth, our software can detect and trigger the automatic tracking failure and provides a chance to fix the problem manually. For example, in rare cases such as those shown in Fig 2, automatic tracking is not possible. Our algorithm can automatically notify the frame with error enabling save a lot of researcher's time. Fifth, we add the semi-automatic smoothing process. The software aims at capturing the target points separately in each frame from the VFSS data, resulting in the rough trajectory connected by those points. In order to reduce the measurable errors, a semi-automatic smoothing algorithm was applied. The automatically smoothed wave can be reviewed and adjusted by the operator. Overall, this semiautomatic algorithm provides a way to track the motion more precisely, conveniently and efficiently.
The motion of anatomical structures such as the hyoid bone and epiglottis are related with the severity of dysphagia or treatment effect [28,29]. Motion data from VFSS can be used for further studies using complex statistical methods, e.g. functional classification with machine learning or a prognostic model. Segmentation and registration are the further required process for the functional data analyses [30]. For instance, quantitative neuroimaging research often requires the segmentation of brain imaging data [31]. However, the manual segmentation is too time-consuming and is unreliable [32]. In the dysphagia research, segmentation process is applied in the data from accelerometer or acoustic signal [33,34] but no study has used the segmentation process in the VFSS motion 2D data. From one loop of hyoid bone motion trajectory and velocity curve, four phases can be divided automatically (Fig 5 and S1 File) using our software. Tracking various types of hyoid bone trajectories during swallowing in dysphagia patients and obtaining the characteristics and features after the segmentation process can help researchers for further study.
Our software needs to be further developed to improve the calibration process. Except the body and head motion, Emsudina et al. [32] also consider the effect of the position and the movement of the lower edge of the mandible on the hyoid bone motion. Those methods may improve the ability to accurately track the motion of an anatomic structure in the head and neck, but it requires a much more complicated system including more fixed points' position and particular directions in each frame while doing the calibration. It is not our main task of the current study but will be considered in our future work.
We also need to carry out 2D curve registration for the hyoid bone motion data. The deformations or displacements, termed phase variation, always arise in those curves, which can be shown through the different locations of splitting points while doing automatic segmentation (S1 File). The presence of phase variability often increases data variance and alters underlying data structures [35], we will use landmark registration or other methods, such as similarity index based registration method [36], SRV and FR Metric based multi-dimensional alignment method [37]. The tracking of other structures such as epiglottis and radiopaque bolus is also on ours. Those further developments can automatically make the output ready for further study using complex statistical methods. Using cloud computing systems, such as Microsoft Azure ML, Amazon Machine Learning or IBM Watson Analytics, can facilitate developing the dysphagia functional classification system or prognostic model with VFSS motion and clinical data, to assess new patients whenever the data is uploaded.