Investigating the impact of a motion capture system on Microsoft Kinect v2 recordings: A caution for using the technologies together

Microsoft Kinect sensors are popular low-cost RGB-D sensors and are widely employed in various applications. Consequently, several studies have been conducted to evaluate the reliability and validity of Microsoft Kinect sensors, and noise models have been proposed for the sensors. Several studies utilized motion capture systems as a gold standard to assess the Microsoft Kinect sensors, and none of them reported interference between Kinect sensors and motion capture systems. This study aimed to investigate possible interference between a gold standard (i.e., Qualisys) and Microsoft Kinect v2. The depth recordings of Microsoft Kinect sensors were processed to estimate the intensity of interference. A flat non-reflective surface was utilized, and the smoothness of the surface was measured using Microsoft Kinect v2 in the absence and presence of an active motion capture system. The recording was repeated at five different distances. The results indicated that Microsoft Kinect v2 recordings are distorted by the motion capture system and that the distortion increases with the distance between the Kinect and the region of interest. Regarding the results, it can be concluded that the gold standard motion capture system is robust against interference from the Microsoft Kinect sensors.


Introduction
In 2010, Microsoft, in cooperation with PrimeSense, introduced an RGB-D camera called "Kinect". Initially, the Microsoft Kinect was developed as a gesture-based game controller for the Microsoft Xbox 360. The device is equipped with RGB and near-infrared (NIR) sensors and an NIR projector, and it estimates the depth of the viewed area based on a structured-light principle [1][2][3]. Within a year, Microsoft released official drivers and a software development kit (SDK) for Kinect for non-commercial use [1]. Microsoft SDK v1.8 [4], Open Kinect SDK [5], and OpenNI SDK [6] were developed to generate a human joint skeleton in 3-dimensional space based on the captured information. Hence, the Kinect sensor is utilized not only as an RGB-D sensor but also as a natural user interface, and it is frequently employed as a marker-less human motion tracking system in robotic applications [7,8], posture and daily activities [9][10][11], rehabilitation [12][13][14], virtual reality, and exergames [14][15][16]. Four years later, Microsoft improved the Kinect sensor capabilities and released the second generation of the Microsoft Kinect. The sensor's specifications in this generation were significantly enhanced, including a Full-HD RGB camera; the main improvement, however, concerns the generation of the depth map. The second generation of Kinect, called "Kinect Xbox One" or "Kinect v2" in the literature, generates depth information of the scanned area based on the time-of-flight (ToF) principle [17,18].
The Microsoft Kinect SDKs provide an estimate of the body joints of the head, torso, and upper and lower limbs in 3-dimensional space. Hence, many scientific studies have been conducted to evaluate the reliability and validity of the calculated skeleton joints [11][19][20][21][22][23][24]. The majority of these studies employed marker-based motion capture systems as a gold standard. In general, these motion capture systems are equipped with several IR cameras, each with a built-in array of infrared LEDs illuminating retroreflective markers placed at strategic positions on the body. The cameras usually surround the area of interest. The tracking software collects data from all the cameras and estimates the position of the retroreflective markers by triangulation. Therefore, IR retroreflective markers are mounted on the moving object of interest, and the trajectory of the object moving in space is estimated from the marker positions.
Since all these systems are equipped with IR-spectrum cameras and capture infrared images, interference between the Kinect sensors and motion capture systems is possible, which might result in added noise in the recordings.
A wide variety of skeleton and posture tracking algorithms have been proposed for the Microsoft Kinect sensors, such as the Microsoft Kinect skeleton SDK, the OpenNI SDK, and several other custom algorithms [31][32][33][34]. The skeleton tracking algorithms for Microsoft Kinect sensors use the estimated depth maps as raw input data to generate the 3D skeleton output. These algorithms calculate the position of the predefined joints in 3D space by extracting body point clouds and segmenting body parts from the depth maps. Thus, distortion in the estimated depth maps might lead to inaccuracies in the generated 3D skeleton. Therefore, in this study, distortion in the acquired depth images was the focus of our inquiry.
In this study, it was hypothesized that the motion capture system will not interfere with Microsoft Kinect v2 sensor depth recordings. Consequently, the primary aim of this study was to investigate the impact of passive and active interference (introduced by retroreflective markers and cameras) on the depth recordings of Microsoft Kinect v2.

Microsoft Kinect sensors
The first generation of Microsoft Kinect, which in this study is called "Kinect v1", emits a speckle pattern using infrared laser projectors. Kinect v1 estimates the depth map using a structured-light method. In this approach, the disparity of the reflected speckle pattern in the captured image is compared with a reflected speckle pattern at a known distance. More details about Kinect v1 are available in [35][36][37][38]. Fig 1A illustrates an infrared record with the corresponding generated depth map of the RGB scene.
The second generation of Microsoft Kinect, which in this paper is called "Kinect v2", is equipped with higher resolution image sensors and a wider horizontal and vertical field of view (FOV). The operational range and resolution of Kinect v2 are enhanced by using a time-of-flight technique to estimate the depth (see Fig 1B).
The Kinect v2 laser projector emits square-wave NIR light with known modulation frequencies and receives the light reflected from the object using a CMOS sensor with a very high sampling rate (2 GS/s). Increasing the modulation frequency decreases the uncertainty in depth, but at the cost of increased aliasing (a shorter unambiguous range). Microsoft Kinect v2 employs two high modulation frequencies of 80 MHz and 120 MHz and a low modulation frequency of 16 MHz to eliminate this ambiguity while keeping the depth uncertainty low. In addition to the three intensity modulation frequencies, phase reconstructions with 0˚, 120˚, and 240˚ phase shifts are utilized to construct depth information. Further information about Kinect v2 and ToF cameras is available in [18,39,40].
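The trade-off between modulation frequency and aliasing can be illustrated with the standard continuous-wave ToF relation, in which the phase wraps after a distance of c / (2f). The sketch below is only an illustration of that textbook relation, not the Kinect v2 firmware; the function names are hypothetical.

```python
import math

# Speed of light in vacuum (m/s).
C = 299_792_458.0


def unambiguous_range_m(mod_freq_hz: float) -> float:
    """Distance after which the measured phase wraps around: c / (2 f)."""
    return C / (2.0 * mod_freq_hz)


def phase_to_distance_m(phase_rad: float, mod_freq_hz: float) -> float:
    """Distance implied by a phase shift, known only modulo the unambiguous range."""
    return (phase_rad / (2.0 * math.pi)) * unambiguous_range_m(mod_freq_hz)


# The three modulation frequencies named in the text.
for f_mhz in (16, 80, 120):
    print(f"{f_mhz} MHz: unambiguous range {unambiguous_range_m(f_mhz * 1e6):.2f} m")
```

Under this relation the 16 MHz signal wraps only after roughly 9.37 m, whereas the 80 MHz and 120 MHz signals wrap after roughly 1.87 m and 1.25 m, which is why a low frequency is combined with the high ones.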

Marker-based motion capture systems
Motion capture systems are used for precisely tracking fast human activities in 3 dimensions, and these systems are considered the gold standard in motion tracking when assessing Kinect system performance. The marker-based motion capture systems (MB-MoCap) are generally divided into two categories: active marker-based and passive marker-based systems. Both categories utilize a triangulation method to estimate the position of the markers in the covered space. The active markers emit light, while the passive markers reflect the light emitted from arrays of LEDs that are mounted on the cameras. A passive marker is often called a "retroreflective marker". It should be mentioned that the majority of passive marker motion capture systems are also compatible with active markers. Table 1 compares the most common marker-based motion capture systems.

Emitted lights spectrum
In this section, a preliminary investigation of the spectrum of the light projected by the Kinect sensors, the Qualisys motion capture system, and the laboratory ambient lighting is presented. The emitted light was recorded using an Ocean Optics 2000 spectrometer (Ocean Optics, Largo, USA) for each projector individually. The spectrometer is equipped with a silicon detector, which can quantize the light intensity from 200nm up to 900nm. Fig 2 depicts the spectrum recorded using the spectrometer. Since the spectrometer recording range was limited to 900nm, the right side of the projected beam of the Qualisys camera was estimated based on the expected distribution.
The pre-analysis showed that part of the strobe light emitted from the Oqus 300/310 cameras was beyond the measuring range of the spectrometer (see Fig 2). Hence, assuming the emitted light was normally distributed, the spectrum above 900nm was estimated.
Based on the estimation, 95% of the light projected from the Qualisys cameras (Oqus 300/310) was within the range of 800nm to 900nm, with the peak intensity at 854nm. Similarly, the Microsoft Kinect v2 projectors emitted NIR rays with a peak intensity at 850nm but a very narrow bandwidth (842.7nm to 859.23nm). In contrast, Microsoft Kinect v1 emitted its structured pattern within the 812.55nm to 841.6nm spectrum, with a peak intensity at 827nm (see Fig 2).
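To make the spectral overlap concrete, the following sketch evaluates a normal-distribution model of the Qualisys emission at the Kinect wavelengths reported above. It is a rough illustration under stated assumptions: the standard deviation is derived here by treating the 800nm to 900nm interval as a symmetric 95% interval (which is only approximate, since the peak is at 854nm), and all function and constant names are hypothetical.

```python
import math

QUALISYS_PEAK_NM = 854.0
# Assumption: 800-900 nm is treated as a +/-1.96 sigma interval of a normal curve.
QUALISYS_SIGMA_NM = (900.0 - 800.0) / (2.0 * 1.96)

KINECT_V2_BAND_NM = (842.7, 859.23)
KINECT_V1_BAND_NM = (812.55, 841.6)


def normal_cdf(x: float, mean: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def fraction_in_band(lo_nm: float, hi_nm: float) -> float:
    """Share of the modelled Qualisys emission that falls inside [lo_nm, hi_nm]."""
    return (normal_cdf(hi_nm, QUALISYS_PEAK_NM, QUALISYS_SIGMA_NM)
            - normal_cdf(lo_nm, QUALISYS_PEAK_NM, QUALISYS_SIGMA_NM))


def relative_intensity(wavelength_nm: float) -> float:
    """Modelled Qualisys intensity at a wavelength, relative to its peak (0..1)."""
    z = (wavelength_nm - QUALISYS_PEAK_NM) / QUALISYS_SIGMA_NM
    return math.exp(-0.5 * z * z)


print("Share inside Kinect v2 band:", round(fraction_in_band(*KINECT_V2_BAND_NM), 3))
print("Share inside Kinect v1 band:", round(fraction_in_band(*KINECT_V1_BAND_NM), 3))
print("Relative intensity at 850 nm:", round(relative_intensity(850.0), 3))
print("Relative intensity at 827 nm:", round(relative_intensity(827.0), 3))
```

Under this model the Qualisys emission is close to its maximum at the Kinect v2 peak wavelength and noticeably weaker at the Kinect v1 peak, although the exact numbers depend entirely on the assumed standard deviation.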

Previous studies
Several studies have investigated the accuracy and precision of the estimated depth maps in both generations of the Microsoft Kinect sensors and provided noise models for the sensors [35,41,42]. Mallick et al. [43] divided Kinect noise sources into spatial noise, temporal noise, and interference noise.
Spatial noise might come from axial error, lateral error, object medium, and sensor specifications [43]. Nguyen et al. [44] modeled the axial and lateral noise in Kinect v1 using a flat surface with quadratic and linear functions, respectively. Choo et al. [37] improved the noise models using both a flat surface and a 3-dimensional checkerboard. In contrast, Pagliari and Pinto [45] showed that the axial noise level in Kinect v2 is more stable at different depths and that, unlike in Kinect v1, it does not follow a quadratic model. They modeled the error in the estimated depth of Microsoft Kinect v1 with a second-order polynomial function that increases with the distance from the target object, while they reported that Microsoft Kinect v2 introduces much less error, which increases slightly and linearly with distance. Fankhauser et al. [41] showed that the axial noise increases significantly with the angle of the Kinect v2 sensor with respect to the surface. The object medium plays an essential role in estimating its depth. Previous studies found that Kinect sensors could not correctly evaluate the position of objects made of transparent, reflective, or IR light-absorbing materials such as water bottles, mirrors, or leather fabrics [43,46].
In the literature, two sources of interference were presented: ambient light and multiple sensors. The evidence showed Kinect v1 to have poor performance in the presence of high-intensity, wide-spectrum ambient light such as sunlight or halogen lamps [47,48], while Kinect v2 is more robust to the same disturbing light sources [41,48].
Using multiple Kinect sensors of the same generation could also cause interference between the sensors. The impact of the interference between two Kinect v1 sensors was evaluated in several studies, and the results indicated considerable distortion when the angle between the two sensors was less than 60 degrees [49][50][51]. Sarbolandi et al. [48] showed that using two Kinect v1 sensors increased the unknown depth areas by about 10%, while using two Kinect v2 sensors simultaneously introduced repetitive interference between the sensors.

Data collection
In this study, five series of recordings at different distances were performed to evaluate mutual interference between the Microsoft Kinect v2 and a marker-based motion capture system used as a gold standard for tracking physical activities. The Qualisys passive marker-based motion capture system was utilized as the gold standard.
The region of interest (ROI) was provided by hanging a bulletin board in the middle of the laboratory. The bulletin board measured 120cm×120cm and was covered with non-reflective, white, coarse cotton fabric. Four retroreflective markers were placed on the corners of the bulletin board. The Qualisys cameras were adjusted so that five of the cameras viewed the bulletin board in the middle of the recording area, while the other three cameras were behind the bulletin board.
In the first series of recordings, the Microsoft Kinect v2 was placed 120cm from the bulletin board, covering the whole surface of the board. For the next four recordings, the Kinect sensor was moved away in steps of 100cm. The distances between the sensor and the bulletin board were chosen within the recommended range for capturing depth images with the Microsoft Kinect v2 [18,52]. The distance between the bulletin board and the sensor was measured using a Leica DISTO D2 (Leica, Wetzlar, Germany) laser distance meter. The sensor was pointing roughly perpendicular to the bulletin board surface, while the center of the RGB camera was aligned with the center of the bulletin board (see Fig 3).
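The recording distances and the approximate size of the board in the depth image can be sketched as follows. The depth image width (512 pixels) and horizontal field of view (70.6˚) are commonly cited Kinect v2 specifications that are assumed here rather than taken from this study, and the ideal pinhole model ignores lens distortion; all names are hypothetical.

```python
import math

# Assumptions (not from this study): commonly cited Kinect v2 depth camera specs.
DEPTH_WIDTH_PX = 512
DEPTH_HFOV_DEG = 70.6

# Values from the text.
BOARD_SIDE_CM = 120.0
FIRST_DISTANCE_CM = 120.0
STEP_CM = 100.0
N_SERIES = 5


def recording_distances_cm() -> list:
    """First recording at 120 cm, then four more in steps of 100 cm."""
    return [FIRST_DISTANCE_CM + k * STEP_CM for k in range(N_SERIES)]


def board_width_px(distance_cm: float) -> float:
    """Approximate board width in depth pixels under an ideal pinhole model."""
    view_width_cm = 2.0 * distance_cm * math.tan(math.radians(DEPTH_HFOV_DEG / 2.0))
    return BOARD_SIDE_CM * DEPTH_WIDTH_PX / view_width_cm


for d in recording_distances_cm():
    print(f"{d:.0f} cm: board spans about {board_width_px(d):.0f} px")
```

Because the pixel footprint shrinks in inverse proportion to the distance, the number of ROI pixels available for analysis drops quickly over the five series.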
A customized application was developed to capture and record depth from the Microsoft Kinect using Microsoft Kinect SDK version 2.0. The application was developed using Visual Studio 2015, update 3, under the Windows Presentation Foundation (WPF) Application Programming Interface (API). Depth information was stored as lossless 16-bit PNG images at 30 frames per second, and for each distance, the depth images were captured for 10 minutes.
To establish global timing in the setup, a server application was developed to communicate with the Microsoft Kinect recording apps and Qualisys Track Manager using the Qualisys Track Manager Real-Time (QTM-RT) protocol v1.12. Simultaneous recording was ensured using TCP/IP command control through the network, and the latency between recorders was kept below 4ms.
Since a small temperature drift in the Kinect sensor has been reported [48], Kinect v2 was turned on 30 minutes prior to the recordings. In addition, the ambient room temperature was controlled (25 °C), and the motion lab was lit by fluorescent light only.

Test protocol
In this study, the impact of the motion capture system on the Microsoft Kinect v2 depth recordings was investigated by using a flat surface (ROI). Depth information of the ROI was captured and stored using a Microsoft Kinect v2 sensor.
To examine the mutual interference, we assumed the recordings referred to a flat surface in the space (i.e., the bulletin board surface). Therefore, the following assumptions were considered, and the employed approach should satisfy them.
1. The smoothness of the bulletin board surface was stable during the recording (the bulletin board has a rigid surface);
2. The board might have swung slowly, and the swing rhythm was unpredictable. Therefore, the calculation should be independent of movements of the bulletin board;
3. The accuracy of the Qualisys motion capture system did not change during the experiments.

Consequently, we assumed an optimal plane that fitted the point cloud on the bulletin board surface. The optimal plane was estimated by calculating the average position of the point cloud P_{n×3} and assuming the average position satisfied the plane equation. Therefore, the normal vector of the optimal plane was calculated based on (1).
where N_{1×3} is the normal vector of the optimal plane and M is the average position of the point cloud. In this equation, the point cloud in the Kinect records was the depth map, and similarly, an optimal plane can be introduced by the reflective marker positions in the Qualisys records. Fig 4 represents an optimal plane based on a single Kinect v2 point cloud.
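As an illustration of fitting a plane through the average position of a point cloud, the sketch below uses an ordinary least-squares fit of depth as a linear function of x and y. This is a swapped-in, simplified technique that minimizes residuals along the depth axis and is not necessarily the formulation of equation (1); the function names are hypothetical.

```python
import math


def fit_plane(points):
    """Least-squares plane through a point cloud of (x, y, z) tuples.

    Returns (normal, centroid): a unit normal vector N and the average
    position M, so that N . (p - M) = 0 for points p on the fitted plane.
    Assumption: residuals are minimized along z (depth) only.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    mz = sum(p[2] for p in points) / n
    # Centered second moments for the 2x2 normal equations.
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    sxz = sum((p[0] - mx) * (p[2] - mz) for p in points)
    syz = sum((p[1] - my) * (p[2] - mz) for p in points)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("points are collinear in x/y; plane is undefined")
    a = (sxz * syy - syz * sxy) / det  # slope of z along x
    b = (syz * sxx - sxz * sxy) / det  # slope of z along y
    norm = math.sqrt(a * a + b * b + 1.0)
    return (a / norm, b / norm, -1.0 / norm), (mx, my, mz)


def plane_depth(normal, centroid, x, y):
    """Expected depth of the fitted plane at pixel position (x, y)."""
    nx, ny, nz = normal
    mx, my, mz = centroid
    return mz - (nx * (x - mx) + ny * (y - my)) / nz


# Hypothetical noise-free samples of the plane z = 0.1 x - 0.2 y + 150.
cloud = [(x, y, 0.1 * x - 0.2 * y + 150.0) for x in range(5) for y in range(5)]
normal, centroid = fit_plane(cloud)
print(normal, centroid, plane_depth(normal, centroid, 3, 4))
```

Because the fit is recomputed for every frame, slow swinging of the board changes the plane parameters rather than the residuals, which matches the second assumption listed above.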
As can be seen in Fig 4, the acquired bulletin board had some degree of roughness, which might be due to the real roughness of the bulletin board surface or to the resolution of the Kinect. Hence, the residual value for each depth frame (i) in each pixel position (x,y) was defined by calculating Euclidean distance between recorded depth and estimated depth based on the optimal plane.
In this equation, D_i(x,y) represents the recorded depth at position (x,y), while D_i^ref(x,y) stands for the expected depth from the optimal plane in record frame i. Therefore, the roughness of the region of interest was estimated by calculating the root mean square (RMS) of the residual values in each frame. Hence, for each frame, the residual root mean square (RRMS_Frame) was calculated according to (3).
where m, n, x_0 and y_0 limit the calculations to the area of interest. The surface roughness was assessed along with an estimate of entropy; these estimates were used to measure the impact of the noise sources on the region of interest. In addition to the roughness of each frame, the roughness at each pixel position was also estimated. Accordingly, the residual root mean square value RRMS_Pixel was calculated based on (4).

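$$\mathrm{RRMS}_{\mathrm{Pixel}}(x,y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(D_i(x,y) - D_i^{\mathrm{ref}}(x,y)\right)^2} \qquad (4)$$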
where n is the number of images in each record, and x and y denote the position of the pixel in the region of interest. Entropy was estimated using (5), where p_{x,y} is the probability of the observed value at pixel (x,y).
To simplify the possible noise sources, the interference was divided into passive and active distortion. The near-infrared laser projector of the Kinect sensor and the projector of each camera in the motion capture system were considered active sources of distortion, and the retroreflective markers were categorized as a passive noise source.
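The frame-wise and pixel-wise roughness measures and the pixel-wise entropy described in this section can be sketched as follows. The sketch assumes that each depth frame and its plane-based reference are given as nested lists indexed as frame[y][x] over the ROI only, and that entropy is the Shannon entropy in bits of the depth values observed at a pixel across frames; the logarithm base and all function names are assumptions.

```python
import math
from collections import Counter


def rrms_frame(depth, ref):
    """RMS of the residuals (recorded minus expected depth) over one frame's ROI."""
    squared = [(d - r) ** 2
               for d_row, r_row in zip(depth, ref)
               for d, r in zip(d_row, r_row)]
    return math.sqrt(sum(squared) / len(squared))


def rrms_pixel(frames, refs):
    """RMS of the residuals over all frames, computed separately for each pixel."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[math.sqrt(sum((frames[i][y][x] - refs[i][y][x]) ** 2
                           for i in range(n)) / n)
             for x in range(cols)]
            for y in range(rows)]


def pixel_entropy(frames, x, y):
    """Shannon entropy (bits) of the depth values seen at pixel (x, y)."""
    counts = Counter(frame[y][x] for frame in frames)
    n = len(frames)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical two-frame record of a 1x2 pixel ROI (depth in mm).
frames = [[[100, 102]], [[100, 98]]]
refs = [[[100, 100]], [[100, 100]]]
print(rrms_frame(frames[0], refs[0]))
print(rrms_pixel(frames, refs))
print(pixel_entropy(frames, 1, 0))
```

A pixel whose depth value never changes has zero entropy, so higher entropy in the presence of the motion capture system indicates that more distinct depth values were observed at that pixel.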

Residual root mean squares (RRMS)
The statistical analysis of the RRMS_Frame values indicated that the calculated values were not normally distributed (Shapiro-Wilk normality test, p < 0.05). Consequently, the median and interquartile range (IQR) of the calculated RRMS_Frame in the absence and presence of the Qualisys system as a noise source are shown in Table 2. In addition, each pair of recordings was compared using a two-sided Wilcoxon test.
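Summarizing non-normally distributed values by median and IQR can be sketched with the Python standard library. The quartile interpolation method and the sample values are assumptions for illustration only, and the Wilcoxon test itself is not implemented here.

```python
import statistics


def median_iqr(values):
    """Return (median, interquartile range) of a sequence of numbers.

    Assumption: quartiles use the 'inclusive' interpolation method; other
    conventions give slightly different IQR values for small samples.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1


# Hypothetical per-frame roughness values (mm), not data from this study.
without_mocap = [1.0, 2.0, 3.0, 4.0, 5.0]
with_mocap = [2.0, 3.5, 4.0, 6.0, 9.0]

print("absence :", median_iqr(without_mocap))
print("presence:", median_iqr(with_mocap))
```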
The corresponding estimation of the measured RRMS_Frame was obtained using a first-order Fourier estimator and is shown in Fig 5. Bland-Altman analysis was employed to evaluate the impact of the motion capture system on the estimated RRMS_Frame by comparing the measurements in the presence and absence of Qualisys. The bias and the corresponding limits of agreement of RRMS_Frame were calculated for each configuration. The Bland-Altman analysis revealed that the bias and limits of agreement had an increasing trend with increasing distance to the surface, as shown in Table 3.

Accordingly, RRMS_Pixel was calculated based on (4) for each pixel in the Kinect depth records. Fig 6 shows that Qualisys increased RRMS_Pixel in the Kinect v2 records. As expected, the size of the ROI (in terms of the number of pixels) decreased with increasing distance from the bulletin board.
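The Bland-Altman bias and limits of agreement used in this section can be sketched as below, using the conventional definition of the limits as the mean difference plus or minus 1.96 standard deviations of the differences. The pairing of values and the sample numbers are hypothetical.

```python
import statistics


def bland_altman(with_mocap, without_mocap):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias   = mean of the paired differences (with minus without)
    limits = bias +/- 1.96 * sample standard deviation of the differences
    """
    diffs = [a - b for a, b in zip(with_mocap, without_mocap)]
    bias = statistics.mean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical paired roughness values (mm), not data from this study.
presence = [2.0, 3.0, 4.0, 5.0]
absence = [1.0, 1.0, 2.0, 2.0]
print(bland_altman(presence, absence))
```

A positive bias means the roughness is higher with the motion capture system running, and wider limits of agreement indicate a more variable difference between the two conditions.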

Entropy
The pixel-wise entropy of the recorded depth images was estimated based on (5). Fig 7 compares the pixel-wise entropy of the Kinect v2 depth records in the absence and presence of Qualisys as a noise source at 5 different distances.

Retroreflective markers and LED strobes
The results showed that Kinect v2 was also sensitive to the light reflected from the retroreflective markers and emitted by the Qualisys cameras (see Figs 1A, 8 and 9). Figs 8 and 9 show the impact of the reflective markers and Qualisys cameras on IR and depth images. As can be seen in the figures, the reflective markers appear as bright spots, and the Qualisys cameras are also surrounded by an aureole of bright dots. Surprisingly, in the depth image, the reflective markers appear as black holes of unknown distance. Although the reflective balls were not part of the ROI (Fig 3), the impact of the reflective markers was investigated by assessing the heterogeneity of the pixels at each corner. The assessment did not show any significant changes due to the presence of reflective markers near the ROI.

Discussion
This study evaluated the impact of a marker-based motion capture system on the Microsoft Kinect v2 sensor at five different distances. The interference was estimated based on raw depth images captured using Microsoft Kinect v2. It was hypothesized that by placing the bulletin board and Qualisys cameras in a steady position, the amount of light projected onto the bulletin board surface would be constant. Accordingly, the contribution of the motion capture system as a noise source would be the same at all the measured distances.
The results showed that Kinect v2 recordings are sensitive to the presence of the Qualisys cameras and that the cameras might have a destructive impact on Kinect v2 measurements and postprocessing calculations. Consequently, a certain degree of uncertainty might be imposed on Kinect v2 measurements. To the best of our knowledge, none of the previous studies have reported Kinect v2 depth information distortion due to a motion capture system and retroreflective markers.
Fig 5 shows that the sensitivity of the Microsoft Kinect rises with increasing distance from the ROI and that the distortion is nonlinear with respect to the distance.
The pixel-wise RRMS and entropy also indicate that the motion capture system has a considerable impact on the Microsoft Kinect recordings and that this impact increases with distance (see Figs 6 and 7).
In Figs 6 and 7, lens distortion can clearly be seen in the depth recording when the Kinect sensor was placed 120cm from the ROI. With increasing distance, the lens distortion gradually disappeared. Hence, it may be concluded that the Qualisys motion capture system has a clear impact on the Kinect v2 depth estimation. This interference distorts the depth maps continuously and with a high variation.
Retroreflective markers were also a source of passive noise, as depicted in Fig 9. Retroreflective markers appeared as bright balls in the Kinect v2 IR images, resulting in areas of unknown distance (depth) in the depth map. The figure reveals not only areas of unknown distance at the marker positions, but also wrong distance values surrounding the silhouettes of the marker edges.
It can be concluded that two factors are involved in a motion capture system interfering with Microsoft Kinect recordings. First, the light emitted from the motion capture system should be within the range of the Microsoft Kinect sensor sensitivity. Secondly, the intensity of distortion in Kinect measurements might be highly dependent on the depth image reconstruction approach. Fig 2 indicates that Qualisys cameras project a wider spectrum of NIR light (from 800nm to beyond 900nm) than Kinect v2, while Kinect v1 and Kinect v2 emit narrowband NIR light centered at 825nm and 850nm, respectively. Therefore, we may conclude that the Kinect sensor is sensitive to the light projected from Qualisys cameras. This satisfies the first requirement for interfering with Kinect recordings.
Regarding the results, we can conclude that ToF-based depth sensors (Kinect v2) are potentially sensitive to the motion capture system. Qualisys utilizes a fast time division between the LED strobes mounted on each camera, which might be the primary cause of interference with the intensity modulation in Kinect v2.
This study had three limitations. First, only depth data of Microsoft Kinect v2 were analyzed to investigate the contribution of possible noise to the recordings; however, the Microsoft Kinect skeleton algorithm might also be affected by the motion capture system. Second, only a steady and flat surface was chosen to evaluate the noise impact on Microsoft Kinect v2 depth data recordings. Finally, due to the lack of detailed information about the Microsoft Kinect v2 working principle, we were not able to provide a model for the noise observed in the presence of the motion capture system.
In the current study, only the impact of the Qualisys motion capture system (as a marker-based motion capture system) on Kinect sensor recordings was assessed. Other motion capture systems might have a different destructive impact. In theory, all marker-based motion capture systems might interfere with Microsoft Kinect v2 recordings. Therefore, it is recommended that Kinect v2 sensitivity to the presence of motion capture systems be investigated before both systems are used together. It could be hypothesized that the actual validity and reliability of Microsoft Kinect v2 (depth data and, accordingly, the estimated skeleton) might be higher than the values reported in previous studies, due to the active and passive noise from the motion capture systems.

Conclusions
The influence of passive and active noise sources on depth assessments by the Microsoft Kinect sensor and the Qualisys motion capture system was evaluated. The findings indicate that the Kinect v2 sensor is not only affected by the Qualisys motion capture system but also that the presence of retroreflective markers plays an essential role in producing misleading Kinect v2 measurements.