MoVi: A Large Multipurpose Motion and Video Dataset

Human movements are both an area of intense study and the basis of many applications such as character animation. For many applications, it is crucial to identify movements in videos or to analyze datasets of movements. Here we introduce a new human Motion and Video dataset, MoVi, which we make publicly available. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMUs). In some of the capture rounds, the actors wore natural clothing; in the others, they wore minimal clothing. In total, our dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data. In this paper, we describe how the dataset was collected and post-processed, and we present state-of-the-art estimates of the skeletal motions and of the full-body shape deformations associated with them. We also discuss examples of potential studies this dataset could enable.

While there are many publicly available datasets of human motion recordings [49,35,12,56,25], they are limited in that they either contain data from a small number of actors, use a single hardware system for motion recording, or provide unsynchronized data across different hardware systems. We overcome these limitations with our large Motion and Video dataset (MoVi), which contains five different subsets of synchronized and calibrated video, optical motion capture (MoCap), and inertial measurement unit (IMU) data of 90 female and male actors performing a set of 20 predefined everyday actions and sports movements, and one self-chosen movement. MoVi is a multi-purpose human video and motion dataset designed for a variety of challenges such as human pose estimation, action recognition, motion modelling, gait analysis, and body shape reconstruction. To our knowledge, this is one of the largest datasets in terms of the number of recorded actors and performed actions.
The 3D ground-truth skeletal pose in MoVi was computed using two different pipelines: V3D (biomechanics formulation) [1] and MoSh++ (regression model) [34]. This allows a comparison of the two formulations and provides more options for the computed pose, depending on the tasks and challenges at hand (see Section 2.5). MoVi is also part of the Archive of Motion Capture as Surface Shapes (AMASS) [34], available at https://amass.is.tue.mpg.de/. The AMASS approach makes it possible to estimate an accurate body shape that is factorized into individual, pose-independent shape components and pose-dependent components for every single frame of the MoCap recordings. The resulting animated 3D meshes can be aligned with the camera coordinate system and treated as ground-truth 3D body shapes (Figure 1).

Summary of the Data
MoVi contains data from 90 subjects (60 female, 30 male; 5 left-handed) performing the same predefined set of 20 actions and one self-chosen movement in five rounds of data capturing. All subjects were recruited from the local Kingston community. Descriptive statistics of all participants are shown in Table 1. Participants provided written informed consent. The experimental procedure was approved by the ethics committee of Queen's University, Kingston, and was performed in accordance with the Declaration of Helsinki.
The actors performed the same predefined set of 20 movements in a randomised order in five data capturing sequences; these movements included, among others, (18) Pretending to take a picture, (19) Pretending to talk on the phone, and (20) Pretending to check one's watch. In each of the five sequences, the actors additionally performed one self-chosen motion (21). The five sequences of data capturing differed in the hardware systems used to capture the motions, in the participants' clothing (minimal or normal), and in whether or not there was a rest pose between successive motions. An overview of the different capture rounds is provided in Table 2; technical details of the hardware systems are provided in Table 3.
Data capture sequence "F" was captured using the 67-marker MoCap layout suggested in MoSh [32]. Actors wore tight-fitting minimal clothing in order to minimize marker movement relative to the body. The markers were attached to the actors' skin and clothes using double-sided tape. The MoCap system was synchronized with two video cameras capturing the actions from different viewpoints (front and side). These two cameras were calibrated by computing their translation and rotation relative to the coordinate system of the MoCap system. Two hand-held cellphone cameras were additionally used; however, their recordings were neither synchronized nor calibrated against the MoCap system. The different actions were separated by a rest A-pose. In our dataset, we provide the unedited full sequence of all motions, as well as trimmed MoCap and video sequences of the single motions. Our motivation for this capture round was to obtain accurate full skeletal (pose) information and frame-by-frame body shape parameters without any artifacts imposed by clothing. This round is therefore particularly suitable for 2D or 3D pose estimation and tracking, and for 3D shape reconstruction. The data collected in "F" was processed using two different pipelines: MoSh++ [34] and V3D [1] (see 2.5). Example images of a female and male actor in rest pose are shown in Figure 1.
To achieve more natural-looking capture data, we recorded four more capture rounds in which the actors wore normal clothing. Data capture rounds "S1" and "S2" were captured with a sparse set of 12 MoCap markers (4 markers placed on the head, 2 on each ankle, and 2 on each wrist), which allowed the participants to wear normal clothing. With these markers attached, we could accurately track the main end-effectors, including the head, wrists, and ankles. The markers further allowed us to synchronize the IMU data with the MoCap and video capture systems (Section 2.3). The actions were additionally recorded using synchronized computer vision cameras, cellphone cameras, and an IMU system. Whereas a rest A-pose separated the actions in "S1", there was a natural transition between the different actions in "S2". This setup makes it possible to infer pose and body shape by fusing a sparse marker set with IMU recordings while keeping the clothing natural.
While real clothing is essential for naturalistic data, it precludes the use of certain motion capture techniques. The data capture rounds "I1" and "I2" were thus captured using only IMUs and video cameras (not synchronized). Motions in "I1" are separated by a rest A-pose, whereas there is a natural transition between the different actions in "I2". The data collected in "I1" and "I2" is suitable for researchers who aim to compute pose or body shape without any artifacts imposed by optical markers. Examples of the IMU suit used for "S1", "S2", "I1", and "I2" are shown in Figure 3. These recordings thus promise to enable a broad range of real-world applications.

Hardware
The movements were captured using two different hardware systems: an optical motion capture system and an inertial measurement unit system. We used a commercial optical motion capture system from Qualisys with fifteen 1.3 MP cameras that provide the 3D locations of passive reflective markers at a frame rate of 120 frames per second. For the IMU system, we used the Noitom Neuron Edition V2, which comes as a suit with 18 attached IMU sensors. Each sensor is composed of a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis magnetometer, sampling at 120 fps. In addition to the global acceleration data, the IMU suit provides 3D displacements, speeds, quaternions, and rotational speeds for each joint. Different hardware is useful for different research questions: IMUs, for example, directly measure acceleration, whereas optical motion capture provides high positional precision. The use of different, complementary hardware systems promises to allow a broad range of meaningful analyses.
Video data was collected using two different types of cameras: smartphone cameras and computer vision cameras. We used two hand-held iPhone 7 smartphone cameras with an 800 × 600 resolution, global shutter, and 30 fps. As opposed to the computer vision cameras, the footage obtained with these smartphone cameras is shaky due to natural arm and hand movements. The video quality is similar to what the majority of commercially available smartphone cameras provide to date. For the two computer vision cameras, we used Grasshopper cameras from FLIR Inc. with 800 × 600 Sony ICX285 CCD sensors. The recordings of the FLIR cameras are synchronized with the MoCap cameras at 30 fps (aligned with every fourth frame of the MoCap system). Detailed information on the hardware used is provided in Table 3. Figure 2 shows the top-view floor plan and the locations of the MoCap and video cameras. The device synchronization process is described in Section 2.3; the camera calibration is described in Section 2.4.

MoCap and Video
To provide a frame-accurate 3D motion overlay on the video footage, the motion capture system has to be synchronized with the cameras frame-by-frame and then calibrated to the same coordinate system. The synchronization between the motion capture cameras and the FLIR Grasshopper video cameras was done in hardware. In our setup, the video cameras were triggered by the synchronization signal provided by the MoCap system. Due to the frame rate limits of the video cameras, the synchronization frequency was divided by 4, which reduced the video capture frame rate to 30 fps. The phone cameras were not synchronized with the motion capture cameras.
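The resulting 4:1 frame correspondence can be sketched as follows; the arrays are toy data, not the dataset's actual format:

```python
import numpy as np

# Toy MoCap stream at 120 fps: one scalar channel, 480 frames = 4 s.
mocap = np.arange(480)

# The video cameras were triggered on every 4th MoCap sync pulse,
# so video frame i corresponds to MoCap frame 4*i (120 / 4 = 30 fps).
video_to_mocap = np.arange(0, len(mocap), 4)
mocap_at_video_rate = mocap[video_to_mocap]

assert len(mocap_at_video_rate) == 120  # 30 fps over 4 s
assert mocap_at_video_rate[1] == 4      # video frame 1 <-> MoCap frame 4
```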

IMU
In rounds "S1" and "S2", we used a reduced optical marker set layout with 12 markers. Although the main motivation for using this reduced marker set was to allow the actors to wear natural clothing, the small set of markers offers several advantages: 1) It provides sparse but accurate data for some of the main joints (head, wrists, and ankles). These data can be used in a data fusion approach along with IMU data to infer the exact joint locations; we leave this to future work. 2) It allowed us to synchronize the IMU and MoCap data. To synchronize the data, we used cross-correlation between these two modalities. The two coordinate systems were not aligned; however, the difference between the orientations of the two z axes is negligible: the z axis of the IMU coordinate system is oriented along gravity, while the z axis of the MoCap coordinate system is perpendicular to the floor. Because the MoCap system was synchronized with the video cameras (Section 2.3), we additionally obtained synchronized IMU and video data.
Suppose $v_z^j(t)$ and $\tilde{v}_z^j(t)$ are the z components of the tracked position of joint $j$ recovered by the motion capture and IMU systems, respectively (we use the 3D positions provided by the IMU software instead of double-integrating the accelerations). The synchronization parameters, temporal scale $\alpha$ and temporal shift $\beta$, are found by maximizing

$$(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} \; C^j(\alpha, \beta),$$

where

$$C^j(\alpha, \beta) = \sum_t v_z^j(t)\, \tilde{v}_z^j(\alpha t + \beta)$$

is the cross-correlation between $v_z^j(t)$ and the shifted-and-scaled version of $\tilde{v}_z^j(t)$. The optimal parameters are those which achieve the highest peak in the cross-correlation.
The procedure described above was applied to the left and right ankles and checked qualitatively for all data.
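A minimal version of this peak-picking synchronization, with the temporal scale fixed at 1 for brevity and synthetic z-trajectories standing in for the real recordings, could look like:

```python
import numpy as np

def imu_delay(mocap_z, imu_z):
    """Estimate how many frames the IMU z-trajectory lags the MoCap
    z-trajectory via the peak of the cross-correlation.  The temporal
    scale alpha is fixed at 1 here; the full procedure also searches
    over a scale to account for clock drift."""
    a = mocap_z - mocap_z.mean()
    b = imu_z - imu_z.mean()
    xcorr = np.correlate(a, b, mode="full")
    # Convert the peak index into a lag in frames.
    return (len(b) - 1) - int(np.argmax(xcorr))

# Toy z-trajectories: one bump-like movement, with the IMU copy
# delayed by 25 frames (about 0.2 s at 120 fps).
t = np.arange(1200)
mocap_z = np.exp(-0.5 * ((t - 300) / 20.0) ** 2)
imu_z = np.exp(-0.5 * ((t - 325) / 20.0) ** 2)
print(imu_delay(mocap_z, imu_z))  # -> 25
```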

Calibration
The calibration of the MoCap cameras was done using a measurement procedure in the Qualisys Track Manager software [46]. The software computes the orientation and position of each camera in order to track markers and reconstruct 3D data from the 2D data. To compute the computer vision cameras' intrinsics and lens distortion parameters, we used the MATLAB Single Camera Calibrator [21,61,8], where the focal length (F ∈ R²), optical center (C ∈ R²), skew coefficient (S ∈ R), and radial distortion (D ∈ R²) are estimated for each camera. The extrinsic parameters, which represent the rotation R ∈ SO(3) and translation T ∈ R³ transformations from world coordinates (the MoCap coordinate system) to camera coordinates, are estimated using the semi-automated method proposed by Sigal et al. [49]. The trajectory of a single moving marker was recorded by synchronized MoCap and video cameras for around 2000 synchronized frames. Given the recorded 3D positions of the marker in MoCap coordinates as world points and the 2D positions of the marker in the camera frame as image points, the problem of finding the best 2D projection can be formulated as a Perspective-n-Point (PnP) problem, where the Perspective-Three-Point (P3P) algorithm [18] is used to minimize the re-projection error as follows:

$$(R^*, T^*) = \arg\min_{R, T} \sum_i \left\| x_i - f(X_i; K, R, T) \right\|^2,$$

where $f$ is the projection function, $x_i$ and $X_i$ are the image and world points, and $K \in \{F, C, S, D\}$ is the set of camera intrinsic parameters.

Figure 3: Example pictures of one female and one male actor wearing the IMU suits used for the capture rounds S1, S2, I1, and I2.
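The projection function f and the re-projection error it induces can be sketched as below. The intrinsics, pose, and marker trajectory are synthetic, and lens distortion and the P3P solver itself are omitted; this only illustrates the quantity the extrinsic calibration minimizes:

```python
import numpy as np

def project(X_world, K, R, T):
    """Pinhole projection f: map 3D world (MoCap-frame) points into
    pixel coordinates using intrinsics K and extrinsics (R, T).
    Lens distortion is omitted for brevity."""
    X_cam = X_world @ R.T + T          # world -> camera coordinates
    x = X_cam[:, :2] / X_cam[:, 2:3]   # perspective division
    return x @ K[:2, :2].T + K[:2, 2]  # apply focal length and center

def reprojection_error(x_img, X_world, K, R, T):
    """Mean Euclidean pixel error minimized during extrinsic calibration."""
    return np.linalg.norm(project(X_world, K, R, T) - x_img, axis=1).mean()

# Hypothetical intrinsics for an 800 x 600 sensor, and a ground-truth pose.
K = np.array([[700.0,   0.0, 400.0],
              [  0.0, 700.0, 300.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
T = np.array([0.0, 0.0, 3.0])

# Simulated moving-marker trajectory (~2000 frames) in world coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 3))
x = project(X, K, R, T)

# With the true extrinsics, the re-projection error vanishes.
print(reprojection_error(x, X, K, R, T))  # -> 0.0
```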

Skeleton and Body Shape Extraction from MoCap Data
The skeleton (joint locations and bones) was computed with two different pipelines.

Visual3D (manufacturer C-Motion): Visual3D is an advanced biomechanics analysis software package for 3D motion capture data [1]. In our V3D pipeline, the pelvic segment was created using CODA [2], and the hip joint positions were estimated using the Bell and Brand hip joint center regression [5,6]. The upper body parts were estimated using the Golem/Plug-in Gait Upper Extremity model as implemented in Vicon [1]. The skeleton is represented by 20 joints in two different formats: 1) as joint angles, that is, the angle of each bone relative to the coordinate system of its parent joint, and 2) as global 3D joint locations.

MoSh++: MoSh++ [34] is an approach that estimates body shape, pose, and soft-tissue deformation directly from motion capture data. Body shape and pose are represented using a rigged body model called SMPL [33], where the pose is defined by joint angles and the shape is specified by shape blend shapes. MoSh++ uses a generative inference approach whereby the SMPL body shape and pose parameters are optimized to minimize reconstruction errors. The skeletal joint locations are computed using a linear regression function of the mesh vertices. The estimated SMPL body is extended by adding dynamic blend shapes using the dynamic shape space of DMPL. Each frame in the "MoShed" representation includes 16 SMPL shape coefficients, 8 DMPL dynamic soft-tissue coefficients, and 66 SMPL pose coefficients as joint angles (21 joints + 1 root). The MoShed data was computed in collaboration with the authors of AMASS [34].
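The per-frame parameter counts described above (66 pose, 16 shape, and 8 DMPL coefficients) can be illustrated with a dummy frame; the flat concatenation order below is purely illustrative, not the layout of the released files, which store the parameter groups separately:

```python
import numpy as np

# One hypothetical "MoShed" frame: 66 pose + 16 shape + 8 DMPL values.
frame = np.zeros(66 + 16 + 8)

pose = frame[:66].reshape(22, 3)  # axis-angle per joint (21 joints + 1 root)
betas = frame[66:82]              # 16 SMPL shape coefficients
dmpls = frame[82:90]              # 8 DMPL soft-tissue coefficients

assert pose.shape == (22, 3)
assert betas.size == 16 and dmpls.size == 8
```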
The main difference between the skeleton computed by MoSh++ and the skeleton computed by V3D is that the MoShed version is generally more robust to occlusion, because it uses distributed information rather than performing the computations locally. This makes it a better choice for pose estimation and tracking, as all joint locations are available over time. However, the estimated joint locations can be noisy during occlusions, and the error may propagate to other joints. V3D, on the other hand, provides a more accurate estimation of joint locations, so one may prefer the V3D joint representation for gait analysis. The only drawback of the V3D representation compared to MoSh++ is that joints cannot be computed when a related marker is occluded. Our dataset is the first sizable dataset that includes not only 3D joint locations but also a highly accurate 3D mesh of the body that can be projected onto the video recordings. This can be useful for approaches that try to estimate body shape from video data.
F_amass_<subject_ID>.mat — Contains the full marker set MoCap data processed by MoSh++ in the AMASS project and augmented with 3D joint positions and metadata. All files are compressed and stored as F_AMASS.rar. The original npz files and the rendered animation files are available at https://amass.is.tue.mpg.de/

<round>_v3d_<subject_ID>.mat — Contains the MoCap data processed by V3D and augmented with metadata. All files are compressed as subject_1_45_F_V3D.rar, which contains "F" round data from subjects 1 to 45; subject_46_90_F_V3D.rar, which contains "F" round data from subjects 46 to 90; and S_V3D.rar, which contains "S1" and "S2" round data from all subjects.

Dataset structure
We used the Dataverse repository to store the motion and video data. We provide the original AVI video files to avoid any artifacts added by compression methods. The processed MoCap data is provided in two different versions based on the post-processing pipeline (AMASS and V3D). We provide joint angles and 3D joint locations computed by both pipelines, along with the associated kinematic tree, occlusions, and optical marker data. Synchronized IMU data (along with the original data) are computed by processing calculation files (see Section 2.3) and converted to .mat format, providing raw acceleration data, displacements, velocities, quaternions, and angular velocities. The BVH files generated by the IMU software are also provided on the website. Support code for using the data in MATLAB and Python environments is also provided. The dataset naming structure is provided in Table 4.
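Loading such .mat files in Python could look roughly like the round-trip sketch below; the field names (`joints3d`, `frame_rate`) and shapes are hypothetical, not the dataset's actual schema:

```python
import numpy as np
from scipy.io import loadmat, savemat

# Write a small stand-in .mat file so the example is self-contained.
# Field names here are hypothetical, not the dataset's real schema.
motion = {"joints3d": np.zeros((120, 20, 3)),  # frames x joints x xyz
          "frame_rate": 120.0}
savemat("example_subject.mat", {"motion": motion})

# simplify_cells=True turns MATLAB structs into plain Python dicts.
data = loadmat("example_subject.mat", simplify_cells=True)
joints = data["motion"]["joints3d"]
print(joints.shape)  # -> (120, 20, 3)
```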

Discussion and Conclusion
We provide the large Motion and Video dataset MoVi, which is now available at https://www.biomotionlab.ca/movi. The dataset contains motion recordings (optical motion capture, video, and IMU) of 90 male and female actors performing a set of 20 everyday actions and sports motions, and one additional self-chosen motion. The different sequences of the dataset contain synchronized recordings from the three different hardware systems. In addition, our full-body motion capture recordings are available as realistic 3D human meshes represented by a rigged body model as part of the AMASS dataset [34]. This allows a video overlay of not only the body joints but also the full body meshes. To our knowledge, MoVi is the first dataset with synchronized pose, pose-dependent shape, and video recordings. This multi-modality makes our dataset suitable for a wide range of challenges such as human pose estimation and tracking, body shape estimation, human motion prediction and synthesis, action recognition, and gait analysis.