Highlights
- State-of-the-art or near state-of-the-art results in monocular dynamic motion capture.
- Combines learned global motion regression with hand-crafted optimization for in-the-wild setups.
- Publication of thesis findings is ongoing. Code will be released upon publication.
Abstract
Recovering 3D human motion from monocular video sequences poses a significant challenge in computer vision, particularly when the camera itself is in motion. The ambiguity introduced by dynamic recording setups necessitates methods to lift camera-local 3D human motions into a consistent, global world frame. This thesis proposes a novel, modular approach to monocular multi-person motion capture, combining regression techniques and global optimization for enhanced accuracy.
Our pipeline for 3D motion recovery begins with image-based detection to localize multiple human subjects within each frame. We then fit parametric human body models (SMPL) to estimate the subjects’ 3D poses, resulting in camera-local human pose tracks. To recover camera motion, we implement a visual odometry (VO) algorithm. Next, we port a state-of-the-art global motion regression network to initially lift camera-local motions into a fixed world frame. Finally, we apply a global optimization process guided by re-projection quality, motion realism, and motion smoothness to refine the lifted motion estimates within the global 3D world frame.
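To make the local-to-global lifting step concrete, the following minimal sketch shows the geometry involved. It is not the thesis implementation. It replaces the learned regression network with a plain rigid transform driven by per-frame camera poses (such as those a VO algorithm might provide), and it illustrates one plausible form of a motion-smoothness term. The function names, array shapes, and the camera-to-world pose convention are assumptions made for illustration.

```python
import numpy as np


def lift_to_world(joints_cam, R_wc, t_wc):
    """Map camera-local joints into a fixed world frame (hypothetical helper).

    joints_cam: (T, J, 3) joint positions in each frame's camera coordinates.
    R_wc:       (T, 3, 3) camera-to-world rotations (assumed convention).
    t_wc:       (T, 3)    camera positions in the world frame.
    """
    # Rotate every joint by its frame's rotation, then add the camera position.
    return np.einsum("tij,tkj->tki", R_wc, joints_cam) + t_wc[:, None, :]


def smoothness_cost(joints_world):
    """Mean squared second-order finite difference (acceleration) over time."""
    accel = joints_world[2:] - 2.0 * joints_world[1:-1] + joints_world[:-2]
    return float(np.mean(accel ** 2))


# Toy example: a camera rotated 90 degrees about z, moving along world x.
T, J = 4, 2
joints_cam = np.zeros((T, J, 3))
joints_cam[..., 0] = 1.0  # every joint sits at (1, 0, 0) in camera space
R_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
R_wc = np.repeat(R_z90[None], T, axis=0)
t_wc = np.stack([np.arange(T, dtype=float), np.zeros(T), np.zeros(T)], axis=1)

joints_world = lift_to_world(joints_cam, R_wc, t_wc)
cost = smoothness_cost(joints_world)
```

In this toy setup the lifted joints move at constant velocity, so the smoothness term evaluates to zero. A refinement stage of the kind described above would typically combine such a term with re-projection and motion-realism terms in a weighted objective.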
The core contribution of this thesis is to demonstrate the effectiveness of chaining global motion regression with optimization. Ablation studies confirm that this hybrid approach yields superior results compared to using either regression or optimization in isolation. Our experimental results show that the proposed method achieves performance close to the state of the art in SMPL-based human motion recovery.
Keywords
motion recovery, monocular, dynamic, local-to-global motion lifting, optimization