This thesis studies the problem of articulated object reconstruction from an input video. Our focus is on estimating the shape, pose, and part motion of an articulated object during human-object manipulation. The task is challenging because the object changes dynamically and recovering 3D structure from 2D observations is inherently ambiguous. To enable research in this direction, we first create D3D-HOI: a dataset of monocular videos with ground-truth annotations of 3D object shape, pose, and part motion from human-object interaction videos. Our dataset consists of several common categories of articulated objects in diverse real-world scenes, observed from a variety of fixed camera viewpoints. Each manipulated object (e.g., a microwave) is represented by the 3D parametric model that best fits the captured data, and we annotate its size, pose, and part articulation values at every frame. We then propose a novel optimization-based method built on a differentiable renderer and human-object interaction terms, which leverages the estimated human pose to better infer the object's spatial layout and dynamics. We evaluate this approach on our dataset, demonstrating that human-object relations can significantly reduce pose and motion errors on real-world articulated objects. Code and dataset are available at https://github.com/facebookresearch/d3d-hoi.
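To make the optimization idea concrete, here is a minimal sketch, not the thesis implementation: it jointly optimizes an object's global placement and a single hinge angle by gradient descent, combining a silhouette-matching term (a symmetric Chamfer distance standing in for the differentiable-rendering loss) with an interaction term that keeps a grasped handle near an estimated 3D hand joint. All geometry and names here (door, handle_idx, hand_pos, target_pts) are hypothetical placeholders.

```python
import torch

def rot_y(theta: torch.Tensor) -> torch.Tensor:
    # 3x3 rotation about the y-axis, differentiable w.r.t. theta.
    c, s = torch.cos(theta), torch.sin(theta)
    one, zero = torch.ones_like(theta), torch.zeros_like(theta)
    return torch.stack([
        torch.stack([c, zero, s]),
        torch.stack([zero, one, zero]),
        torch.stack([-s, zero, c]),
    ])

# Canonical part geometry: a door panel sampled as a point grid in the
# x-y plane, hinged at x = 0 (purely illustrative, not a CAD model).
u, v = torch.meshgrid(torch.linspace(0, 1, 10),
                      torch.linspace(0, 1, 10), indexing="ij")
door = torch.stack([u.flatten(), v.flatten(),
                    torch.zeros(100)], dim=-1)      # (100, 3)
handle_idx = 95                                     # point near the free edge

# Hypothetical observations: 2D silhouette sample points and a 3D
# hand-joint position from an off-the-shelf human pose estimator.
with torch.no_grad():
    gt = door @ rot_y(torch.tensor(0.7)).T + torch.tensor([0.2, 0.0, 0.5])
    target_pts = gt[:, :2]                          # orthographic projection
    hand_pos = gt[handle_idx]                       # hand grasps the handle

# Parameters to optimize: global translation and the part (hinge) angle.
trans = torch.zeros(3, requires_grad=True)
angle = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([trans, angle], lr=0.05)

for step in range(300):
    opt.zero_grad()
    pts3d = door @ rot_y(angle[0]).T + trans        # articulate, then place
    proj = pts3d[:, :2]                             # project to image plane
    # Silhouette term: symmetric Chamfer distance to the observed mask
    # points (stands in for the differentiable-rendering loss).
    d = torch.cdist(proj, target_pts)
    sil = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
    # Interaction term: the grasped handle should stay near the hand joint.
    inter = (pts3d[handle_idx] - hand_pos).pow(2).sum()
    loss = sil + 0.1 * inter
    loss.backward()
    opt.step()
```

The interaction term illustrates the key intuition from the abstract: the human pose constrains where the manipulated part can be, which disambiguates depth and articulation values that a silhouette loss alone cannot resolve.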
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Yasutaka Furukawa