Because people account for a large proportion of our daily visual content, understanding human images is an important problem in computer vision, with applications ranging from human-computer interaction and surveillance to telecommunication and entertainment. In this thesis, our goal is to teach machines to reconstruct dense human representations, in terms of both geometry and semantics, from a single RGB image. In particular, we frame this goal as two fundamental problems in computer vision: depth estimation and semantic parsing. In monocular depth estimation, deep learning methods have recently achieved great success for rigid objects and static scenes. However, estimating non-rigid surfaces such as the clothed human body remains challenging. Because of the large diversity of human poses, shapes, and clothing types, previous works tend to predict only a coarse human surface from a single image and usually fail to generalize to unseen poses and clothing. To overcome these limitations, we propose a novel neural network that benefits from (1) leveraging 2D/3D pose estimation and part segmentation as intermediate supervision, (2) disentangling the full human depth into a smooth base depth and a detail residual depth that are regressed separately, and (3) jointly optimizing normals and depth in the network. In addition, the heavy demand for large-scale ground-truth data in learning human depth is another critical problem. To reduce the cost of collecting training data, we introduce a self-supervised method for learning human depth, which minimizes a photo-consistency loss between video frames. The method can be applied to train the neural network on in-the-wild videos without known depth, which makes training data collection much simpler and improves the generalizability of the algorithm. In human semantic parsing, prior art relies heavily on ground-truth data, which requires expensive and laborious semantic annotation.
By comparison, acquiring human geometry is an easier task given depth sensors and synthetic 3D assets. In this thesis, we present a deep learning framework that extracts pixel-wise embeddings of a person from an image. The embeddings can be used to build dense semantic correspondences between humans. We propose to train the model on data without ground-truth annotations. The idea is to leverage the geometric properties of the human body by training the neural network to map human pixels to a feature space in which feature distances reflect geodesic distances on the human surface. We design novel loss functions for this training. Extensive experiments show that the learned embeddings produce accurate correspondences between the same or different persons and generalize well.
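The idea that feature distances should reflect geodesic distances can be illustrated with a minimal sketch. This is not the loss used in the thesis, whose exact form the abstract does not give; it is one simple possibility, assuming a precomputed matrix of pairwise geodesic distances for a set of sampled body pixels. The names `pairwise_distances` and `geodesic_embedding_loss` are hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x (shape: n x d)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def geodesic_embedding_loss(features, geodesic):
    """Mean squared gap between feature distances and geodesic distances.

    Illustrative assumption, not the thesis's loss.
    features: (n, d) embeddings of n sampled human pixels.
    geodesic: (n, n) symmetric matrix of geodesic distances on the surface.
    """
    gap = pairwise_distances(features) - geodesic
    return float((gap ** 2).mean())

# Toy data: three points whose geodesic distances equal their 1-D spacing.
features = np.array([[0.0], [1.0], [3.0]])
geodesic = np.array([[0.0, 1.0, 3.0],
                     [1.0, 0.0, 2.0],
                     [3.0, 2.0, 0.0]])

# Embeddings that reproduce the geodesic distances incur no penalty.
loss_matched = geodesic_embedding_loss(features, geodesic)
# Reordered and rescaled embeddings no longer match and are penalized.
loss_mismatched = geodesic_embedding_loss(features[::-1] * 2.0, geodesic)
```

In this toy setting the loss is zero only when the embedding preserves the surface distances, which is the property that allows nearest-neighbour search in feature space to act as a dense correspondence between two images.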
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Tan, Ping