ERIC Number: ED662241
Record Type: Non-Journal
Publication Date: 2024
Pages: 154
Abstractor: As Provided
ISBN: 979-8-3840-9720-4
ISSN: N/A
EISSN: N/A
Learning 3D Human Geometry and Appearance via Sparse Multiview Images
Zhixuan Yu
ProQuest LLC, Ph.D. Dissertation, University of Minnesota
Humans are arguably the most interesting subjects in computer vision. Modeling human 3D geometry from images captured by highly sophisticated production-level camera systems (10-100 cameras with precise calibration) enables a number of applications, e.g., telepresence, virtual try-on, and motion analysis. Despite their production-level quality, such systems are difficult to deploy and extremely costly, which keeps them inaccessible to most people. On the other hand, as smartphones equipped with cameras become an integral part of our everyday lives and capture priceless moments, social videos voluntarily captured by multiple viewers watching the same scene, e.g., friends simultaneously recording a street busker, provide a new form of visual input accessible to everyone. My research question is whether it is possible to model humans from social videos at high quality, as if they were captured by a production-level setup. Enabling this will open a new opportunity to model 3D human geometry from in-the-wild data. The main characteristics of such videos are that they are "sparse multiview" by nature and, in general, not "spatially calibrated." These characteristics pose an unprecedented challenge because existing multiview approaches to 3D reconstruction do not apply: due to the sparse multiview camera setting, the overlap between social cameras is very limited, making 3D photometric matching difficult, and due to the lack of calibration, existing geometric triangulation does not apply. To date, there is no principled way to integrate multiview social images. In order to reconstruct 3D humans from social videos, I leverage the complementary relationship between 3D geometry and learning, in which each can help the other. (1) Multiview geometry → learning (Part 1): I design a framework that can learn dense keypoint mappings (i.e., 
correspondences between human pixels and a canonical 3D body surface, agnostic to identities, views, and poses) from unlabeled sparse multiview images with minimal overlap. The key insight is to leverage multiview geometric consistency as a self-supervisory signal by enforcing the epipolar constraint for corresponding pixels (those mapped to the same location on the 3D body surface) from different views. I demonstrate that the method outperforms existing methods, including non-differentiable bootstrapping, in terms of keypoint accuracy, multiview consistency, and 3D reconstruction accuracy. (2) Learning → 3D geometry (Part 2): I develop a learning-based 3D reconstruction method that integrates visual cues from multiview images without spatial calibration and estimates a unified human 3D geometry. The key idea is to treat the commonly observed human body as a semantic calibration target and utilize pre-learned dense keypoint mappings to semantically align visual features from multiview images on a canonical 3D body surface, where the features are fused to predict 3D human body shape and pose. I demonstrate that this calibration-free multiview fusion method reliably reconstructs 3D body pose and shape, outperforming state-of-the-art single-view methods with post-hoc multiview fusion, particularly in the presence of non-trivial occlusion, and achieving accuracy comparable to multiview methods that require calibration. Given the reconstructed 3D human geometry, I further establish an approach to create geometry-anchored, animatable 3D head avatars with photo-realistic appearance from sparse inputs per user, e.g., just a few selfies casually taken from different views with a smartphone (Part 3). 
The core of this approach is to learn a universal model from a variety of identities across a range of expressions that encodes generic characteristics of animatable head avatars, which can serve as a prior model adaptable to a new subject given only a few images. I demonstrate that this approach produces compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation. To facilitate modeling 3D human geometry and appearance, I create a large multiview dataset of human body expressions (Part 4). 107 synchronized HD cameras are used to capture 772 distinctive subjects, varying in gender, ethnicity, age, and physical condition, performing predefined actions. From the dense multiview image streams, I reconstruct high-fidelity body expressions using 3D mesh models, which allows view-specific appearance to be represented using their canonical atlas. I demonstrate that the dataset is highly effective in learning and reconstructing a complete human model and is complementary to existing datasets of human body expressions with limited views and subjects, such as the MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets. In summary, this thesis presents three closely related methods for learning human geometry and appearance via sparse multiview images. Their inputs and outputs are linked in a chain: 2D dense keypoints → 3D geometry → appearance, the output of one being the input to the next. In addition, it introduces a large multiview dataset of human body expressions to facilitate this goal. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml.]
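The multiview geometric consistency signal described in Part 1 can be illustrated with a small sketch: if two pixels in different views map to the same location on the canonical 3D body surface, they should satisfy the epipolar constraint x2ᵀ F x1 = 0 for the fundamental matrix F relating the two views, and a training loss can penalize deviation from it. The function name and the symmetric-epipolar-distance form below are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def epipolar_residual(F, x1, x2):
    """Symmetric epipolar distance for putative correspondences.

    F  : (3, 3) fundamental matrix mapping view-1 points to epipolar
         lines in view 2
    x1 : (N, 2) pixel coordinates in view 1
    x2 : (N, 2) pixel coordinates in view 2 (pixels mapped to the same
         canonical body-surface locations)

    Returns (N,) distances; near zero for geometrically consistent pairs.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])  # homogeneous coords, (N, 3)
    p2 = np.hstack([x2, ones])  # (N, 3)

    l2 = p1 @ F.T               # epipolar lines in view 2: F @ p1
    l1 = p2 @ F                 # epipolar lines in view 1: F.T @ p2

    # Algebraic residual x2^T F x1, one scalar per correspondence.
    num = np.abs(np.sum(p2 * l2, axis=1))

    # Normalize by the line gradients so the residual becomes the sum of
    # point-to-epipolar-line distances in both images.
    n1 = np.maximum(np.hypot(l1[:, 0], l1[:, 1]), 1e-12)
    n2 = np.maximum(np.hypot(l2[:, 0], l2[:, 1]), 1e-12)
    return num * (1.0 / n1 + 1.0 / n2)
```

A self-supervised loss would average this residual over all pairs of views and all pixels predicted to share a canonical surface location, so that the dense keypoint mapping is trained without manual labels.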
Descriptors: Visual Aids, Photography, Costs, Handheld Devices, Models, Video Technology, Geometry, Human Body, Human Factors Engineering, Gender Differences, Ethnicity, Age Differences, Physical Characteristics, Fidelity, Learning Activities, Self Expression
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A