OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis

ByteDance Inc, National University of Singapore
CVPR 2023

We present OmniAvatar, a controllable 3D-aware image synthesis network with disentangled control over camera pose, head shape and facial expression, including neck and jaw articulations.

Abstract

We present OmniAvatar, a novel geometry-guided 3D head synthesis model trained from in-the-wild unstructured images that is capable of synthesizing diverse identity-preserved 3D heads with compelling dynamic details under fully disentangled control over camera poses, facial expressions, head shapes, and articulated neck and jaw poses. To achieve such a high level of disentangled control, we first explicitly define a novel semantic signed distance function (SDF) around a head geometry (FLAME) conditioned on the control parameters. This semantic SDF allows us to build a differentiable volumetric correspondence map from the observation space to a canonical space disentangled from all the control parameters. We then leverage the 3D-aware GAN framework (EG3D) to synthesize detailed shape and appearance of 3D full heads in the canonical space, followed by a volume rendering step guided by the volumetric correspondence map to produce outputs in the observation space. To ensure control accuracy on the synthesized head shapes and expressions, we introduce a geometry prior loss that conforms to the head SDF and a control loss that conforms to the expression code. Further, we enhance temporal realism with dynamic details conditioned upon varying expressions and joint poses. Both qualitatively and quantitatively, our model synthesizes identity-preserved 3D heads with more compelling dynamic details than state-of-the-art methods. We also provide an ablation study to justify many of our system design choices.
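The control loss mentioned above ties the synthesized image back to its input control codes by re-estimating them from the output and penalizing the deviation. A minimal numpy sketch of that idea follows; the function name and code dimensions are illustrative, not the authors' implementation, and the re-estimated codes would in practice come from a parameter estimator run on the synthesized image.

```python
import numpy as np

# Hedged sketch of the self-supervised control loss: compare the input
# expression code beta and joint pose theta against values beta_hat,
# theta_hat re-estimated from the synthesized image. Dimensions are toy
# placeholders, not those used in the paper.

def control_loss(beta, theta, beta_hat, theta_hat):
    """L2 penalty committing the synthesized image to its control codes."""
    return np.sum((beta_hat - beta) ** 2) + np.sum((theta_hat - theta) ** 2)

beta = np.zeros(10)   # toy expression code
theta = np.zeros(6)   # toy jaw/neck joint pose
loss = control_loss(beta, theta, beta + 0.1, theta)
```

A perfect re-estimation drives the loss to zero, so minimizing it pushes the generator to honor the requested expression and pose.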

Pipeline


Stage I: Trained from parameterized FLAME mesh collections, an MLP network W maps a shape α, expression β, and articulated jaw and neck pose θ into 3D point-to-point volumetric correspondences from observation to canonical space, together with a signed distance function of the corresponding FLAME head.

Stage II: Given a Gaussian latent code z, our model generates a tri-plane-represented 3D feature space of a canonical head, disentangled from shape and expression controls. The volume rendering is then guided by the volumetric correspondence field to map the decoded neural radiance field from the canonical to the observation space. We condition the NeRF decoding on expression and joint pose for modeling dynamic details. A super-resolution module synthesizes the final high-resolution RGB image from the volume-rendered feature map. For fine-grained shape and expression control, we apply the FLAME SDF as a geometric prior on the synthesized NeRF density, and self-supervise the image synthesis to commit to the target expression β and joint pose θ by comparing the input code against the re-estimated values \hat{β}, \hat{θ} from synthesized images.
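The two stages above can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the authors' code: the correspondence MLP W, the tri-plane sampling, and the NeRF decoder are replaced by small stand-in functions, and the SDF-guided density is a simple sigmoid gate suppressing density far outside the FLAME surface.

```python
import numpy as np

# Illustrative sketch of correspondence-guided volume rendering (all names
# and dimensions are hypothetical stand-ins, not the paper's architecture).

rng = np.random.default_rng(0)

D_CTRL = 8   # toy size of concatenated control codes (alpha, beta, theta)
D_FEAT = 4   # toy canonical feature size

W1 = rng.normal(size=(3 + D_CTRL, 16)) * 0.1   # toy weights for MLP W
W2 = rng.normal(size=(16, 4)) * 0.1            # outputs (dx, dy, dz, sdf)

def correspondence_W(x_obs, ctrl):
    """Map an observation-space point to (canonical point, signed distance)."""
    h = np.tanh(np.concatenate([x_obs, ctrl]) @ W1)
    out = h @ W2
    x_can = x_obs + out[:3]   # predicted offset into canonical space
    sdf = out[3]              # signed distance to the FLAME head surface
    return x_can, sdf

def canonical_radiance(x_can):
    """Stand-in for tri-plane feature sampling + NeRF decoding."""
    feat = np.sin(x_can.sum()) * np.ones(D_FEAT)
    raw_density = np.exp(-np.linalg.norm(x_can))
    return feat, raw_density

def render_ray(origin, direction, ctrl, n_samples=32):
    """Quadrature volume rendering with SDF-gated density along one ray."""
    ts = np.linspace(0.1, 2.0, n_samples)
    dt = ts[1] - ts[0]
    acc_feat = np.zeros(D_FEAT)
    transmittance = 1.0
    for t in ts:
        x_obs = origin + t * direction
        x_can, sdf = correspondence_W(x_obs, ctrl)
        feat, raw_density = canonical_radiance(x_can)
        # Geometric prior: suppress density far outside the head surface.
        density = raw_density / (1.0 + np.exp(4.0 * sdf))
        alpha = 1.0 - np.exp(-density * dt)
        acc_feat += transmittance * alpha * feat
        transmittance *= 1.0 - alpha
    return acc_feat

ctrl = rng.normal(size=D_CTRL)   # toy concatenated (alpha, beta, theta)
pixel_feat = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), ctrl)
```

In the full system the accumulated feature map would be fed to the super-resolution module; here the loop only shows how the correspondence field re-routes each observation-space sample into the canonical space before decoding.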

Results

Poster


BibTeX

@InProceedings{Xu_2023_CVPR_OmniAvatar,
        author    = {Xu, Hongyi and Song, Guoxian and Jiang, Zihang and Zhang, Jianfeng and Shi, Yichun and Liu, Jing and Ma, Wanchun and Feng, Jiashi and Luo, Linjie},
        title     = {OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2023},
        pages     = {12814-12824}
    }