UniGAHA: Audio-Driven Universal Gaussian Head Avatars

SIGGRAPH Asia 2025
1 Max Planck Institute for Informatics, Germany · 2 Saarland Informatics Campus, Germany · 3 Imperial College London, United Kingdom

Play with Audio. Our method synthesizes photorealistic 3D head avatars directly from speech. The clip demonstrates how raw audio drives accurate lip synchronization and natural upper-face motion, including appearance changes such as gaze shifts and realistic mouth interiors, while faithfully preserving the speaker’s identity.

Audio-Driven Consistent Cross-Identity Animation

Universal Head Avatar Prior (UHAP)

Personalization & Cross-Identity Animation. Our Universal Head Avatar Prior (UHAP) adapts to new, unseen users from minimal input data such as monocular videos or static captures. The audio-driven model maps speech signals into the latent expression space of the prior, ensuring consistent animation across diverse facial identities while maintaining high-fidelity lip synchronization and plausible upper-face motion.
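For readers who want a concrete picture of how speech can be mapped into a shared latent expression space, the sketch below shows one plausible wiring in PyTorch. The module names (AudioToExpression, uhap_decoder), tensor shapes, feature dimensions, and latent size are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the official implementation): an audio encoder that
# regresses per-frame latent expression codes, which a pretrained universal
# avatar prior then decodes into Gaussian head parameters for a given identity.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Maps a window of audio features to latent expression codes (assumed dims)."""
    def __init__(self, audio_dim=768, latent_dim=128, hidden=512):
        super().__init__()
        self.temporal = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, latent_dim)

    def forward(self, audio_feats):          # (B, T, audio_dim), e.g. wav2vec-style features
        h, _ = self.temporal(audio_feats)    # (B, T, 2 * hidden)
        return self.head(h)                  # (B, T, latent_dim) expression codes

# Hypothetical usage: decode expressions with a fixed identity code from the prior.
# `uhap_decoder` and `identity_code` stand in for the personalized avatar prior.
audio_feats = torch.randn(1, 100, 768)       # 100 frames of speech features
expr_codes = AudioToExpression()(audio_feats)
# gaussians_t = uhap_decoder(identity_code, expr_codes[:, t])  # per-frame 3D Gaussians
```

Because the identity code stays fixed while only the expression codes vary with the audio, the same speech clip can drive different avatars consistently, which is the cross-identity behavior described above.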

Monocular Video-Driven Animation

Video-Driven Expression Transfer. Our monocular image encoder, originally designed to onboard new users into the Universal Head Avatar Prior (UHAP), can be repurposed as a powerful expression transfer tool. By processing driving videos from a single camera viewpoint, the model extracts facial expressions and seamlessly transfers them to target avatars, enabling video-driven animation alongside our core audio-driven capabilities.
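As a rough illustration of this repurposing, the snippet below sketches how a monocular image encoder could be used for expression transfer under the same assumed interfaces as the sketch above; `image_encoder`, `uhap_decoder`, and `target_identity` are placeholders, not names from the paper or code release.

```python
# Illustrative sketch (assumed interfaces): video-driven expression transfer by
# encoding per-frame expression codes from a driving video and decoding them
# with the target subject's identity code from the universal prior.
import torch

@torch.no_grad()
def transfer_expressions(driving_frames, image_encoder, uhap_decoder, target_identity):
    """driving_frames: (T, 3, H, W) monocular video of the source actor."""
    avatars = []
    for frame in driving_frames:
        expr_code = image_encoder(frame.unsqueeze(0))                   # (1, latent_dim)
        avatars.append(uhap_decoder(target_identity, expr_code))        # Gaussians for this frame
    return avatars
```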

Citation

@misc{teotia2025audiodrivenuniversalgaussianhead,
      title={Audio-Driven Universal Gaussian Head Avatars}, 
      author={Kartik Teotia and Helge Rhodin and Mohit Mendiratta and Hyeongwoo Kim and Marc Habermann and Christian Theobalt},
      year={2025},
      eprint={2509.18924},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.18924}, 
}