HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

1CUHK, 2Shanghai Artificial Intelligence Laboratory

By training a simple baseline that combines Animate Anyone and CameraCtrl on our HumanVid dataset, without any tricks, we are able to generate movie-level videos with controllable character pose and camera movement. (Top left): input image; (Top right): human pose condition; (Bottom left): our result; (Bottom right): ground-truth video. The camera condition is not visualized in this video.

Videos may take a few seconds to load.

Abstract

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such a simple baseline trained on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark.
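To make the rule-based filtering idea concrete, here is a minimal sketch of what such a filter could look like. The `VideoMeta` fields and every threshold below are illustrative assumptions, not the actual rules used to build HumanVid.

```python
from dataclasses import dataclass

@dataclass
class VideoMeta:
    """Hypothetical per-video metadata produced by off-the-shelf detectors."""
    width: int
    height: int
    duration_s: float
    num_people: int               # from a person detector
    min_person_area_ratio: float  # smallest person-bbox area / frame area over all frames
    has_scene_cut: bool           # from a shot-boundary detector

def keep_video(meta: VideoMeta) -> bool:
    """Toy filter: keep single-person, 1080p, moderately long clips without cuts.

    The thresholds are illustrative assumptions, not HumanVid's published criteria.
    """
    if min(meta.width, meta.height) < 1080:      # require at least 1080p
        return False
    if not (4.0 <= meta.duration_s <= 60.0):     # drop very short or very long clips
        return False
    if meta.num_people != 1:                     # keep human-centric, single-person clips
        return False
    if meta.min_person_area_ratio < 0.05:        # person too small in frame
        return False
    if meta.has_scene_cut:                       # avoid shot changes within a clip
        return False
    return True

# Example usage with made-up metadata
clip = VideoMeta(width=1920, height=1080, duration_s=12.5,
                 num_people=1, min_person_area_ratio=0.2, has_scene_cut=False)
print(keep_video(clip))  # True
```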

As our paper is still under review, we cannot release code and data immediately. We plan to release the data in a few weeks. The training/inference code and checkpoints will be publicly available in late September 2024. Should you have any enquiries, please contact the first author.


Overview of our simple baseline combining Animate Anyone and CameraCtrl.
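As a rough sketch of how the camera condition can be encoded in such a baseline, the snippet below computes per-pixel Plücker ray embeddings from intrinsics and world-to-camera extrinsics, in the spirit of CameraCtrl-style camera conditioning. The function name, the (moment, direction) ordering, and the normalization are our own assumptions rather than the exact CamAnimate implementation.

```python
import numpy as np

def plucker_embedding(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                      H: int, W: int) -> np.ndarray:
    """Per-pixel Plücker ray embedding of shape (H, W, 6) for one frame.

    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation and translation.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)

    # Back-project pixels to ray directions, then rotate them into world coordinates.
    dirs_cam = pix @ np.linalg.inv(K).T                     # K^{-1} p per pixel
    dirs_world = dirs_cam @ R                               # equals R^T d for row vectors
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Camera center in world coordinates and Plücker moment o x d.
    origin = (-R.T @ t).reshape(1, 1, 3)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([moment, dirs_world], axis=-1)    # (H, W, 6)
```

In a CameraCtrl-style setup, these embeddings are stacked over frames and fed to a camera encoder, while the 2D human pose sequence goes through an Animate Anyone-style pose guider; how the two features are fused inside CamAnimate is not something this sketch specifies.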

In-the-wild Cases

All the cases shown below are test-set results. Horizontal and vertical videos are generated by the same model weights. You can use the maximize button in the bottom-right corner of each video to play it in full screen and observe more details.

Pexels Test Set Cases (Horizontal)

Pexels Test Set Cases (Vertical)

From left to right: input image, human pose condition, our result, ground-truth video.

TikTok Test Set Cases

The TikTok test set uses the last 40 videos. The visualizations below are from this set, using a model trained on HumanVid and the first 300 TikTok videos. By setting the camera parameters to be static, the model produces static backgrounds, which is consistent with Animate Anyone's setting.
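For reference, a static camera condition of this kind can be expressed as the same world-to-camera extrinsic repeated for every frame; the 4x4 matrix format below is an assumption about how the trajectory is represented, not the model's actual input specification.

```python
import numpy as np

def static_camera_trajectory(num_frames: int) -> np.ndarray:
    """Identity extrinsics (no rotation, no translation) for every frame,
    which reduces camera-controllable generation to a fixed-camera setting."""
    return np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))  # (T, 4, 4)
```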

Cross-identity Cases

Our model requires a paired camera trajectory and 2D human pose, making cross-identity inference more complex than in Animate Anyone's static-camera setting. We show successful cases with simple backgrounds, including the source videos (first two videos) and cross-identity results (remaining videos). Besides extracting data from existing videos, you can project 3D human poses into camera space and export custom camera trajectories using software such as Blender.
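As a sketch of that projection step, the helper below applies a standard pinhole projection to 3D joints under a given camera pose; the function name and joint layout are hypothetical, but the math is the usual world-to-camera transform followed by a perspective divide.

```python
import numpy as np

def project_joints(joints_world: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project 3D joints (N, 3) in world coordinates to 2D pixels (N, 2)
    with a pinhole camera whose world-to-camera extrinsics are [R | t]."""
    joints_cam = joints_world @ R.T + t          # (N, 3) in camera space
    uvw = joints_cam @ K.T                       # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]              # perspective divide
```

Repeating this per frame with the extrinsics exported from Blender (or any custom trajectory) yields the 2D pose sequence seen from the new camera.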

Comparison with Previous Methods

Visualization of Synthetic Videos

TL;DR: We synthesize two types of human-centric videos according to the type of human asset: (1) SMPL-X poses with UV texture maps and simulated clothes, similar to BEDLAM, and (2) 3D anime character assets with rigged motions. The backgrounds come from HDRI images or 3D scenes. The motivation for using synthetic data is to obtain accurate human/camera poses and more diverse camera trajectories.
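A minimal example of what rule-based camera trajectory generation could look like: sample a start and an end viewpoint around the character and interpolate between them while keeping the camera aimed at the subject. The sampling ranges and the look-at convention are illustrative assumptions, not the actual rules used for our synthetic pipeline.

```python
import numpy as np

def look_at(cam_pos: np.ndarray, target: np.ndarray,
            up: np.ndarray = np.array([0.0, 1.0, 0.0])) -> np.ndarray:
    """World-to-camera rotation whose rows are the camera's right/up/forward axes."""
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, forward])

def sample_trajectory(num_frames: int, target: np.ndarray = np.zeros(3),
                      rng=np.random) -> list:
    """Linearly interpolate the camera position between two random viewpoints,
    always looking at the subject; returns per-frame world-to-camera (R, t) pairs."""
    def random_pos():
        radius = rng.uniform(2.0, 5.0)          # distance from the subject
        azimuth = rng.uniform(0.0, 2 * np.pi)   # angle around the subject
        height = rng.uniform(0.5, 2.0)          # camera height
        return np.array([radius * np.cos(azimuth), height, radius * np.sin(azimuth)])

    start, end = random_pos(), random_pos()
    trajectory = []
    for a in np.linspace(0.0, 1.0, num_frames):
        pos = (1 - a) * start + a * end
        R = look_at(pos, target)
        t = -R @ pos                            # world-to-camera translation
        trajectory.append((R, t))
    return trajectory
```

More elaborate rules (arcs, zooms, handheld-style jitter) can be layered on top of the same interpolation idea.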

To view real videos, visit Pexels.com and search for human-related terms, as we cannot redistribute their videos. The real-world part of our HumanVid dataset is filtered from Pexels videos.