Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel*, Thomas Lucas, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Gregory Rogez

Abstract


"We present , a strong model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, , including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, , without relying on explicit crops around body parts, is hard to learn from existing data, we introduce , the dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448×448 images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results."

Related Material


[pdf] [DOI]