SiMHand: Mining Similar Hands for 3D Hand Pose Pre-Training


Nie Lin1*   Takehiko Ohkawa1*  Yifei Huang1   Mingfang Zhang1   Ming Li1   Minjie Cai2   Ryosuke Furuta1   Yoichi Sato1

1The University of Tokyo   2Hunan University  *Equal Contribution

International Conference on Learning Representations (ICLR), 2025





Abstract

We present a framework for pre-training 3D hand pose estimation from in-the-wild hand images that share similar hand characteristics, dubbed SiMHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DoH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) across various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.

Similar Hands Mining for Positive Pairing

To incorporate diverse samples into contrastive learning, we design positive pairs from non-identical images with similar foreground hands. We construct a mining algorithm that finds similar hands in large human-centric video collections (Ego4D and 100DoH) by focusing on pose similarity between hand images. For illustration, we denote "Top-1" (the most similar sample) as the positive assigned to the query image; the remaining samples ("Top-K") show the K-th most similar samples for reference. Our sampling highlights the diversity of captured environments and interactions, while also showing that as the rank (distance) increases, the sampled images become dissimilar.
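The mining step above can be sketched as follows: normalize each hand's 2D keypoints to remove position and scale, then rank all other samples by pose distance. This is a minimal illustration assuming per-crop 2D keypoints are available (e.g., from an off-the-shelf detector); the function names and the exhaustive pairwise search are illustrative, not the released implementation, which must scale to millions of images.

```python
import numpy as np

def normalize_pose(kpts):
    """Center at the wrist (joint 0) and scale to unit size so that
    similarity reflects hand pose, not image position or hand size."""
    kpts = kpts - kpts[0]                      # translate wrist to origin
    scale = np.linalg.norm(kpts, axis=1).max()
    return kpts / (scale + 1e-8)

def mine_similar_hands(poses, k=5):
    """Rank all non-identical samples by pose distance.

    poses: (N, 21, 2) array of 2D hand keypoints.
    Returns (N, k) indices of the Top-k most similar other samples;
    column 0 (Top-1) serves as the positive pair for each query.
    """
    normed = np.stack([normalize_pose(p) for p in poses])  # (N, 21, 2)
    flat = normed.reshape(len(poses), -1)
    # Pairwise Euclidean distances between flattened pose vectors.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude the query itself
    return np.argsort(dists, axis=1)[:, :k]
```

Because the normalization discards translation and scale, two hands in very different scenes can still be matched as a positive pair, which is exactly what lets the pairs span diverse environments.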

Contrastive Learning from Similar Hands with Adaptive Weighting

We propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. A naive adaptation would simply replace the original positive pairs in contrastive learning with the mined similar hands, but this scheme fails to exploit how similar the paired hands actually are. To address this, we assign weights in the contrastive learning loss based on the similarity scores within the mini-batch, designed so that more similar pairs receive higher weights. This allows the optimization to explicitly account for the proximity of samples, going beyond binary discrimination between positives and negatives.
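The weighting idea can be sketched as a modified InfoNCE loss in which each query-positive pair's term is scaled by a weight derived from its pose distance. This NumPy sketch is an illustration of the principle under assumed design choices (an exponential weight on pose distance, normalized to mean one over the batch), not the paper's exact loss formulation.

```python
import numpy as np

def adaptive_weighted_nce(z, z_pos, pose_dist, tau=0.1):
    """Sketch of an adaptively weighted contrastive (InfoNCE-style) loss.

    z, z_pos : (B, D) L2-normalized embeddings of the queries and their
               mined similar-hand positives.
    pose_dist: (B,) pose distance between each query and its positive;
               a smaller distance yields a larger weight on that pair.
    """
    B = len(z)
    # Logits of each query against all positives and all other queries.
    sim = z @ np.concatenate([z_pos, z]).T / tau       # (B, 2B)
    sim[np.arange(B), B + np.arange(B)] = -np.inf      # mask self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    nce = -log_prob[np.arange(B), np.arange(B)]        # positive at column i
    # Adaptive weights: closer pose pairs contribute more to the loss.
    w = np.exp(-pose_dist)
    w = w / w.sum() * B                                # normalize to mean 1
    return (w * nce).mean()
```

With all pose distances equal, the weights become uniform and the loss reduces to standard InfoNCE over the mined pairs; unequal distances redistribute the gradient toward the most pose-consistent pairs.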

In the figure below, hand images (I, I+, I−) and their corresponding 2D keypoints are fed to the model. Random augmentations spatially transform both the images and their 2D keypoints in a consistent manner. The transformed 2D keypoints are then used to compute the adaptive weights, which guide contrastive learning by strengthening or weakening the alignment between positive and negative samples.

Results

We validate the effectiveness of the pre-trained networks by fine-tuning on several datasets for 3D hand pose estimation, namely FreiHand, DexYCB, and AssemblyHands. Our proposed method consistently outperforms conventional contrastive learning methods, SimCLR and PeCLR. The figure below demonstrates the performance gain of our method over the comparative pre-training methods on the three fine-tuning datasets.


© Takehiko Ohkawa
