Domain Adaptive Hand Keypoint and Pixel Localization in the Wild


Takehiko Ohkawa1,2   Yu-Jhe Li2   Qichen Fu2   Ryosuke Furuta1   Kris M. Kitani2   Yoichi Sato1

1The University of Tokyo   2Carnegie Mellon University  

European Conference on Computer Vision (ECCV), 2022





Abstract

We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that a model trained for both tasks works under various imaging conditions. However, the variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions are given by two separate networks, and their divergence helps identify the noisy predictions. To integrate the proposed confidence estimation into self-training, we propose a teacher-student framework in which the two networks (teachers) provide supervision to a third network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show the superiority of our method over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves the multi-task score on HO3D by 4% compared to the latest adversarial adaptation method. We also validate our method on Ego4D, which contains egocentric videos with rapid changes in imaging conditions outdoors.
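
To make the confidence weighting concrete, the sketch below shows one way the divergence of two teachers' predictions could be mapped to a per-sample confidence weight. This is a minimal illustration, not the paper's exact formulation: the function name `confidence_from_divergence`, the mean-squared divergence, and the exponential mapping with temperature `beta` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def confidence_from_divergence(pred_a, pred_b, beta=10.0):
    """Turn the disagreement of two teachers into a confidence weight.

    pred_a, pred_b: (B, C, H, W) predictions from the two teacher networks,
    e.g. keypoint heatmaps or per-pixel hand segmentation scores.
    Returns one weight in (0, 1] per sample: large divergence between the
    teachers (a likely noisy prediction) yields a weight near 0, so that
    sample contributes little to self-training.
    """
    # Mean squared divergence per sample (the choice of divergence is illustrative).
    div = F.mse_loss(pred_a, pred_b, reduction="none").mean(dim=(1, 2, 3))
    # Exponential mapping (assumed here): zero divergence -> confidence 1.0.
    return torch.exp(-beta * div)
```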

Overview

We present our proposed self-training domain adaptation with confidence estimation for adapting both hand keypoint regression and hand segmentation. Our method consists of two separate networks (teachers) for confidence estimation and another network (student) for self-training on both tasks. The student is trained with confidence-aware consistency training under geometric augmentation: it learns from the consistency between its prediction on augmented target data and the two teachers' predictions on unaugmented target data. This training is weighted by the target confidence computed from the divergence of the two teachers' predictions. The teachers are trained by knowledge distillation so that each teacher independently learns to match the student's predictions.
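
The pseudocode below sketches how one training step could combine these pieces: divergence-based confidence, confidence-weighted consistency between the student's prediction on an augmented target image and the teachers' predictions on the unaugmented image, and distillation of the student into each teacher. It is only an illustration under assumptions: the mean-squared losses, the `augment` helper (which returns the augmented image and a function that warps predictions into the augmented frame), the temperature `beta`, and the optimizer handling are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_training_step(student, teacher_a, teacher_b, target_img, augment,
                       opt_student, opt_teachers, beta=10.0):
    """One self-training step on an unlabeled target batch (illustrative only)."""
    # 1) Teachers predict on the unaugmented target image (pseudo-labels),
    #    and their divergence gives a per-sample confidence weight.
    with torch.no_grad():
        pseudo_a = teacher_a(target_img)
        pseudo_b = teacher_b(target_img)
        pseudo = 0.5 * (pseudo_a + pseudo_b)
        div = F.mse_loss(pseudo_a, pseudo_b, reduction="none").mean(dim=(1, 2, 3))
        conf = torch.exp(-beta * div)  # low weight for noisy (divergent) targets

    # 2) Confidence-weighted consistency: the student's prediction on the
    #    geometrically augmented image should match the teachers' pseudo-label
    #    warped into the same augmented frame.
    aug_img, warp = augment(target_img)  # assumed helper, not the paper's API
    student_pred = student(aug_img)
    consistency = F.mse_loss(student_pred, warp(pseudo),
                             reduction="none").mean(dim=(1, 2, 3))
    loss_student = (conf * consistency).mean()
    opt_student.zero_grad()
    loss_student.backward()
    opt_student.step()

    # 3) Knowledge distillation: each teacher independently learns to match
    #    the (detached) student prediction on the unaugmented image.
    with torch.no_grad():
        student_target = student(target_img)
    loss_teachers = (F.mse_loss(teacher_a(target_img), student_target)
                     + F.mse_loss(teacher_b(target_img), student_target))
    opt_teachers.zero_grad()
    loss_teachers.backward()
    opt_teachers.step()

    return loss_student.item(), loss_teachers.item()
```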

Results

Our method can successfully adapt to in-the-wild egocentric videos (Ego4D) that have imaging conditions different from the source domain and lack annotations for hand keypoints and pixel-level masks. The figure below shows the adaptation results in outdoor scenarios. Please see our paper for further results and comparisons.


