Human hand interactions are central to daily activities and communication, providing informative signals for human action, expression, and intent. Visual perception and modeling of hands from images and videos are therefore crucial for various applications, including understanding human interactions in the wild, virtual human modeling in 3D, human augmentation through assistive vision systems, and highly dexterous robotic manipulation.
While computer vision has evolved to estimate hand states ranging from coarse detection to nuanced 3D pose and shape estimation, existing methods still fall short on three key challenges. First, they struggle to handle complicated and fine-grained contact scenarios, such as grasping objects (i.e., hand-object contact) or touching one's own body (i.e., self-contact), where occlusions and deformations introduce substantial ambiguity. Second, most models generalize poorly to dynamic, real-world environments due to the domain gap between studio-collected training datasets and in-the-wild testing conditions. Third, beyond geometric estimation, current approaches often lack the ability to link low-level hand states to high-level semantic comprehension, e.g., through action labels or language.
This dissertation addresses these limitations by pursuing the goal of precise tracking and interpretation of fine-grained hand interactions from real-world visual data. To this end, the dissertation is organized around three key pillars:
The first pillar of this research focuses on building a diverse and high-quality data foundation. This involves investigating and capturing hand interaction datasets that include complex contacts, such as object interaction and self-contact, using multi-camera systems. This approach enables high-precision 3D pose and shape annotations for intricate hand contact, while offering valuable assets to the community.
The second pillar is dedicated to robust modeling to capture fine details. Leveraging the constructed datasets, we develop advanced machine learning methods for generalizable and adaptable estimation in the wild. This includes comprehensive analysis of 3D hand pose estimation tasks during object contact, building model priors from diverse image or pose data for downstream tasks in pose estimation, and proposing adaptation methods to further bridge performance gaps across different recording environments and camera settings.
The third pillar centers on connecting the captured low-level geometric information with high-level semantic understanding. This involves utilizing the predictions for hand geometry in 2D or 3D (e.g., detection, segmentation, and pose) to comprehend the semantics of the interaction. We explore natural language descriptions as semantic signals and propose generating dense video captions from the hand-object tracklets.
Collectively, these three pillars present a consistent and integrated framework to advance the visual understanding of hands in interactions. By combining advanced techniques in data foundation, robust modeling, and semantic understanding, this dissertation contributes to foundational technologies and intelligent systems for human-centric interactions, with broader applications and implications in computer vision.
@phdthesis{ohkawa:phd2025,
title = {Visual Understanding of Human Hands in Interactions},
author = {Takehiko Ohkawa},
school = {The University of Tokyo},
year = {2025}
}