We propose a novel benchmark for cross-view knowledge transfer in dense video captioning, adapting models from web instructional videos with exocentric views to the egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes: the web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates an in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between the two datasets with access to their ground truth. To bridge the view gaps, we propose a view-invariant learning method based on adversarial training, which consists of pre-training and fine-tuning stages. Our experiments confirm that the method effectively overcomes the view change problem and transfers knowledge to the egocentric view. Our benchmark pushes the study of cross-view transfer into a new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language.
We collect a new egocentric dense video captioning dataset, EgoYC2. The EgoYC2 captions follow the caption definition of YouCook2 (YC2), ensuring that the two datasets are uniform in caption content and granularity and can be evaluated consistently. Specifically, we directly adopt the procedural captions from YC2, which describe the sequence of steps necessary to complete complex tasks. We then re-record these cooking videos by instructing participants wearing a head-mounted camera to cook while referring to YC2's captions (recipes).
We aim to learn view-invariant features among the mixed source views and the unique egocentric view. We perform pre-training and fine-tuning in separate stages to handle the large domain gap between them.
To address this, we employ a "divide-and-conquer" approach, dubbed gradual domain adaptation, which breaks the large domain gap into smaller gaps and adapts across them step by step. Specifically, we introduce an intermediate domain (i.e., ego-like) in the source data that shares visual similarity with the target data (i.e., ego). This allows us to resolve the large gap gradually, from the exo view to the ego-like view and finally to the ego view, as sketched below.
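The following is a minimal sketch of this staged schedule. It assumes the source web videos are pre-split into exo-view and ego-like (hand-region) shots; the domain names and stage layout are illustrative and not the exact training recipe.

```python
# Illustrative gradual-adaptation schedule (assumed stage layout, not the
# authors' exact recipe): each stage closes a smaller view gap in turn.
STAGES = [
    # (stage, domains seen during training, view pair aligned adversarially)
    ("view-invariant pre-training", ["exo", "ego-like"], ("exo", "ego-like")),
    ("view-invariant fine-tuning", ["exo", "ego-like", "ego"], ("source", "ego")),
]

for stage, train_domains, (view_a, view_b) in STAGES:
    print(f"{stage}: train on {train_domains}, align {view_a} <-> {view_b}")
```

Because pre-training already aligns the exo and ego-like shots within the web videos, fine-tuning only has to bridge the remaining, smaller gap to the real egocentric recordings.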
We facilitate the learning of invariant features with adversarial training while gradually adapting from the exo view to the ego-like view and finally to the ego view. The video features are converted into a learnable representation by the converter F. These features are fed to the task network G, which solves the captioning task, and to the view classifier C, which drives the converted features to be view-invariant. The classifier C is trained by adversarial adaptation with a gradient reversal layer: the converter F attempts to produce features that the classifier cannot distinguish, while the classifier is trained to classify their views.
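Below is a minimal PyTorch sketch of this adversarial scheme with a gradient reversal layer. The module shapes, the stand-in task head, and the helper names (GradReverse, ViewInvariantModel) are illustrative assumptions; the actual task network G is a dense video captioning model.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the converter F.
        return -ctx.lambd * grad_output, None


class ViewInvariantModel(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_views=3):
        super().__init__()
        # Converter F: maps raw video features to a learnable representation.
        self.converter = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        # Task network G: a stand-in head here; in the paper it is the captioning model.
        self.task_head = nn.Linear(hidden_dim, hidden_dim)
        # View classifier C: predicts exo / ego-like / ego from the converted features.
        self.view_classifier = nn.Linear(hidden_dim, num_views)

    def forward(self, video_feats, lambd=1.0):
        z = self.converter(video_feats)                  # F(x)
        task_out = self.task_head(z)                     # G(F(x)) -> captioning loss
        view_logits = self.view_classifier(
            GradReverse.apply(z, lambd))                 # C(GRL(F(x))) -> view loss
        return task_out, view_logits


# Example step: the total loss is the task loss plus the view-classification loss.
# The gradient reversal layer lets C learn to classify views while pushing F to
# produce features that C cannot distinguish.
model = ViewInvariantModel()
feats = torch.randn(8, 2048)                 # a batch of clip-level video features
view_labels = torch.randint(0, 3, (8,))      # 0: exo, 1: ego-like, 2: ego
task_out, view_logits = model(feats)
loss = task_out.mean() + nn.CrossEntropyLoss()(view_logits, view_labels)
# task_out.mean() stands in for the captioning loss of the real task network G.
loss.backward()
```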
The figure below shows qualitative results of generated captions and predicted time segments. For time segmentation, while PT+FT and PT+VI-FT generate overlapping segments and overlook some segments, VI-PT+FT and our full method correct these localization errors. In the generated captions, we observe two failure patterns: the appearance of unrelated ingredients and duplicated captions. The captions of PT+FT and PT+VI-FT include unrelated ingredients (e.g., “tofu”, “pork”, and “udon noodles”). We also find repeated sentences (marked with triangles) in the models without the view-invariant fine-tuning (i.e., PT+FT and VI-PT+FT). These observations suggest that the view-invariant pre-training reduces the mixing of unrelated ingredients, and the subsequent view-invariant fine-tuning helps produce fewer repeated sentences.