With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Evangelos Kazakos¹, Jaesung Huh², Arsha Nagrani†², Andrew Zisserman² and Dima Damen¹

¹University of Bristol, Dept. of Computer Science, ²University of Oxford, VGG

Overview

Abstract

In egocentric videos, actions occur in quick succession. We capitalise on the action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio input modality and a language model to rescore predictions.
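To make the idea of rescoring predictions with action sequence context concrete, here is a minimal sketch. It is not the implementation released with the paper: the function `rescore`, the bigram table, the `lm_weight` and `unseen_logp` parameters and all probabilities are hypothetical, and a simple bigram prior stands in for the language model. It assumes an audio-visual model has already produced per-action log-probabilities for a short window of consecutive actions.

```python
import itertools
import math


def rescore(av_logprobs, bigram_logp, top_k=2, lm_weight=1.0,
            unseen_logp=math.log(1e-3)):
    """Pick the action sequence with the best combined score.

    av_logprobs: one dict {action: log-probability} per action in the window
                 (assumed output of an audio-visual model).
    bigram_logp: dict {(previous_action, action): log-probability}, a toy
                 stand-in for a language model over action sequences.
    """
    # Keep only the top_k most likely actions at each position.
    candidates = [
        sorted(step, key=step.get, reverse=True)[:top_k]
        for step in av_logprobs
    ]
    best_seq, best_score = None, -math.inf
    # Exhaustively score every combination of the retained candidates.
    for seq in itertools.product(*candidates):
        av_score = sum(step[a] for step, a in zip(av_logprobs, seq))
        lm_score = sum(
            bigram_logp.get((prev, cur), unseen_logp)
            for prev, cur in zip(seq, seq[1:])
        )
        score = av_score + lm_weight * lm_score
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score


# Hypothetical window of three consecutive actions; the middle one is ambiguous.
window = [
    {"open fridge": math.log(0.9), "open drawer": math.log(0.1)},
    {"take knife": math.log(0.55), "take milk": math.log(0.45)},
    {"close fridge": math.log(0.8), "close drawer": math.log(0.2)},
]

# Hypothetical bigram prior over consecutive actions.
prior = {
    ("open fridge", "take milk"): math.log(0.6),
    ("take milk", "close fridge"): math.log(0.7),
    ("open fridge", "take knife"): math.log(0.05),
    ("take knife", "close fridge"): math.log(0.05),
}

with_context, _ = rescore(window, prior)
without_context, _ = rescore(window, prior, lm_weight=0.0)
print(with_context)
print(without_context)
```

With `lm_weight=0.0` the sketch reduces to picking the most likely action at each position independently, so comparing the two calls shows how the sequence prior can overturn an ambiguous per-action prediction. Exhaustive enumeration is only practical for short windows and small `top_k`.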

Video

Downloads

Paper [arXiv]
Code and models [GitHub]

Bibtex

@INPROCEEDINGS{kazakos2021MTCN,
  author={Kazakos, Evangelos and Huh, Jaesung and Nagrani, Arsha and Zisserman, Andrew and Damen, Dima},
  booktitle={British Machine Vision Conference (BMVC)},
  title={With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition},
  year={2021}}

Acknowledgements

This work used public datasets and is supported by the EPSRC Doctoral Training Program, EPSRC UMPIRE (EP/T004991/1) and the EPSRC Programme Grant VisualAI (EP/T028572/1). Jaesung Huh is funded by a Global Korea Scholarship.

†Now at Google Research

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition (BMVC 2021)
