Evangelos Kazakos¹, Jaesung Huh², Arsha Nagrani†², Andrew Zisserman² and Dima Damen¹

¹University of Bristol, Dept. of Computer Science, ²University of Oxford, VGG



In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio input modality and a language model to rescore predictions.
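To make the idea of rescoring predictions with a language model concrete, here is a minimal sketch. It is not the implementation from the paper: the function `rescore`, the toy bigram table, the `top_k` and `lm_weight` parameters and all probabilities are hypothetical, chosen only for illustration. It assumes the recognition model outputs per-action log-probabilities for a short window of consecutive actions, and that a language model can score a sequence of action labels.

```python
import itertools
import math


def rescore(window_log_probs, bigram_log_prob, top_k=2, lm_weight=0.5):
    """Pick the action sequence maximising model score + weighted LM score.

    window_log_probs: one dict per action in the window, mapping
        action label -> log-probability from the recognition model.
    bigram_log_prob: function (prev_label, label) -> log-probability.
    Both inputs are assumptions of this sketch, not the paper's interface.
    """
    # Keep only the top_k most likely labels at each position.
    candidates = [
        sorted(lp, key=lp.get, reverse=True)[:top_k] for lp in window_log_probs
    ]
    best_seq, best_score = None, -math.inf
    # Exhaustively score every combination of the retained candidates.
    for seq in itertools.product(*candidates):
        model_score = sum(lp[a] for lp, a in zip(window_log_probs, seq))
        lm_score = sum(bigram_log_prob(p, a) for p, a in zip(seq, seq[1:]))
        score = model_score + lm_weight * lm_score
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq), best_score


# Toy window of three consecutive actions (made-up probabilities).
window = [
    {"open fridge": math.log(0.9), "close fridge": math.log(0.1)},
    {"take milk": math.log(0.45), "wash cup": math.log(0.55)},
    {"close fridge": math.log(0.8), "open fridge": math.log(0.2)},
]

# Toy bigram language model: likely transitions get high probability,
# everything else falls back to a small default (hypothetical values).
BIGRAMS = {
    ("open fridge", "take milk"): 0.6,
    ("take milk", "close fridge"): 0.7,
}


def toy_bigram(prev_label, label):
    return math.log(BIGRAMS.get((prev_label, label), 0.05))


without_lm, _ = rescore(window, toy_bigram, lm_weight=0.0)
with_lm, _ = rescore(window, toy_bigram, lm_weight=0.5)
print(without_lm)
print(with_lm)
```

In this toy window, the model alone slightly prefers "wash cup" for the middle action, while the sequence prior favours "take milk" between opening and closing the fridge. Exhaustive enumeration over the top candidates is only practical for short windows; a beam search would be the usual substitute for longer ones.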




  author={Kazakos, Evangelos and Huh, Jaesung and Nagrani, Arsha and Zisserman, Andrew and Damen, Dima},
  booktitle={British Machine Vision Conference (BMVC)},
  title={With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition},


This work used public datasets and is supported by the EPSRC Doctoral Training Program, EPSRC UMPIRE (EP/T004991/1) and the EPSRC Programme Grant VisualAI (EP/T028572/1). Jaesung Huh is funded by a Global Korea Scholarship.

† Now at Google Research