Evangelos (Vangelis) Kazakos

PhD in Computer Science, currently a postdoctoral researcher at CIIRC @ CTU

Office: B-637

Building: Czech Institute of Informatics, Robotics and Cybernetics (CIIRC @ CTU)

Address: Jugoslávských partyzánů 1580/3, 160 00 Prague

Email: evangelos [dot] kazakos [at] cvut [dot] cz

Welcome to my webpage! I’m Vangelis. I hold a PhD from the University of Bristol, where my thesis focused on Audio-Visual Egocentric Action Recognition, highlighting the essential role of audio in understanding egocentric actions captured through wearable sensors. I was also a key contributor to the EPIC-KITCHENS dataset, which has since become a cornerstone in the field of egocentric vision research.

Currently, I am a postdoctoral researcher at the Czech Institute of Informatics, Robotics, and Cybernetics (CIIRC) at CTU Prague. My research is focused on multimodal learning and architectures, with a particular emphasis on spatio-temporal grounding using video and language data.

news

Dec 23, 2025 Our paper, REALM, a real-to-sim validated benchmark for robot manipulation, is on arXiv. [arXiv] [Project webpage] [Code]
Dec 16, 2025 I’m co-organising the first AI for Peace workshop @ ICLR 2026. [Workshop webpage]
Sep 02, 2025 We’ve released grove-transformers, a lightweight, inference-only interface for our GROVE model, implemented with 🤗 Transformers. [🤗 link] [GitHub link]
Aug 21, 2025 We’ve released datasets (HowToGround1M and iGround), checkpoints and code for our work “Large-scale Pre-training for Grounded Video Caption Generation”. [Project webpage] [Code, checkpoints, data]
Jul 22, 2025 I’ve been invited to present our work “Large-scale Pre-training for Grounded Video Caption Generation” at the EuroHPC User Days event in the Artificial Intelligence for Science parallel session, scheduled for 1 October 2025 at 11:30. [Event link]
Jun 26, 2025 Our work “Large-scale Pre-training for Grounded Video Caption Generation” has been accepted to ICCV 2025!
May 15, 2025 I was nominated as Outstanding Reviewer for CVPR 2025. [link]
Apr 18, 2025 I presented our work “Large-scale Pre-training for Grounded Video Caption Generation” at the weekly webinar of TwelveLabs. [YouTube link]
Mar 13, 2025 Our paper “Large-scale Pre-training for Grounded Video Caption Generation” is now on arXiv. [arXiv] [Project webpage] [Code] (available soon, stay tuned!)
Feb 28, 2025 I received the 2024 IJCV Outstanding Reviewer Award. [Announcement]
Nov 01, 2024 I started a new role as a Postdoctoral Researcher at the Czech Institute of Informatics, Robotics and Cybernetics (CIIRC) at CTU in Prague. My research will focus on multimodal understanding using video and language.
Feb 27, 2024 Our paper “TIM: A Time Interval Machine for Audio-Visual Action Recognition” has been accepted at CVPR 2024. [paper] [project page]
Jan 18, 2024 Our paper “Graph Guided Question Answer Generation for Procedural Question-Answering” has been accepted at EACL 2024. [paper]
Feb 17, 2023 Our paper “EPIC-SOUNDS: A Large-scale Dataset of Actions That Sound” has been accepted at ICASSP 2023. [paper] [project page]
May 30, 2022 I joined Samsung AI Center in Cambridge as a Research Scientist.
Apr 27, 2022 I successfully defended my PhD dissertation, “Audio-Visual Egocentric Action Recognition”. [link]

selected publications

  1. Large-scale Pre-training for Grounded Video Caption Generation
    Evangelos Kazakos, Cordelia Schmid, and Josef Sivic
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2025
  2. TIM: A Time Interval Machine for Audio-Visual Action Recognition
    Jacob Chalk, Jaesung Huh, Evangelos Kazakos, and 2 more authors
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  3. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100
    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, and 8 more authors
    International Journal of Computer Vision, 2022
  4. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
    Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, and 2 more authors
    In British Machine Vision Conference (BMVC), 2021
  5. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
    Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and 1 more author
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2019