This repository provides the code used to implement the model proposed in the paper:
Jacob Chalk*, Jaesung Huh*, Evangelos Kazakos, Andrew Zisserman, Dima Damen, TIM: A Time Interval Machine for Audio-Visual Action Recognition, CVPR, 2024
(* indicates equal contribution.)
When using this code, please reference:
@InProceedings{Chalk2024TIM,
author = {Chalk, Jacob and Huh, Jaesung and Kazakos, Evangelos and Zisserman, Andrew and Damen, Dima},
title = {{TIM}: {A} {T}ime {I}nterval {M}achine for {A}udio-{V}isual {A}ction {R}ecognition},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024}
}
The requirements for TIM can be installed in a separate conda environment by running the following command in your terminal: conda env create -f environment.yml. You can then activate the environment with: conda activate TIM
NOTE: This environment only applies to the recognition and detection folders. Separate requirements are listed for the backbones in the feature_extractors folder.
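Before running anything from the recognition or detection folders, it can help to confirm that the intended environment is actually active. The following is a minimal sketch, not part of the TIM codebase: it relies on the CONDA_DEFAULT_ENV variable that conda sets on activation, and the helper names active_conda_env and in_tim_env are hypothetical.

```python
import os
from typing import Optional

# Environment name taken from the `conda activate TIM` command above.
EXPECTED_ENV = "TIM"


def active_conda_env() -> Optional[str]:
    """Return the name of the active conda environment, or None if none is active."""
    # conda exports CONDA_DEFAULT_ENV when an environment is activated.
    return os.environ.get("CONDA_DEFAULT_ENV") or None


def in_tim_env() -> bool:
    """Return True when the active conda environment is the expected one."""
    return active_conda_env() == EXPECTED_ENV


if __name__ == "__main__":
    env = active_conda_env()
    if in_tim_env():
        print(f"Active conda environment: {env}")
    else:
        print(f"Expected conda environment {EXPECTED_ENV!r}, found {env!r}")
```

The check only inspects the environment name; it does not verify that the packages listed in environment.yml are installed.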
The features used for this project can be extracted by following the instructions in the feature_extractors folder.
You can find links to the relevant pre-trained models in the recognition, feature_extractors and detection folders.
We provide the necessary ground-truth files for all datasets here.
The link points to a zip archive with the ground-truth data for each dataset, consisting of:
- The training split ground truth
- The validation split ground truth
- The video metadata of the dataset
- The feature time intervals for the training and validation splits
NOTE: These annotation files have been cleaned to be compatible with the TIM codebase.
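Once the archive is downloaded, a short script can unpack it and show which files belong to which dataset. This is a sketch under assumptions: the archive name, the output directory, and a layout with one top-level folder per dataset are not specified here, and extract_and_index is a hypothetical helper rather than part of the TIM codebase.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def extract_and_index(zip_path: str, out_dir: str) -> Dict[str, List[str]]:
    """Extract the archive and group its file names by top-level folder."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()

    index: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        if name.endswith("/"):
            # Skip directory entries; only files are indexed.
            continue
        parts = PurePosixPath(name).parts
        # Files stored at the archive root are grouped under ".".
        top = parts[0] if len(parts) > 1 else "."
        index[top].append(name)
    return {folder: sorted(files) for folder, files in index.items()}


if __name__ == "__main__":
    # Hypothetical paths; replace with the location of the downloaded archive.
    for folder, files in extract_and_index("ground_truth.zip", "ground_truth").items():
        print(folder)
        for file_name in files:
            print("  ", file_name)
```

If the archive does use one folder per dataset, each group should contain the training split, the validation split, the video metadata, and the feature time intervals listed above.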
We provide instructions on how to train and evaluate TIM for both recognition and detection in the respective folders.
The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.