[NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models

🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!

Introduction

This repository contains the official implementation of the paper "Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models". In this work we introduce and study two simple sampling strategies (MIF and MDF) for tuning pretrained Visual Language Models (VLMs) on Video Question Answering tasks.

We first systematically test the performance of MIF (Most Implied Frames) with various backbone models serving as captioner and scorer, which work together to perform a "question-and-vision-aware" sampling. Drawing on these results and their analysis, we then propose the more lightweight MDF (Most Dominant Frames), which goes one step further by dropping the dependence on the question and executes a "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.
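As a rough illustration of what question-agnostic, vision-aware sampling can look like, the sketch below scores each frame by its mean cosine similarity to all other frames and greedily keeps the K highest-scoring frames that are at least `min_gap` positions apart. The scoring rule, the `min_gap` parameter and all function names are assumptions made for illustration only; the paper and `extract_features.py` define the actual MDF procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dominant_frames(features, k, min_gap=1):
    """Pick up to k frame indices without looking at the question.

    ASSUMPTION: "dominance" is modelled here as the mean similarity of a
    frame to every other frame; this is a stand-in, not the paper's rule.
    """
    n = len(features)
    scores = []
    for i in range(n):
        others = [cosine(features[i], features[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)

    # Visit frames from most to least dominant, skipping near-duplicates
    # that sit closer than min_gap to an already selected frame.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)
```

In practice the feature vectors would come from the vision encoder of the chosen VLM; plain Python lists are used here only to keep the sketch dependency-free.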

Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time. We test our methods on three models (CLIP, GIT and All-in-one) and four datasets (MSVD-QA, MSRVTT-QA, TGIF-Frame, NExT-QA). The implementations on CLIP (including our refined structure CLIP-Dec, which significantly improves on the performance of raw CLIP) and GIT are in the folder clip_and_git, while the implementation on All-in-one is in the folder all_in_one.
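In HDF5 terms, a "dataset" is a named array stored inside the file. The following sketch shows how sampled frames could be written to and read back from such a file with `h5py`. The file name, the per-video key (`video_0001`) and the array shape are hypothetical; inspect the file produced by the preprocessing scripts to see the layout they actually use.

```python
import os
import tempfile

import h5py
import numpy as np


def save_frames(path, video_id, frames):
    """Store one array of sampled frames under a per-video key (assumed layout)."""
    with h5py.File(path, "a") as f:
        f.create_dataset(video_id, data=frames, compression="gzip")


def load_frames(path, video_id):
    """Read the frames of one video back into memory as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[video_id][...]


# Round trip with dummy data: 3 sampled frames of size 32x32 with 3 channels.
path = os.path.join(tempfile.mkdtemp(), "sampled_frames.h5")
frames = np.zeros((3, 32, 32, 3), dtype=np.uint8)
save_frames(path, "video_0001", frames)
restored = load_frames(path, "video_0001")
```

Reading by key loads only the requested video, so the whole file never has to fit in memory at once.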

Usage

1. Downloading Datasets

Please visit the corresponding repositories and follow the instructions there to download the datasets.

MSVD and MSRVTT
TGIF
NExT-QA

The suggested path to store these datasets is "model/dataset/<dataset_name>"

2. Preprocessing

The sampling code is the same for all three models and is located in the folder "clip_and_git/src/preprocessing".

To sample with the MDF method, run the Python script as follows:
```
python extract_features.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --model_name=<vlm_model_name> ... (other hps)
```
If the script raises an out-of-memory exception, please use a smaller chunksize (default=512) to shrink the input size per computation.
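The effect of the chunk size can be pictured with the minimal sketch below: frames are encoded in slices of at most `chunksize` items, so peak memory scales with the chunk rather than with the whole video. `encode` is a hypothetical stand-in for the model's feature extractor, and both function names are assumptions rather than the repository's own code.

```python
def iter_chunks(items, chunksize=512):
    """Yield consecutive slices of at most `chunksize` items."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


def encode_in_chunks(frames, encode, chunksize=512):
    """Encode frames chunk by chunk and concatenate the results.

    `encode` is a placeholder: any callable mapping a list of frames
    to a list of feature vectors of the same length.
    """
    features = []
    for chunk in iter_chunks(frames, chunksize):
        features.extend(encode(chunk))
    return features
```

Halving `chunksize` roughly halves the size of each batch passed to the encoder, at the cost of more forward passes.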

To sample with the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:

```
python extract_features.py --sampling_strategy='uni' --K 16 ...
```
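For intuition, uniform sampling of K frames can be sketched as taking the midpoint of K equal-length segments of the video. The midpoint rule and the function name are assumptions for illustration; the indices chosen by `extract_features.py` may differ.

```python
def uniform_indices(num_frames, k):
    """Return up to k evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot sample more frames than exist
    # Split the video into k equal segments and take each segment's midpoint.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]
```

For a 160-frame video and K=16, this keeps one frame out of every ten.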

Then run the Python script twice, first to generate captions (task 'gen_cap') and then to start sampling (task 'gen_inds'):

```
python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_cap'

python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_inds'
```
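Conceptually, the two tasks correspond to captioning each candidate frame and then ranking the captions against the question. The sketch below mimics the ranking step with a simple token-overlap score; in the real pipeline the captions come from the model given by `--vlm_model` and the scores from the model given by `--sim_model`, so the scorer and the function names here are purely hypothetical stand-ins.

```python
def token_overlap(question, caption):
    """Jaccard overlap of lowercase tokens (a toy stand-in for the scorer)."""
    q = set(question.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


def most_implied_frames(question, captions, k, score=token_overlap):
    """Return the indices of the k frames whose captions best match the question."""
    scored = [(score(question, cap), idx) for idx, cap in enumerate(captions)]
    # Highest score first; ties are broken by the earlier frame index.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return sorted(idx for _, idx in top)


# Captions of four uniformly pre-sampled frames (made-up example data).
captions = [
    "a man walks a dog",
    "the dog catches a frisbee",
    "an empty park bench",
    "the dog runs to the man",
]
picked = most_implied_frames("what does the dog catch", captions, k=2)
```

Swapping `score` for an embedding-based similarity function leaves the selection logic unchanged.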

3. Training and Inference

For experiments on CLIP and GIT, please modify our provided reference scripts (in src/scripts). For all-in-one, please check its attached README file for more details.

Results (Partial)

The following results are prediction accuracies, as defined and customized for each dataset/model in our paper.

CLIP-Dec (3 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
noDec	27.7	30.3	42.8
Uniform	33.8	33.7	47.2
MDF	35.0	35.2	63.2
MIF	35.0	35.4	61.8

GIT-Base (6 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
Report	51.2	41.0	69.1
Uniform	52.2	41.1	67.5
MDF	55.3	42.0	69.9
MIF	54.5	42.3	69.6

AIO-Base (3 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
Report	46.5	42.9	64.2
Reprd.	46.1	42.7	64.0
MDF	46.9	43.8	66.2
MIF	46.7	44.0	65.9

AIO-Base+ on NExT-QA (3 Frame)

Method	Val	Test
Base	48.4	48.1
MIF	49.7	49.5
MDF	50.2	49.8

BLIP2-T5XXL on NExT-QA (3 Frame)

Method	Val	Test
Base	60.1	59.7
MIF	61.5	61.2
MDF	61.8	61.1

Citation

Please cite our paper if you find this project related to your work:

@inproceedings{han2024self,
  title={Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models},
  author={Han, Wei and Chen, Hui and Kan, Min-Yen and Poria, Soujanya},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
  pages={2522--2534},
  year={2024}
}

Acknowledgement

The code for AIO is adapted from the official AIO implementation.

Contact

If you have any enquiries about our code and paper, feel free to contact us at henryhan88888@gmail.com or chchenhui1996@gmail.com.
