Hi, thanks for your work on the AV FGC task. I'd like to ask about some experiment details in your paper:
In Section 4.1 (Audio Modality) of your paper, you use logit averaging as the evaluation strategy, but in the audio-visual fusion part of Section 4.1 you may not be able to use the same averaging strategy for mid-fusion, since the two conflict. I'd like to know how you process the audio in the mid-fusion setting. Do you use only one spectrogram there? If so, are the uni-modal and multi-modal results comparable?
In the mid-fusion part you use MBT as a SOTA fusion method. I'd like to know whether you tried other, simpler mid-fusion methods such as concatenation, summation, or gating.
For the audio augmentation method, would it be possible to include the relevant code when you release your experiment pipeline? I would appreciate that.
Thank you for your interest in our work! Regarding your questions:
It is true that for multimodal fusion, one video input can only interact with one audio spectrogram input. However, we can adopt multiple views of one video and multiple views of one audio spectrogram, fuse one pair at a time, and finally average the logits. This is what we do, and we consider the results comparable.
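A rough sketch of this multi-view evaluation, assuming a `model(video, audio)` callable that returns class logits for one view pair (the names and shapes here are illustrative, not our exact code):

```python
import torch

def multiview_logits(model, video_views, audio_views):
    """Average logits over (video view, audio view) pairs.

    video_views: list of clip tensors, e.g. each (B, T, C, H, W)
    audio_views: list of spectrogram tensors, e.g. each (B, 1, F, T)
    model: takes one video view and one audio view and returns
           logits of shape (B, num_classes).
    """
    pair_logits = [model(v, a) for v in video_views for a in audio_views]
    # Average over all pairs; pairing views one-to-one instead would
    # only change this loop, not the idea.
    return torch.stack(pair_logits, dim=0).mean(dim=0)
```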
Since we use a transformer as the backbone, MBT or token concatenation are the most straightforward options. The MBT paper shows that concatenation is not as good as MBT, so we simply adopted MBT.
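To illustrate the mechanism, here is a minimal single-layer sketch of MBT-style bottleneck fusion (in the actual MBT, each stream is a full ViT and fusion only starts at a middle layer; the dimensions here are placeholders):

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One MBT-style layer: each modality attends over its own tokens
    plus shared bottleneck tokens, so cross-modal information flows
    only through the small bottleneck."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottleneck):
        na, nv = audio_tokens.size(1), video_tokens.size(1)
        a = self.audio_block(torch.cat([audio_tokens, bottleneck], dim=1))
        v = self.video_block(torch.cat([video_tokens, bottleneck], dim=1))
        # Average the two updated bottleneck copies so what each stream
        # wrote into the bottleneck is shared with the other.
        bottleneck = 0.5 * (a[:, na:] + v[:, nv:])
        return a[:, :na], v[:, :nv], bottleneck
```

Plain concatenation would instead attend over all audio and video tokens jointly in every layer, which is more expensive and, per the MBT paper, also performs worse.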
Yes, we will release the code. For your question, the code looks roughly like this:
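(A minimal SpecAugment-style sketch, assuming torchaudio; the exact transforms and parameters in the released pipeline may differ.)

```python
import torchaudio.transforms as T

# Illustrative SpecAugment-style masking on a log-mel spectrogram;
# the mask sizes below are placeholders, not the paper's exact values.
freq_mask = T.FrequencyMasking(freq_mask_param=24)
time_mask = T.TimeMasking(time_mask_param=48)

def augment_spec(spec):
    """spec: (B, n_mels, time) log-mel spectrogram tensor."""
    spec = freq_mask(spec)  # zero out a random band of mel bins
    spec = time_mask(spec)  # zero out a random span of time frames
    return spec
```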
Thanks for your reply! I would also like to ask another question:
In Table 3 of your paper, you report lower performance for the audio ResNet-18 after finetuning on video-audio. I find a similar result after finetuning a concatenation-based AV model composed of pretrained uni-modal models and then linear-probing the audio backbone. I would like to hear your opinion on why the audio backbone gets worse after finetuning. Many thanks!
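For reference, my linear-probe setup looks roughly like this (freeze the AV-finetuned audio backbone and train only a linear head; the names here are illustrative):

```python
import torch
import torch.nn as nn

def build_linear_probe(audio_backbone, feat_dim, num_classes):
    # Freeze the (AV-finetuned) audio backbone and train only a
    # linear classifier on its pooled features.
    for p in audio_backbone.parameters():
        p.requires_grad = False
    audio_backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer
```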