
Inquiry about the audio augmentation and evaluation setting for audio-visual mid-fusion #1

Open
Rick-Xu315 opened this issue Aug 9, 2022 · 2 comments

@Rick-Xu315

Hi, thanks for your work on the AV FGC task. I'd like to ask about a few experiment details in your paper:

  1. In Section 4.1 (Audio Modality) you use logit averaging as the evaluation strategy, but in the Audio-Visual Fusion part of Section 4.1 you presumably cannot use the same averaging strategy for mid-fusion, since the two would conflict. How do you process the audio in the mid-fusion part? Do you use only one spectrogram there? If so, are the uni-modal and multi-modal results comparable?
  2. In the mid-fusion part you use MBT as a SOTA fusion method. Did you ever try simpler mid-fusion methods such as concatenation, summation, or gating?
  3. For the audio augmentation method, would it be possible to include the relevant code when you release your experiment pipeline? I would appreciate that.
@rui1996
Collaborator

rui1996 commented Aug 17, 2022

Thank you for your interest in our work! Regarding your questions:

  1. It is true that for multimodal fusion, one video input can only interact with one audio spectrogram input. However, we can take multiple views of each video and multiple views of each audio spectrogram, fuse one pair of views at a time, and finally average the logits (see the sketch after the code below). This is what we do, and we consider the uni-modal and multi-modal results comparable.
  2. Since we are using a transformer as the backbone, MBT or token concatenation would be the most straightforward choices. The MBT paper shows that concatenation is not as good as MBT, so we simply adopt MBT.
  3. Yes, we will release the code. For your question, the code looks like this:
import numpy as np

# Both functions are methods of our dataset/augmentation class (hence `self`);
# `img` is a 2-D spectrogram of shape (freq_bins, time_frames).

def freq_masking(self, img, freq_factor=1.0, mask_len=15):
    # With probability `freq_factor` (1.0 by default, i.e. always), zero out a
    # band of up to `mask_len` frequency bins starting at a random position.
    factor = np.random.RandomState().rand()
    freq_len = img.shape[0]
    if factor <= freq_factor:
        start = np.random.randint(0, freq_len - mask_len)
        interval = np.random.randint(0, mask_len)
        img[start : start + interval, :] = 0
    return img

def time_masking(self, img, time_factor=1.0, mask_len=15):
    # With probability `time_factor`, zero out a band of up to `mask_len`
    # time frames starting at a random position.
    factor = np.random.RandomState().rand()
    time_len = img.shape[1]
    if factor <= time_factor:
        start = np.random.randint(0, time_len - mask_len)
        interval = np.random.randint(0, mask_len)
        img[:, start : start + interval] = 0
    return img
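
For context, a minimal sketch of the multi-view logit averaging described in point 1 might look like the following. This is not our exact pipeline; the model interface, the number of views, and the use of all view pairs are assumptions for illustration only.

import itertools
import torch

def multiview_logits(model, video_views, audio_views):
    # `model` is assumed to take one video clip and one audio spectrogram
    # (each with a batch dimension) and return class logits.
    # `video_views` / `audio_views` are lists of view tensors for one sample.
    logits = [
        model(v.unsqueeze(0), a.unsqueeze(0))
        for v, a in itertools.product(video_views, audio_views)
    ]
    # Average the per-pair logits to obtain the final prediction for the sample.
    return torch.stack(logits, dim=0).mean(dim=0)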

@Rick-Xu315
Author

Thanks for your reply! I would also like to ask another question:
In Table 3 of your paper, the audio ResNet-18 gets lower performance after finetuning on video-audio. I see a similar result after finetuning a concatenation-based AV model composed of pretrained uni-modal models and then linear probing the audio backbone. I would like to hear your opinion on why the audio backbone gets worse after finetuning. Many thanks!
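
To be concrete, this is roughly the linear-probe setup I mean. It is a sketch under assumptions, not my actual training script; the class count, the torchvision ResNet-18 stand-in, and the optimizer settings are placeholders.

import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_CLASSES = 100  # placeholder; set to the dataset's number of classes

# `audio_backbone` stands in for the audio ResNet-18 taken out of the
# finetuned concat-based AV model; a fresh torchvision resnet18 is used
# here only so the sketch is self-contained.
audio_backbone = resnet18()
audio_backbone.fc = nn.Identity()  # expose the 512-d pooled features

# Freeze the backbone; only the linear probe is trained.
for p in audio_backbone.parameters():
    p.requires_grad = False
audio_backbone.eval()

probe = nn.Linear(512, NUM_CLASSES)
optimizer = torch.optim.SGD(probe.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(spectrograms, labels):
    # `spectrograms` are assumed to be batched 3-channel spectrogram images
    # of shape (B, 3, H, W), matching the ResNet input convention.
    with torch.no_grad():
        feats = audio_backbone(spectrograms)
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()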
