How to extend this to more modalities? #11
Thank you for your interest in our work! Many existing VLMs, including Phi-3.5-V (which we used), can handle multi-image and video content, so extending support to the video modality should be straightforward. However, some additional fine-tuning might be necessary, as our current VLM2Vec implementation only uses single-image inputs as training data. As for audio, I'm not entirely sure about its integration. One possible approach would be to use an audio embedding model and then apply a late fusion layer to combine its output with text, image, or video embeddings.
Thanks a lot for the quick response @XMHZZ2018.
I believe the VLM2Vec framework supports text, video, and images, though additional fine-tuning may be required for video data. Our current framework does not support audio, however, because audio data cannot be incorporated into the input in an interleaved format. Regarding the late fusion layer: this approach is widely used in earlier multimodal embedding models. For instance, one encoder processes text to produce text features while another encoder processes images to produce image features, and an additional fusion layer (e.g., an MLP, self-attention, or even simple averaging) then combines these features. This method could be extended to include audio features. For other modalities, I believe VLM2Vec is capable of providing support. Please let me know if this answers your question, and feel free to reach out if you'd like assistance in scaling to more modalities.
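The late fusion idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration with NumPy: the embeddings, dimensions, and randomly initialized MLP weights are all placeholders, not part of VLM2Vec itself. In practice each embedding would come from a trained encoder (e.g., VLM2Vec for text/image, a separate audio encoder), and the fusion MLP would be trained, typically with a contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings; in practice these would come from
# trained encoders (e.g., VLM2Vec for text/image, an audio embedding model).
text_emb = rng.standard_normal(768)
image_emb = rng.standard_normal(768)
audio_emb = rng.standard_normal(768)

def late_fusion_mean(embs):
    """Simplest fusion: average the L2-normalized modality embeddings."""
    normed = [e / np.linalg.norm(e) for e in embs]
    return np.mean(normed, axis=0)

def late_fusion_mlp(embs, w1, b1, w2, b2):
    """Learned fusion: concatenate the embeddings, then a small 2-layer MLP."""
    x = np.concatenate(embs)
    h = np.maximum(0.0, x @ w1 + b1)  # ReLU hidden layer
    return h @ w2 + b2

fused_mean = late_fusion_mean([text_emb, image_emb, audio_emb])

# Randomly initialized MLP weights, only to show the shapes involved;
# real weights would be learned during fine-tuning.
w1 = rng.standard_normal((768 * 3, 512)) * 0.02
b1 = np.zeros(512)
w2 = rng.standard_normal((512, 768)) * 0.02
b2 = np.zeros(768)
fused_mlp = late_fusion_mlp([text_emb, image_emb, audio_emb], w1, b1, w2, b2)

print(fused_mean.shape, fused_mlp.shape)  # both embeddings remain 768-dim
```

Averaging needs no extra parameters but treats all modalities equally; the MLP variant can learn modality weighting at the cost of extra training.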
Hey, thanks a lot for your work!
I want to extend this model to more modalities (audio and video, along with text and images). How difficult would that be? Also, if possible, how would that work?