
How to extend this to more modalities? #11

Open
nikhilbyte opened this issue Nov 25, 2024 · 3 comments

@nikhilbyte

Hey, thanks a lot for your work!
I want to extend this model to more modalities (audio and video, along with text and images). How difficult would that be? Also, if possible, how would that be done?

@XMHZZ2018
Contributor

XMHZZ2018 commented Nov 25, 2024

Thank you for your interest in our work! Many existing VLMs, including Phi-3.5-V (which we used), are capable of handling multi-image and video content. Therefore, extending support to the video modality should be straightforward. However, some additional fine-tuning might be necessary, as our current VLM2Vec implementation only uses single-image inputs as training data.

As for audio, I’m not entirely sure about its integration. One possible approach could be to use an audio embedding model and then apply a late fusion layer to combine it with text, image, or video embeddings.

@nikhilbyte
Author

Thanks a lot for the quick response @XMHZZ2018.
You're right that a lot of VLMs/MLMs support video. But what I'm looking for is an embedding framework that supports N modalities, where N currently includes text, video, and images. The idea of a late fusion layer seems interesting. Could you explain it, or point me toward where in this codebase I could make that change to include more modalities? I'm happy to contribute and help grow/scale this to more modalities!
Thanks in advance :)

@XMHZZ2018
Contributor

@nikhilbyte

I believe the VLM2Vec framework supports text, video, and images, though additional fine-tuning may be required for video data. However, our current framework does not support audio, as audio data cannot be incorporated into the input in an interleaved format.

Regarding the late fusion layer, this approach is widely used in previous multimodal embedding models. For instance, one encoder processes text to generate text features, while another encoder processes images to generate image features. An additional fusion layer (e.g., MLP, self-attention, or even a simple averaging method) then combines these features. This method could potentially be extended to include audio features. For other modalities, I believe VLM2Vec is capable of providing support.
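Not part of the VLM2Vec codebase, just a rough PyTorch sketch of the late-fusion idea described above, with made-up names and feature dimensions (`LateFusionEmbedder`, the per-modality sizes), in case it helps as a starting point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionEmbedder(nn.Module):
    """Project per-modality features into a shared space and fuse them."""

    def __init__(self, dims, shared_dim=768):
        super().__init__()
        # one linear projection per modality into the shared embedding space
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in dims.items()}
        )
        # small MLP fusion head; self-attention or plain averaging would also work
        self.fuse = nn.Sequential(
            nn.Linear(shared_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, feats):
        # feats: {modality name: (batch, dim) features from that modality's encoder}
        projected = [self.proj[name](x) for name, x in feats.items()]
        pooled = torch.stack(projected, dim=0).mean(dim=0)  # average across modalities
        return F.normalize(self.fuse(pooled), dim=-1)       # unit-norm embedding


# toy usage with made-up feature dimensions for each (frozen) encoder
model = LateFusionEmbedder({"text": 3072, "image": 3072, "audio": 512})
embeddings = model({
    "text": torch.randn(2, 3072),
    "image": torch.randn(2, 3072),
    "audio": torch.randn(2, 512),
})
print(embeddings.shape)  # torch.Size([2, 768])
```

One possible setup would be to keep the existing VLM2Vec text/image/video embedding as one branch, plug an audio encoder in as another, and fine-tune only the projections and fusion head with a contrastive objective.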

Please let me know if this answers your question, and feel free to reach out if you'd like assistance in scaling to more modalities.
