
How to extend this to more modalities? #11

Open
nikhilbyte opened this issue Nov 25, 2024 · 3 comments

@nikhilbyte

Hey, thanks a lot for your work!
I want to extend this model to more modalities (audio and video, along with text and images). How difficult would that be? Also, if possible, how would that be done?

@XMHZZ2018
Contributor

XMHZZ2018 commented Nov 25, 2024

Thank you for your interest in our work! Many existing VLMs, including Phi-3.5-V (which we used), are capable of handling multi-image and video content. Therefore, extending support to the video modality should be straightforward. However, some additional fine-tuning might be necessary, as our current VLM2Vec implementation only uses single-image inputs as training data.

As for audio, I’m not entirely sure about its integration. One possible approach could be to use an audio embedding model and then apply a late fusion layer to combine it with text, image, or video embeddings.

@nikhilbyte
Author

Thanks a lot for the quick response @XMHZZ2018.
You're right that a lot of VLMs/MLMs support video. But what I'm looking for is an embedding framework that supports N modalities, where N currently includes text, video, and images. The idea of a late fusion layer seems interesting. Could you explain it, or point me toward where in this codebase I could make that change to include more modalities? I'm happy to contribute and help grow/scale this to more modalities!
Thanks in advance :)

@XMHZZ2018
Contributor

@nikhilbyte

I believe the VLM2Vec framework supports text, video, and images, though additional fine-tuning may be required for video data. However, our current framework does not support audio, as audio data cannot be incorporated into the input in an interleaved format.

Regarding the late fusion layer, this approach is widely used in previous multimodal embedding models. For instance, one encoder processes text to generate text features, while another encoder processes images to generate image features. An additional fusion layer (e.g., MLP, self-attention, or even a simple averaging method) then combines these features. This method could potentially be extended to include audio features. For other modalities, I believe VLM2Vec is capable of providing support.
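Not part of the VLM2Vec codebase, just a rough PyTorch sketch of the late-fusion idea described above, with made-up names and feature dimensions (`LateFusionEmbedder`, the per-modality sizes), in case it helps as a starting point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionEmbedder(nn.Module):
    """Project per-modality features into a shared space and fuse them."""

    def __init__(self, dims, shared_dim=768):
        super().__init__()
        # one linear projection per modality into the shared embedding space
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in dims.items()}
        )
        # small MLP fusion head; self-attention or plain averaging would also work
        self.fuse = nn.Sequential(
            nn.Linear(shared_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, feats):
        # feats: {modality name: (batch, dim) features from that modality's encoder}
        projected = [self.proj[name](x) for name, x in feats.items()]
        pooled = torch.stack(projected, dim=0).mean(dim=0)  # average across modalities
        return F.normalize(self.fuse(pooled), dim=-1)       # unit-norm embedding


# toy usage with made-up feature dimensions for each (frozen) encoder
model = LateFusionEmbedder({"text": 3072, "image": 3072, "audio": 512})
embeddings = model({
    "text": torch.randn(2, 3072),
    "image": torch.randn(2, 3072),
    "audio": torch.randn(2, 512),
})
print(embeddings.shape)  # torch.Size([2, 768])
```

One possible setup would be to keep the existing VLM2Vec text/image/video embedding as one branch, plug an audio encoder in as another, and fine-tune only the projections and fusion head with a contrastive objective.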

Please let me know if this answers your question, and feel free to reach out if you'd like assistance in scaling to more modalities.
