Otter

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Abstract

Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which was later extended into InstructGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers for more researchers to incorporate the models into their customized training and inference pipelines.

How to use it?

Use the model

import torch
from mmpretrain import get_model, inference_model

model = get_model('otter-9b_3rdparty_caption', pretrained=True, device='cuda', generation_cfg=dict(max_new_tokens=50))
out = inference_model(model, 'demo/cat-dog.png')
print(out)
# {'pred_caption': 'The image features two adorable small puppies sitting next to each other on the grass. One puppy is brown and white, while the other is tan and white. They appear to be relaxing outdoors, enjoying each other'}
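
The same call pattern extends to several images. The following is a minimal sketch rather than part of MMPretrain: `build_otter_infer` and `caption_images` are hypothetical helper names, and the inference callable is passed in as an argument so the loop can be exercised with a stub instead of the 9B checkpoint. Only `get_model` and `inference_model`, as used above, come from the library.

```python
from typing import Callable, Dict, Iterable


def build_otter_infer(device: str = "cuda",
                      max_new_tokens: int = 50) -> Callable[[str], dict]:
    """Return a callable that maps an image path to the model's output dict.

    mmpretrain is imported lazily, so this module can be imported (and the
    helper below tested) on a machine where it is not installed.
    """
    from mmpretrain import get_model, inference_model

    model = get_model(
        "otter-9b_3rdparty_caption",
        pretrained=True,
        device=device,
        generation_cfg=dict(max_new_tokens=max_new_tokens),
    )
    return lambda image_path: inference_model(model, image_path)


def caption_images(image_paths: Iterable[str],
                   infer: Callable[[str], dict]) -> Dict[str, str]:
    """Run `infer` on each path and collect the 'pred_caption' field."""
    return {path: infer(path)["pred_caption"] for path in image_paths}
```

A call such as `caption_images(['demo/cat-dog.png'], build_otter_infer())` would load the model once and reuse it for every image. The sample caption above appears to stop mid-sentence, which is consistent with the `max_new_tokens=50` limit; raising that value is one way to allow longer captions.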

Test Command

Prepare your dataset according to the docs.

Test:

python tools/test.py configs/otter/otter-9b_caption.py https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth

Models and results

Image Caption on COCO

Model	Params (M)	BLEU-4	CIDEr	Config	Download
`otter-9b_3rdparty_caption`*	8220.45	Upcoming	Upcoming	config	model

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

Visual Question Answering on VQAv2

Model	Params (M)	Accuracy	Config	Download
`otter-9b_3rdparty_vqa`*	8220.45	Upcoming	config	model

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.
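
For orientation on the Accuracy column, the metric commonly reported on VQAv2 gives a prediction full credit when at least three human annotators gave the same answer, and partial credit otherwise. The sketch below shows that simplified rule, assuming each question comes with a list of human answers (ten in VQAv2). The function names are illustrative, and this is not the evaluator used by this repository: official evaluation scripts also normalize punctuation and articles and average over annotator subsets, so their numbers can differ slightly.

```python
from typing import Sequence


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: min(#annotators giving this answer / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions: Sequence[str],
                      references: Sequence[Sequence[str]]) -> float:
    """Average the per-question score over a dataset, as a percentage."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    scores = [vqa_accuracy(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

For example, an answer matched by two of ten annotators scores 2/3 for that question, while an answer matched by three or more scores 1.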

Citation

@article{li2023otter,
  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2305.03726},
  year={2023}
}

@article{li2023mimicit,
    title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
    author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
    year={2023},
    eprint={2306.05425},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
