Awesome Vision-Language Models

This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:

Vision-Language Models for Vision Tasks: A Survey [Paper]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

🤩 Our paper has been selected for the TPAMI Top 50 Popular Papers list!


Feel free to open a pull request or contact us if you find any related papers that are not included here.

The process to submit a pull request is as follows:

  • a. Fork the project into your own repository.
  • b. Add the title, paper link, conference, and project/code link to README.md using the following format (an example entry is given below):
  |[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|
  • c. Submit the pull request to this branch.
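
For example, a completed entry for CLIP (ICML 2021) would look like:

  |[CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)|ICML 2021|[Code](https://github.com/OpenAI/CLIP)|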

🔥 News

Last updated on 2024/11/26

VLM Pre-training Methods

  • [arXiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper][Code]
  • [CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [Paper][Code]
  • [CVPR 2024] Do Vision and Language Encoders Represent the World Similarly? [Paper][Code]
  • [CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking [Paper][Code]
  • [CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models [Paper]
  • [CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era [Paper][Code]
  • [CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models [Paper]
  • [CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning [Paper][Code]
  • [CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [Paper][Code]
  • [CVPR 2024] VILA: On Pre-training for Visual Language Models [Paper]
  • [CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection [Paper][Code]
  • [CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions [Paper]
  • [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [Paper][Code]
  • [ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [Paper][Code]
  • [ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models [Paper]

VLM Transfer Learning Methods

  • [NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models [Paper]
  • [NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation [Paper][Code]
  • [IJCV 2024] Progressive Visual Prompt Learning with Contrastive Feature Re-formation [Paper][Code]
  • [ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [Paper][Code]
  • [ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [Paper][Code]
  • [ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models [Paper]
  • [ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [Paper][Code]
  • [CVPR 2024] Towards Better Vision-Inspired Vision-Language Models [Paper]
  • [CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models [Paper][Code]
  • [CVPR 2024] Any-Shift Prompting for Generalization over Distributions [Paper]
  • [CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models [Paper][Code]
  • [CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models [Paper]
  • [CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [Paper][Code]
  • [CVPR 2024] Visual In-Context Prompting [Paper][Code]
  • [CVPR 2024] TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model [Paper][Code]
  • [CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models [Paper][Code]
  • [CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Paper][Code]
  • [ICLR 2024] DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning [Paper][Code]
  • [ICLR 2024] Nemesis: Normalizing the soft-prompt vectors of vision-language models [Paper]
  • [ICLR 2024] Prompt Gradient Projection for Continual Learning [Paper]
  • [ICLR 2024] An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]
  • [ICLR 2024] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [Paper][Code]
  • [ICLR 2024] Text-driven Prompt Generation for Vision-Language Models in Federated Learning [Paper]
  • [ICLR 2024] Consistency-guided Prompt Learning for Vision-Language Models [Paper]
  • [ICLR 2024] C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion [Paper]
  • [arXiv 2024] Learning to Prompt Segment Anything Models [Paper]

VLM Knowledge Distillation for Detection

  • [NeurIPS 2024] Open-Vocabulary Object Detection via Language Hierarchy [Paper]
  • [CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model [Paper][Code]
  • [ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [Paper]
  • [ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction [Paper]

VLM Knowledge Distillation for Segmentation

  • [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper]

VLM Knowledge Distillation for Other Vision Tasks

  • [ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition [Paper][Project]
  • [ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection [Paper][Code]
  • [CVPR 2023] EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata [Paper][Code]

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently: they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
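
As a quick illustration of the zero-shot prediction paradigm described above, the following minimal Python sketch performs CLIP-style zero-shot image classification with the Hugging Face transformers library; the checkpoint name, image path, and candidate class prompts are illustrative assumptions rather than anything prescribed by the survey.

```python
# Minimal sketch: CLIP-style zero-shot image classification.
# Assumes the `transformers` and `Pillow` packages are installed; the checkpoint
# name, image path, and class prompts below are illustrative only.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image
classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores -> probabilities over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```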

Citation

If you find our work useful in your research, please consider citing:

@article{zhang2024vision,
  title={Vision-language models for vision tasks: A survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}


Datasets

Datasets for VLM Pre-training

| Dataset | Year | Num of Image-Text Pairs | Language | Project |
|---|---|---|---|---|
| SBU Caption | 2011 | 1M | English | Project |
| COCO Caption | 2016 | 1.5M | English | Project |
| Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
| Visual Genome | 2017 | 5.4M | English | Project |
| Conceptual Captions 3M | 2018 | 3.3M | English | Project |
| Localized Narratives | 2020 | 0.87M | English | Project |
| Conceptual 12M | 2021 | 12M | English | Project |
| Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
| Red Caps | 2021 | 12M | English | Project |
| LAION400M | 2021 | 400M | English | Project |
| LAION5B | 2022 | 5B | Over 100 Languages | Project |
| WuKong | 2022 | 100M | Chinese | Project |
| CLIP | 2021 | 400M | English | - |
| ALIGN | 2021 | 1.8B | English | - |
| FILIP | 2021 | 300M | English | - |
| WebLI | 2022 | 12B | English | - |

Datasets for VLM Evaluation

Image Classification

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
| Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
| PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
| Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
| CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
| CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
| ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
| SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
| SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
| STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
| GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
| KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
| IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
| Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
| Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
| FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
| Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
| Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
| Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
| Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
| Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
| RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
| CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
| PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
| EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
| Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
| Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |

Image-Text Retrieval

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
| COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |

Action Recognition

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
| Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
| RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |

Object Detection

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
| COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
| LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
| ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |

Semantic Segmentation

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
| PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
| Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
| ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |

Vision-Language Pre-training Methods

Pre-training with Contrastive Objective

| Paper | Published in | Code/Project |
|---|---|---|
| CLIP: Learning Transferable Visual Models From Natural Language Supervision | ICML 2021 | Code |
| ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ICML 2021 | - |
| OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation | arXiv 2021 | Code |
| Florence: A New Foundation Model for Computer Vision | arXiv 2021 | - |
| RegionCLIP: Region-based Language-Image Pretraining | arXiv 2021 | Code |
| DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm | ICLR 2022 | Code |
| FILIP: Fine-grained Interactive Language-Image Pre-Training | ICLR 2022 | - |
| KELIP: Large-scale Bilingual Language-Image Contrastive Learning | ICLRW 2022 | Code |
| ZeroVL: Contrastive Vision-Language Pre-training with Limited Resources | ECCV 2022 | Code |
| SLIP: Self-supervision meets Language-Image Pre-training | ECCV 2022 | Code |
| UniCL: Unified Contrastive Learning in Image-Text-Label Space | CVPR 2022 | Code |
| LiT: Zero-Shot Transfer with Locked-image text Tuning | CVPR 2022 | Code |
| GroupViT: Semantic Segmentation Emerges from Text Supervision | CVPR 2022 | Code |
| PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining | NeurIPS 2022 | - |
| UniCLIP: Unified Framework for Contrastive Language-Image Pre-training | NeurIPS 2022 | - |
| K-LITE: Learning Transferable Visual Models with External Knowledge | NeurIPS 2022 | Code |
| FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | NeurIPS 2022 | Code |
| Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | arXiv 2022 | Code |
| AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | arXiv 2022 | Code |
| SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | arXiv 2022 | Code |
| NLIP: Noise-robust Language-Image Pre-training | AAAI 2023 | - |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ICLR 2023 | Project |
| HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention | ICLR 2023 | Code |
| CLIPPO: Image-and-Language Understanding from Pixels Only | CVPR 2023 | Code |
| RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training | CVPR 2023 | - |
| DeAR: Debiasing Vision-Language Models with Additive Residuals | CVPR 2023 | - |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | CVPR 2023 | Code |
| LaCLIP: Improving CLIP Training with Language Rewrites | NeurIPS 2023 | Code |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | ICCV 2023 | Code |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | ICCV 2023 | - |
| CLIPpy: Perceptual Grouping in Contrastive Vision-Language Models | ICCV 2023 | - |
| Efficient Vision-Language Pre-training by Cluster Masking | CVPR 2024 | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | CVPR 2024 | Code |
| Iterated Learning Improves Compositionality in Large Vision-Language Models | CVPR 2024 | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | CVPR 2024 | Code |
| Retrieval-Enhanced Contrastive Vision-Text Models | ICLR 2024 | - |
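
Conceptually, most of the methods above build on the symmetric image-text contrastive (InfoNCE) objective popularized by CLIP. The following PyTorch sketch shows that objective in its simplest form; the encoders and batch construction are assumed to exist elsewhere, and this is not any particular paper's implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the image/text encoders;
    the i-th image is assumed to match the i-th text (the positives),
    while all other pairs in the batch serve as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```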

Pre-training with Generative Objective

| Paper | Published in | Code/Project |
|---|---|---|
| FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
| Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
| SAM: Segment Anything | arXiv 2023 | Code |
| SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
| Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
| Generative Region-Language Pretraining for Open-Ended Object Detection | CVPR 2024 | Code |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR 2024 | Code |
| VILA: On Pre-training for Visual Language Models | CVPR 2024 | - |
| Enhancing Vision-Language Pre-training with Rich Supervisions | CVPR 2024 | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR 2024 | Code |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | Code |

Pre-training with Alignment Objective

| Paper | Published in | Code/Project |
|---|---|---|
| GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
| DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
| nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
| Do Vision and Language Encoders Represent the World Similarly? | CVPR 2024 | Code |
| Non-autoregressive Sequence-to-Sequence Vision-Language Models | CVPR 2024 | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | CVPR 2024 | Code |
| RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness | arXiv 2024 | Code |

Vision-Language Model Transfer Learning Methods

Transfer with Prompt Tuning

Transfer with Text Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| CoOp: Learning to Prompt for Vision-Language Models | IJCV 2022 | Code |
| CoCoOp: Conditional Prompt Learning for Vision-Language Models | CVPR 2022 | Code |
| ProDA: Prompt Distribution Learning | CVPR 2022 | - |
| DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | CVPR 2022 | Code |
| TPT: Test-time prompt tuning for zero-shot generalization in vision-language models | NeurIPS 2022 | Code |
| DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | NeurIPS 2022 | Code |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | EMNLP 2022 | Code |
| Bayesian Prompt Learning for Image-Language Model Generalization | arXiv 2022 | - |
| UPL: Unsupervised Prompt Learning for Vision-Language Models | arXiv 2022 | Code |
| ProGrad: Prompt-aligned Gradient for Prompt Tuning | arXiv 2022 | Code |
| SoftCPT: Prompt Tuning with Soft Context Sharing for Vision-Language Models | arXiv 2022 | Code |
| SubPT: Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models | TCSVT 2023 | Code |
| LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models | CVPR 2023 | Code |
| LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition | arXiv 2023 | Code |
| Texts as Images in Prompt Tuning for Multi-Label Image Recognition | CVPR 2023 | Code |
| Visual-Language Prompt Tuning with Knowledge-guided Context Optimization | CVPR 2023 | Code |
| Learning to Name Classes for Vision and Language Models | CVPR 2023 | - |
| PLOT: Prompt Learning with Optimal Transport for Vision-Language Models | ICLR 2023 | Code |
| CuPL: What does a platypus look like? Generating customized prompts for zero-shot image classification | ICCV 2023 | Code |
| ProTeCt: Prompt Tuning for Hierarchical Consistency | arXiv 2023 | - |
| Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning | arXiv 2023 | Code |
| Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? | ICCV 2023 | Code |
| Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models | ICCV 2023 | - |
| Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models | ICCV 2023 | - |
| Read-only Prompt Optimization for Vision-Language Few-shot Learning | ICCV 2023 | Code |
| Bayesian Prompt Learning for Image-Language Model Generalization | ICCV 2023 | Code |
| Distribution-Aware Prompt Tuning for Vision-Language Models | ICCV 2023 | Code |
| LPT: Long-Tailed Prompt Tuning For Image Classification | ICCV 2023 | Code |
| Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning | ICCV 2023 | Code |
| Efficient Test-Time Prompt Tuning for Vision-Language Models | arXiv 2024 | - |
| Text-driven Prompt Generation for Vision-Language Models in Federated Learning | ICLR 2024 | - |
| C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion | ICLR 2024 | - |
| Prompt Gradient Projection for Continual Learning | ICLR 2024 | - |
| Nemesis: Normalizing the soft-prompt vectors of vision-language models | ICLR 2024 | Code |
| DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning | ICLR 2024 | Code |
| TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model | CVPR 2024 | Code |
| One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models | CVPR 2024 | Code |
| Any-Shift Prompting for Generalization over Distributions | CVPR 2024 | - |
| Towards Better Vision-Inspired Vision-Language Models | CVPR 2024 | - |
| Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models | ECCV 2024 | Code |
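
Many of the entries above follow the CoOp recipe: the VLM stays frozen and only a handful of continuous context vectors, prepended to the class-name token embeddings that are fed to the text encoder, are optimized on the downstream data. The PyTorch sketch below illustrates that idea under simplifying assumptions (a frozen text encoder and pre-computed class-name embeddings are assumed to exist elsewhere; this is not any specific paper's code).

```python
import torch
import torch.nn as nn

class TextPromptLearner(nn.Module):
    """CoOp-style learnable text context: a small set of trainable context
    vectors is prepended to each (frozen) class-name token embedding."""

    def __init__(self, class_name_emb, n_ctx=16):
        # class_name_emb: (n_classes, n_name_tokens, dim) frozen token embeddings
        super().__init__()
        self.register_buffer("class_name_emb", class_name_emb)
        dim = class_name_emb.size(-1)
        # The only trainable parameters: context vectors shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self):
        # Expand the shared context to every class and prepend it to the names.
        n_classes = self.class_name_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_ctx + n_name_tokens, dim), fed to the frozen text encoder.
        return torch.cat([ctx, self.class_name_emb], dim=1)
```

At adaptation time, the returned prompt embeddings are passed through the frozen text encoder and compared with frozen image features via cosine similarity; only the context vectors receive gradients.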

Transfer with Visual Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
| Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
| Fine-Grained Visual Prompting | arXiv 2023 | - |
| LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models | ICCV 2023 | Code |
| Progressive Visual Prompt Learning with Contrastive Feature Re-formation | IJCV 2024 | Code |
| Visual In-Context Prompting | CVPR 2024 | Code |
| FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance | ECCV 2024 | Code |

Transfer with Text and Visual Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
| MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
| CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
| MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |
| Learning to Prompt Segment Anything Models | arXiv 2024 | - |
| CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts | ECCV 2024 | Code |
| An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models | ICLR 2024 | - |
| GalLoP: Learning Global and Local Prompts for Vision-Language Models | ECCV 2024 | - |

Transfer with Feature Adapter

| Paper | Published in | Code/Project |
|---|---|---|
| CLIP-Adapter: Better Vision-Language Models with Feature Adapters | arXiv 2021 | Code |
| Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification | ECCV 2022 | Code |
| SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models | BMVC 2022 | Code |
| CLIPPR: Improving Zero-Shot Models with Label Distribution Priors | arXiv 2022 | Code |
| SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification | arXiv 2022 | - |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | ICCV 2023 | Code |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | ICCV 2023 | Code |
| SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More | arXiv 2023 | Code |
| Segment Anything in High Quality | arXiv 2023 | Code |
| HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding | arXiv 2023 | Code |
| CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models | arXiv 2023 | - |
| AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation | NeurIPS 2024 | Code |
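
Feature-adapter methods such as CLIP-Adapter keep both VLM encoders frozen and learn only a light-weight module on top of the extracted features, blending its output back into the original feature with a residual ratio. A minimal sketch of such an adapter head is shown below; the feature dimension, bottleneck ratio, and blending weight are illustrative defaults, not values from any specific paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """CLIP-Adapter-style residual adapter on top of frozen VLM features."""

    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha  # residual blending ratio (illustrative default)
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        # Blend the adapted feature with the original (frozen) one, then
        # re-normalize so it can be matched to text embeddings by cosine similarity.
        adapted = self.net(feat)
        out = self.alpha * adapted + (1 - self.alpha) * feat
        return F.normalize(out, dim=-1)
```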

Transfer with Other Methods

| Paper | Published in | Code/Project |
|---|---|---|
| VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts | arXiv 2021 | - |
| Wise-FT: Robust fine-tuning of zero-shot models | CVPR 2022 | Code |
| MaskCLIP: Extract Free Dense Labels from CLIP | ECCV 2022 | Code |
| MUST: Masked Unsupervised Self-training for Label-free Image Classification | ICLR 2023 | Code |
| CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention | AAAI 2023 | Code |
| Semantic Prompt for Few-Shot Image Recognition | CVPR 2023 | - |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | CVPR 2023 | Code |
| Task Residual for Tuning Vision-Language Models | CVPR 2023 | Code |
| Deeply Coupled Cross-Modal Prompt Learning | ACL 2023 | Code |
| Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation | arXiv 2023 | - |
| Personalize Segment Anything Model with One Shot | arXiv 2023 | Code |
| CHiLS: Zero-shot image classification with hierarchical label sets | ICML 2023 | Code |
| Improving Zero-shot Generalization and Robustness of Multi-modal Models | CVPR 2023 | Code |
| Exploiting Category Names for Few-Shot Classification with Vision-Language Models | ICLRW 2023 | - |
| Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models | arXiv 2023 | Code |
| Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models | ICCV 2023 | Code |
| PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization | ICCV 2023 | Code |
| PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation | ICCV 2023 | - |
| Black Box Few-Shot Adaptation for Vision-Language models | ICCV 2023 | Code |
| AD-CLIP: Adapting Domains in Prompt Space Using CLIP | ICCVW 2023 | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | arXiv 2023 | Code |
| Language Models as Black-Box Optimizers for Vision-Language Models | arXiv 2023 | - |
| Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | ICLR 2024 | Code |
| Consistency-guided Prompt Learning for Vision-Language Models | ICLR 2024 | - |
| Efficient Test-Time Adaptation of Vision-Language Models | CVPR 2024 | Code |
| Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models | CVPR 2024 | Code |
| A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models | CVPR 2024 | Code |
| Anchor-based Robust Finetuning of Vision-Language Models | CVPR 2024 | - |
| Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners | CVPR 2024 | Code |

Vision-Language Model Knowledge Distillation Methods

Knowledge Distillation for Object Detection

| Paper | Published in | Code/Project |
|---|---|---|
| ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | ICLR 2022 | Code |
| DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | CVPR 2022 | Code |
| XPM: Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling | CVPR 2022 | Code |
| Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection | NeurIPS 2022 | Code |
| PromptDet: Towards Open-vocabulary Detection using Uncurated Images | ECCV 2022 | Code |
| PB-OVD: Open Vocabulary Object Detection with Pseudo Bounding-Box Labels | ECCV 2022 | Code |
| OV-DETR: Open-Vocabulary DETR with Conditional Matching | ECCV 2022 | Code |
| Detic: Detecting Twenty-thousand Classes using Image-level Supervision | ECCV 2022 | Code |
| OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers | ECCV 2022 | Code |
| VL-PLM: Exploiting Unlabeled Data with Vision and Language Models for Object Detection | ECCV 2022 | Code |
| ZSD-YOLO: Zero-shot Object Detection Through Vision-Language Embedding Alignment | arXiv 2022 | Code |
| HierKD: Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation | arXiv 2022 | Code |
| VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection | ICLR 2023 | Code |
| F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | ICLR 2023 | Code |
| CondHead: Learning to Detect and Segment for Open Vocabulary Object Detection | CVPR 2023 | - |
| Aligning Bag of Regions for Open-Vocabulary Object Detection | CVPR 2023 | Code |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | CVPR 2023 | Code |
| Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | CVPR 2023 | Code |
| CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | CVPR 2023 | Code |
| DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | CVPR 2023 | - |
| Detecting Everything in the Open World: Towards Universal Object Detection | CVPR 2023 | Code |
| CapDet: Unifying Dense Captioning and Open-World Detection Pretraining | CVPR 2023 | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv 2023 | Code |
| Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models | arXiv 2023 | Code |
| EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment | ICCV 2023 | Code |
| Improving Pseudo Labels for Open-Vocabulary Object Detection | arXiv 2023 | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | Code |
| LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors | ICLR 2024 | - |
| Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction | ICLR 2024 | - |
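
Many of the distillation-based open-vocabulary detectors above follow a ViLD-style recipe: region embeddings produced by the detection head are regressed toward the frozen VLM's image embeddings of the cropped proposals, while classification is performed against the VLM's text embeddings of category names. The sketch below outlines those two losses under simplifying assumptions; the encoders and tensor shapes are placeholders, not any paper's exact implementation.

```python
import torch.nn.functional as F

def distillation_losses(region_emb, clip_crop_emb, text_emb, labels, temperature=0.01):
    """ViLD-style losses for open-vocabulary detection (simplified sketch).

    region_emb:    (n_regions, dim) embeddings from the detection head
    clip_crop_emb: (n_regions, dim) frozen VLM embeddings of the cropped proposals
    text_emb:      (n_classes, dim) frozen VLM text embeddings of class names
    labels:        (n_regions,) ground-truth class indices for base-class regions
    """
    region_emb = F.normalize(region_emb, dim=-1)
    clip_crop_emb = F.normalize(clip_crop_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # 1) Distillation: match region embeddings to the VLM's image embeddings.
    distill_loss = F.l1_loss(region_emb, clip_crop_emb)

    # 2) Classification: score regions against class-name text embeddings.
    logits = region_emb @ text_emb.t() / temperature
    cls_loss = F.cross_entropy(logits, labels)
    return distill_loss, cls_loss
```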

Knowledge Distillation for Semantic Segmentation

| Paper | Published in | Code/Project |
|---|---|---|
| SSIW: Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples | arXiv 2021 | - |
| ReCo: Retrieve and Co-segment for Zero-shot Transfer | NeurIPS 2022 | Code |
| CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation | CVPR 2022 | Code |
| CLIPSeg: Image Segmentation Using Text and Image Prompts | CVPR 2022 | Code |
| ZegFormer: Decoupling Zero-Shot Semantic Segmentation | CVPR 2022 | Code |
| LSeg: Language-driven Semantic Segmentation | ICLR 2022 | Code |
| ZSSeg: A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | ECCV 2022 | Code |
| OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels | ECCV 2022 | Code |
| Fusioner: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models | BMVC 2022 | Code |
| OVSeg: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | CVPR 2023 | Code |
| ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation | CVPR 2023 | Code |
| CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation | CVPR 2023 | Code |
| FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | CVPR 2023 | Code |
| Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations | CVPR 2023 | Code |
| Exploring Open-Vocabulary Semantic Segmentation without Human Labels | arXiv 2023 | - |
| OpenVIS: Open-vocabulary Video Instance Segmentation | arXiv 2023 | - |
| Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation | arXiv 2023 | - |
| Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation | arXiv 2023 | Code |
| Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models | arXiv 2023 | - |
| SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning | ICCV 2023 | Code |
| ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation | arXiv 2023 | - |
| Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP | arXiv 2023 | Code |
| CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | ICLR 2024 | - |

Knowledge Distillation for Other Tasks

| Paper | Published in | Code/Project |
|---|---|---|
| Controlling Vision-Language Models for Universal Image Restoration | arXiv 2023 | Code |
| FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition | ICLR 2024 | Project |
| AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection | ICLR 2024 | Code |