The image captioning code for C3 is adapted from CapDec and refactored with PyTorch Lightning; wandb is integrated for logging.
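As a rough illustration of how the Lightning and wandb pieces fit together, here is a minimal sketch; `CaptionDecoder`, its toy loss, and the wandb project name are placeholders rather than this repo's actual classes.

```python
# Minimal sketch of a Lightning training loop logged to wandb.
# CaptionDecoder, the loss, and the project name are placeholders, not this repo's code.
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


class CaptionDecoder(pl.LightningModule):
    def __init__(self, embed_dim: int = 512, vocab_size: int = 50257):
        super().__init__()
        self.head = torch.nn.Linear(embed_dim, vocab_size)

    def training_step(self, batch, batch_idx):
        embeds, targets = batch                      # (B, D) embeddings, (B,) token ids
        loss = torch.nn.functional.cross_entropy(self.head(embeds), targets)
        self.log("train/loss", loss)                 # forwarded to wandb by the logger
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)


if __name__ == "__main__":
    logger = WandbLogger(project="c3-captioning")    # placeholder project name
    trainer = pl.Trainer(max_epochs=10, logger=logger)
    # trainer.fit(CaptionDecoder(), train_dataloaders=...)  # supply your own DataLoader
```

In this repo, the corresponding model, logger, and trainer settings are driven by the configuration files described below.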
Download the MSCOCO dataset from here.
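If a scripted download is convenient, the snippet below is one way to fetch COCO; the 2014 split, URLs, and `data/coco` target directory are assumptions, so check which split and layout the preprocessing scripts expect.

```python
# Sketch of downloading MS-COCO; the 2014 split and target paths are assumptions.
import urllib.request
import zipfile
from pathlib import Path

COCO_URLS = [
    "http://images.cocodataset.org/zips/train2014.zip",
    "http://images.cocodataset.org/zips/val2014.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
]

root = Path("data/coco")  # assumed target directory
root.mkdir(parents=True, exist_ok=True)

for url in COCO_URLS:
    archive = root / url.rsplit("/", 1)[-1]
    if not archive.exists():
        print(f"downloading {url} ...")
        urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root)
```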
Create the conda environment:
```bash
conda env create -f environment.yml
```
Note: if using ImageBind, follow the official ImageBind repo to create a separate `imagebind` conda environment.
- Preprocess the COCO labels:
  ```bash
  python3 src/parse_data/create_labels_json.py
  ```
- Embed the COCO dataset with CLIP and compute the modality means (the idea is sketched after this list):
  ```bash
  python3 src/parse_data/parse_coco.py
  python3 src/parse_data/compute_embed_means.py
  ```
- (Optional) Embed the COCO dataset with ImageBind and compute the modality means (also sketched after this list):
  ```bash
  conda activate imagebind
  python3 src/parse_data/parse_coco_imagebind.py
  python3 src/parse_data/compute_embed_means_imagebind.py
  conda deactivate
  ```
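For intuition about what the embedding steps produce: encode every image and caption with CLIP, then average each modality's normalized embeddings into a per-modality mean, which the C3 training stages presumably use to account for the gap between image and text embeddings. The sketch below is illustrative only; the file paths, labels JSON layout, and CLIP backbone are assumptions, not the repo's actual code.

```python
# Illustrative sketch (not the repo's scripts): embed COCO with CLIP and
# compute per-modality mean embeddings. Paths and JSON layout are assumptions.
import json
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed format: a list of {"image_path": ..., "caption": ...} records.
records = json.load(open("data/coco/labels.json"))

image_embeds, text_embeds = [], []
with torch.no_grad():
    for rec in records:
        image = preprocess(Image.open(rec["image_path"])).unsqueeze(0).to(device)
        tokens = clip.tokenize([rec["caption"]], truncate=True).to(device)

        img = model.encode_image(image).float()
        txt = model.encode_text(tokens).float()

        # L2-normalize before averaging so both means live on the same scale.
        image_embeds.append(img / img.norm(dim=-1, keepdim=True))
        text_embeds.append(txt / txt.norm(dim=-1, keepdim=True))

image_mean = torch.cat(image_embeds).mean(dim=0)
text_mean = torch.cat(text_embeds).mean(dim=0)
torch.save({"image_mean": image_mean, "text_mean": text_mean}, "embed_means.pt")
```

Normalizing before averaging is a design choice of this sketch; the actual scripts may handle scaling differently.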
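The optional ImageBind route is analogous. The sketch below follows the usage pattern from the official ImageBind README (hence the separate `imagebind` environment); the paths and captions are placeholders, and the exact API may differ across ImageBind versions.

```python
# Illustrative ImageBind counterpart (not the repo's script); paths are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

image_paths = ["data/coco/train2014/example.jpg"]     # placeholder image paths
captions = ["a person riding a bike down a street"]   # placeholder captions

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.TEXT: data.load_and_transform_text(captions, device),
}

with torch.no_grad():
    embeddings = model(inputs)

image_mean = embeddings[ModalityType.VISION].mean(dim=0)
text_mean = embeddings[ModalityType.TEXT].mean(dim=0)
torch.save({"image_mean": image_mean, "text_mean": text_mean}, "embed_means_imagebind.pt")
```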
Training, model, logging, and data configurations are provided in `configs`.
Scripts to run all C3 experiments on COCO using CLIP and ImageBind are provided in `scripts`. We provide the uni-modal text training (`stage1`) and cross-modal image-to-text training (`stage2`) scripts for CLIP in `scripts/coco_scripts`, and the uni-modal text training (`stage1`) scripts for ImageBind in `imagebind_coco_scripts`.
To run, for example, the stage-1 uni-modal text training with CLIP:
```bash
bash ./scripts/coco_scripts/stage1/train_unimodal_c3.sh
```