For my project I used the Microsoft GIT Large model trained on the COCO image dataset [4][5]. I found it relatively simple to implement and work with. Fine-tuning the model took the most time: I had to experiment with the attention mask, learning rate, and batch size before I got a model that performs well. I ended up with a parameter set that reached a CIDEr score of ~75 after only 1 epoch. I had fun learning about Hugging Face and how deep learning models are implemented!
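To make the tuning loop above concrete, here is a minimal sketch of fine-tuning a GIT checkpoint with Hugging Face `transformers` while sweeping learning rate and batch size. It is not the contents of demo/train.py. The checkpoint name `microsoft/git-large-coco`, the `TrainConfig` and `make_search_grid` helpers, the `(image, caption)` input format, and the way padded positions are masked out of the loss are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for one hyperparameter setting."""
    learning_rate: float
    batch_size: int
    epochs: int = 1


def make_search_grid(learning_rates, batch_sizes):
    """Return one TrainConfig per (learning rate, batch size) pair."""
    return [TrainConfig(lr, bs) for lr, bs in product(learning_rates, batch_sizes)]


def fine_tune(config, train_examples, checkpoint="microsoft/git-large-coco"):
    """Fine-tune a GIT checkpoint on a list of (PIL image, caption) pairs.

    The heavy imports live inside the function so that the grid helper
    above can be used without torch or transformers installed.
    """
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()

    def collate(batch):
        images, captions = zip(*batch)
        # padding=True pads captions to the longest one in the batch; the
        # returned attention_mask is 0 at padded positions.
        enc = dict(processor(images=list(images), text=list(captions),
                             padding=True, return_tensors="pt"))
        # Assumption: exclude padded positions from the loss by setting
        # their labels to -100, the default ignore index of cross-entropy.
        labels = enc["input_ids"].clone()
        labels[enc["attention_mask"] == 0] = -100
        enc["labels"] = labels
        return enc

    loader = DataLoader(train_examples, batch_size=config.batch_size,
                        shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

    for _ in range(config.epochs):
        for batch in loader:
            batch = {name: tensor.to(device) for name, tensor in batch.items()}
            loss = model(**batch).loss  # captioning loss computed from labels
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

A sweep would then call `fine_tune(cfg, train_examples)` for each `cfg` in `make_search_grid([1e-5, 5e-5], [4, 8, 16])` and keep the configuration with the best validation score; those example values are placeholders, not the settings I ended up with.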
The model is completely contained within demo/train.py and demo/test.py, but most of my experiments and work were done in experiments.ipynb.
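For readers unfamiliar with the CIDEr metric reported above, the following is a simplified sketch of the idea: TF-IDF weighted n-gram vectors are compared with cosine similarity, averaged over the reference captions and over n-gram lengths 1 to 4. It assumes plain whitespace tokenization and omits the stemming, count clipping, and length penalty of the CIDEr-D variant used by common evaluation toolkits, so it will not reproduce the score I reported. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf_vector(tokens, n, doc_freq, num_images):
    """Weight each n-gram count by log(num_images / document frequency)."""
    return {
        gram: count * math.log(num_images / max(1, doc_freq[gram]))
        for gram, count in ngram_counts(tokens, n).items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm > 0 else 0.0


def simple_cider(candidates, references, max_n=4):
    """candidates: one caption per image; references: a list of captions per image."""
    num_images = len(candidates)
    cand_tokens = [c.lower().split() for c in candidates]
    ref_tokens = [[r.lower().split() for r in refs] for refs in references]

    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: in how many images' reference sets each n-gram occurs.
        doc_freq = Counter()
        for refs in ref_tokens:
            seen = set()
            for ref in refs:
                seen.update(ngram_counts(ref, n))
            doc_freq.update(seen)

        for cand, refs in zip(cand_tokens, ref_tokens):
            cand_vec = tfidf_vector(cand, n, doc_freq, num_images)
            sims = [cosine(cand_vec, tfidf_vector(ref, n, doc_freq, num_images)) for ref in refs]
            # Average over references, images, and n-gram lengths.
            score += sum(sims) / len(sims) / num_images / max_n
    return score
```

Because the IDF term is computed over the whole evaluation set, the score is only meaningful with more than one image: with a single image every weight becomes log(1) = 0. This sketch returns an unscaled value; the scaling behind the ~75 figure depends on the evaluation toolkit used.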
- CIDEr: Consensus-based Image Description Evaluation
- BLEU: A Misunderstood Metric from Another Age, Medium Post
- BLEU Metric, HuggingFace space
- Microsoft GIT Large
- GIT: A Generative Image-to-text Transformer for Vision and Language, Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang (2022)