# HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment [Paper]
- Build the running environment (two ways):
  1. `pip install --editable .`
  2. `python setup.py build_ext --inplace`
- Install the syntax parser (a usage sketch follows this list):
  `pip install stanza==1.2.2 stanza_batch==0.2.2`
- PyTorch 1.7.0, torchvision 0.8.0, and cudatoolkit 10.1 (installing via pip also works):
  `conda install pytorch==1.7.0 torchvision==0.8.0 cudatoolkit=10.1 -c pytorch`
- Python 3.7.6
- METEOR-1.5, with Java JDK 1.8.0 (or higher) to run it
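As a quick check that the syntax parser is installed correctly, the sketch below runs a Stanza dependency parse on a sample caption. This is standard Stanza API usage, not this repository's preprocessing code, and the example sentence is illustrative:

```python
# Minimal Stanza dependency-parse sketch (standard Stanza API;
# the example sentence is illustrative).
import stanza

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("A man in a blue shirt is standing on a ladder.")
for word in doc.sentences[0].words:
    # print each token with its dependency relation and head index
    print(word.text, word.deprel, word.head)
```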
The dataset used in this work is Multi30K; both its original version and the preprocessed version used here are available here.
You can also download your own dataset and then refer to `experiments/prepare-iwslt14.sh` or `experiments/prepare-wmt14en2de.sh` to preprocess it.
File Name | Description | Download |
---|---|---|
resnet50-avgpool.npy | Pre-extracted image features; each image is represented as a 2048-dimensional vector. | Link |
Multi30K EN-DE Task | BPE+TOK text, image index, and labels for the English-German task (including train, val, test2016/17/mscoco) | Link |
Multi30K EN-FR Task | BPE+TOK text, image index, and labels for the English-French task (including train, val, test2016/17/mscoco) | Link |
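A minimal sketch of inspecting the pre-extracted features after download; the file name comes from the table above, while the array layout of one 2048-dimensional vector per image is an assumption:

```python
# Inspect the pre-extracted ResNet-50 average-pooled image features.
# Assumption: the .npy file stores one 2048-d vector per image.
import numpy as np

features = np.load("resnet50-avgpool.npy")
print(features.shape)  # expected: (num_images, 2048)
print(features.dtype)
```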
You can run the code with the scripts in the `experiments` directory.

- Preprocess the dataset into torch format:
  `bash pre.sh`
- Train the model:
  `bash train.sh`
- Generate target sentences:
  `bash gen.sh`
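Since evaluation relies on METEOR-1.5 (listed in the requirements above), below is a hedged sketch of scoring the generated sentences with its Java CLI. The jar location, file names, and target language are assumptions; the `-l` and `-norm` flags follow METEOR 1.5's documented command line:

```python
# Score generated translations with METEOR 1.5 via its Java CLI.
# Assumptions: meteor-1.5.jar is in the working directory, and
# hypotheses.txt / references.txt hold one sentence per line.
import subprocess

subprocess.run(
    [
        "java", "-Xmx2G", "-jar", "meteor-1.5.jar",
        "hypotheses.txt", "references.txt",
        "-l", "de",  # target language (German for the EN-DE task)
        "-norm",     # normalize punctuation/tokenization before scoring
    ],
    check=True,
)
```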
If you use the code in your research, please cite:
@inproceedings{peng2022hybridvocab,
  title={HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment},
  author={Peng, Ru and Zeng, Yawen and Zhao, Junbo},
  booktitle={Proceedings of the 2022 International Conference on Multimedia Retrieval},
  pages={380--388},
  year={2022}
}