This is the source code of our ACM MM 2023 paper "Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval".
conda env create -f environment.yml
source activate Real
We release the Real400K subset in full and conducted the experimental comparisons in our paper on this subset. To facilitate downloading, we provide a Baidu Netdisk download link. Please note:
- Regarding dataset download, please sign the Release Agreement and send it to Yanzhe Chen. By sending the application, you agree and acknowledge that you have read and understood the notice. We will reply with the file and the corresponding guidelines as soon as we receive your request!
- The dataset is large and requires approximately 136 GB of storage.
- The dataset is organized as follows; please pay attention to the correspondence between goods images, video frames, and their related text (a minimal loading sketch follows the directory tree).
Dataset/
├─ Real20M|Real400K/
│  ├─ query/
│  ├─ goods/
│  │  ├─ images
│  │  ├─ text
│  ├─ video/
│  │  ├─ images
│  │  ├─ text
├─ train_file/
├─ test_file/
├─ checkpoints/
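The sketch below shows one way to pair goods images with their texts under this layout. The shared-stem naming and the `.txt` extension are assumptions for illustration; adjust them to the files actually released.

```python
import os

# Minimal sketch: pair each goods image with its text by shared file stem.
# The naming scheme and ".txt" extension are assumptions, not the official spec.
def pair_goods(root):
    image_dir = os.path.join(root, "goods", "images")
    text_dir = os.path.join(root, "goods", "text")
    pairs = []
    for name in sorted(os.listdir(image_dir)):
        stem, _ = os.path.splitext(name)
        text_path = os.path.join(text_dir, stem + ".txt")  # assumed extension
        if os.path.exists(text_path):
            with open(text_path, encoding="utf-8") as f:
                pairs.append((os.path.join(image_dir, name), f.read().strip()))
    return pairs

# e.g. pairs = pair_goods("Dataset/Real400K"); the video/ folder can be paired the same way.
```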
- Dataset: Contains the data, split files and ckpts.
- datasets: Contains loading files for the dataset.
- evaluate: Contains evaluation scripts for rapid retrieval (≈ 20 minutes to run on a V100).
- losses: Includes the loss functions used in this project as well as other commonly used loss functions.
- models: Includes the models used in this project as well as related models that may be compared.
- utils: Comprises utility functions that support various tasks within the project.
evaluation.py: Evaluation code. Because the dataset is large, extracted features are written to files rather than kept in memory.
main_cross_domain_emb.py: Entry point for training and testing, with the logic in main() as follows (a minimal DDP sketch is given after the list).
- Basic settings
- Initialize models
- Optionally resume from a checkpoint
- Data loading
- TESTING (Exit after running the test code if args.evaluate==True)
- Initialize losses, optimizers, and grad_amp
- TRAINING LOOP
- Terminate the DDP process
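The sketch below mirrors this flow with standard PyTorch DDP calls. `build_model`, `RealDataset`, `validate`, and `train_one_epoch` are placeholders for the project's own components, not its actual API.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(args):
    # Basic settings: one process per GPU, initialized by the DDP launcher.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Initialize models (build_model is a placeholder for the project's builders).
    model = build_model(args).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Optionally resume from a checkpoint.
    if args.resume:
        state = torch.load(args.resume, map_location="cpu")
        model.module.load_state_dict(state["state_dict"])

    # Data loading with a distributed sampler.
    dataset = RealDataset(args)  # placeholder dataset class
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=args.batch_size, sampler=sampler, num_workers=args.workers)

    # TESTING: exit after the test code when only evaluation is requested.
    if args.evaluate:
        validate(model, args)  # placeholder for the retrieval evaluation
        dist.destroy_process_group()
        return

    # Initialize losses, optimizers, and the AMP grad scaler.
    criterion = torch.nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scaler = torch.cuda.amp.GradScaler()

    # TRAINING LOOP.
    for epoch in range(args.epochs):
        sampler.set_epoch(epoch)
        train_one_epoch(loader, model, criterion, optimizer, scaler)  # placeholder

    # Terminate the DDP process group.
    dist.destroy_process_group()
```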
Please note:
- Please remember to complete the paths at the beginning of the following script files.
- The training code is built on PyTorch with DistributedDataParallel (DDP).
- We pretrain the framework on 2 nodes, each with 8 V100 GPUs (10 epochs in about two days).
# Train the query-guided cross-domain retrieval framework.
sh train.sh
# Evaluate on the Video2goods task.
sh video2goods_evaluate.sh
# Evaluate on the Goods2video task.
sh goods2video_evaluate.sh
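Both evaluation scripts follow the pattern noted for evaluation.py: embeddings are first written to files, then retrieval metrics are computed offline. The sketch below illustrates that pattern with hypothetical .npy feature dumps and a simple Recall@K; it is not the repository's actual evaluation code.

```python
import numpy as np
import torch

# Minimal sketch of file-cached retrieval evaluation (file names are hypothetical).
def recall_at_k(query_file, gallery_file, gt_file, k=10):
    queries = torch.from_numpy(np.load(query_file)).float()    # [Nq, D] e.g. video embeddings
    gallery = torch.from_numpy(np.load(gallery_file)).float()  # [Ng, D] e.g. goods embeddings
    gt = torch.from_numpy(np.load(gt_file)).long()             # [Nq] index of each query's match

    # Cosine similarity between L2-normalized embeddings.
    queries = torch.nn.functional.normalize(queries, dim=1)
    gallery = torch.nn.functional.normalize(gallery, dim=1)
    sims = queries @ gallery.t()

    topk = sims.topk(k, dim=1).indices                  # [Nq, k]
    hits = (topk == gt.unsqueeze(1)).any(dim=1).float()
    return hits.mean().item()

# e.g. recall_at_k("video_feats.npy", "goods_feats.npy", "gt.npy", k=10)
```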
Due to restrictions imposed by Kuaishou on code sharing, we are unable to make the pre-training framework code public. However, we will open-source the model weights and provide access to them via the Baidu Netdisk download link.
Please download the checkpoints and put them under outputs/checkpoints/. pretrain.pth.tar is the pre-trained model, while checkpoint.pth.tar is the model that achieves the SOTA results.
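A minimal sketch of loading the released weights is shown below; it assumes each .pth.tar stores a dictionary with a state_dict saved from a DDP-wrapped model, so adjust the keys to the actual checkpoint layout.

```python
import torch

# Minimal sketch: load the released weights (the key layout is an assumption).
ckpt = torch.load("outputs/checkpoints/checkpoint.pth.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
# Strip a possible "module." prefix left over from DDP training.
state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
# model.load_state_dict(state_dict, strict=False)  # `model` is built by the project's code
```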
If you find our work helpful, please cite our paper.
@inproceedings{chen2023real20m,
title={Real20M: A large-scale e-commerce dataset for cross-domain retrieval},
author={Chen, Yanzhe and Zhong, Huasong and He, Xiangteng and Peng, Yuxin and Cheng, Lele},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
pages={4939--4948},
year={2023}
}
This repo is maintained by Yanzhe Chen. Questions and discussions are welcome via [email protected].
Our code references the following projects. Many thanks to the authors!