About Training #30

Open
Forainest789 opened this issue Dec 12, 2024 · 0 comments

Thanks for your work.
I have some questions related to training.
I tried to train the model on a small portion of the data. However, when I train on a dataset hosted online, such as
https://huggingface.co/datasets/imageomics/TreeOfLife-10M/blob/main/dataset/EOL/image_set_01.tar.gz,
or on a locally downloaded copy of it, with the following command,

python -m src.training.main \
  --train-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
  --val-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
  --dataset-type 'webdataset' \
  --pretrained 'openai' \
  --text_type 'random' \
  --warmup 100 \
  --batch-size 1 \
  --accum-freq 1 \
  --epochs 10 \
  --workers 1 \
  --model ViT-B-16 \
  --lr 1e-4 \
  --log-every-n-steps 1 \
  --dataset-resampled \
  --local-loss \
  --gather-with-grad \
  --grad-checkpointing \
  --logs '../storage/log/' \
  --train-num-samples 98000

training always gets stuck at the following point:

2024-12-11,23:16:02 | INFO | wandb_notes:
2024-12-11,23:16:02 | INFO | wandb_project_name: open-clip
2024-12-11,23:16:02 | INFO | warmup: 100
2024-12-11,23:16:02 | INFO | wd: 0.2
2024-12-11,23:16:02 | INFO | workers: 1 
2024-12-11,23:16:02 | INFO | world_size: 1 
2024-12-11,23:16:02 | INFO | zeroshot_frequency: 2 
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 98000. 
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 0. 
2024-12-11,23:16:02 | INFO | Start epoch 0 
<webdataset.compat.WebLoader object at 0x719706e3a170>
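
One way to rule out an empty or unreadable shard is to inspect a downloaded copy directly. The sketch below is only a diagnostic aid under stated assumptions: it assumes the shard has been saved locally, and it groups file names the way WebDataset conventionally does (everything before the first dot of the base name is the sample key, the rest is the extension). `split_key` and `summarize_shard` are hypothetical helper names, not part of this repository, and the actual layout of the TreeOfLife-10M shards may differ.

```python
import tarfile
from collections import defaultdict


def split_key(name):
    """Split a tar member name into (key, extension) at the first dot of its base name.

    Returns None for names without a usable key (no dot, or a leading dot).
    """
    directory, _, base = name.rpartition("/")
    stem, dot, ext = base.partition(".")
    if not dot or not stem:
        return None
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def summarize_shard(path):
    """Return {sample_key: set_of_extensions} for a local .tar or .tar.gz shard."""
    samples = defaultdict(set)
    # Mode "r:*" lets tarfile detect gzip compression transparently.
    with tarfile.open(path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            parts = split_key(member.name)
            if parts is None:
                continue
            key, ext = parts
            samples[key].add(ext)
    return dict(samples)


# Example usage (the local file name is an assumption):
# summary = summarize_shard("image_set_01.tar.gz")
# print(len(summary), "samples")
# print(sorted({ext for exts in summary.values() for ext in exts}))
```

If this reports zero samples, or extensions other than those the training data pipeline expects, the shard itself would be a more likely culprit than the training loop.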

In addition, when creating the data with

python scripts/evobio10m/make_metadata.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite

I found that the "data/resolved.jsonl" file is missing, and the ToL-EDA HF repo mentioned in the README has disappeared.

Could you help me solve these problems, or point me to where I can find more details about training?

Thank you very much.
