Possible Ray OOM when processing long documents #568
Comments
@versae have you tried disabling and seeing if it fixes? |
Yes, I now set |
interesting ok, I guess it's time to give up on that then. Do you reduce the batch size? |
Yes, for processing very very long documents (tens of millions of tokens) I had to set it to 1 and set |
OK, I think I found a winning combination, setting |
The workers OOM a few times during tokenization of a dataset with very long documents (over 1M chars), but succeed in the end after adjusting the batch size of BatchTokenizer and retrying.

@dlwh: Could this be the reason?
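The adjust-and-retry workaround described above can be sketched roughly as follows. Note that tokenize_with_backoff and tokenize_batch are hypothetical names for illustration, not Levanter APIs, and MemoryError here stands in for a Ray worker OOM, which in practice surfaces as a killed task rather than a Python exception:

```python
def tokenize_with_backoff(docs, tokenize_batch, batch_size=512, min_batch_size=1):
    """Tokenize docs in batches, halving the batch size whenever a batch OOMs.

    tokenize_batch: hypothetical callable standing in for BatchTokenizer;
    takes a list of documents and returns their tokenized forms.
    """
    while batch_size >= min_batch_size:
        try:
            out = []
            for i in range(0, len(docs), batch_size):
                out.extend(tokenize_batch(docs[i:i + batch_size]))
            return out
        except MemoryError:  # stand-in for the worker OOM
            batch_size //= 2  # retry the whole pass with a smaller batch
    raise RuntimeError("tokenization failed even at the minimum batch size")
```

Restarting the whole pass on failure is wasteful but simple; for tens of millions of tokens per document, a batch size of 1 may be the only setting that fits in worker memory at all.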
levanter/src/levanter/data/text.py
Lines 308 to 312 in 2516d06
It seems this will, in most cases, enable multithreading in Rust.
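If the Rust-level multithreading in Hugging Face tokenizers is contributing to memory pressure on the Ray workers, it can be turned off with the TOKENIZERS_PARALLELISM environment variable. A minimal sketch (the variable must be set before the tokenizer is first used, and this does not change Levanter's own batching):

```python
import os

# Disable the Rust backend's internal thread pool in Hugging Face tokenizers.
# This must happen before any tokenizer encodes text, or the setting is ignored.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

With parallelism disabled, each Ray worker tokenizes single-threaded, trading speed for a smaller and more predictable memory footprint per task.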