Possible Ray OOM when processing long documents #568

Open

versae opened this issue Apr 30, 2024 · 5 comments

Comments

@versae
Contributor

versae commented Apr 30, 2024

The workers OOM a few times during tokenization of a dataset with very long documents (over 1M chars), but the job succeeds in the end by adjusting the batch size of BatchTokenizer and just retrying.

@dlwh:

Yeah, so I think what's happening is that Ray creates 1 process per CPU on the node (even though we always schedule ~16 or so CPUs per tokenization task) and reuses those processes to process batches; Ray seems to do some kind of round-robin scheduling across these processes.
This is fine and good, except HF tokenizers retains memory somehow in those processes (probably as an optimization?), and memory use seems to be directly related to the doc sizes. This means we're retaining num_processes * whatever RAM it is, and this OOMs on TPU for large enough books.
If Ray would reuse fewer processes, or just not allocate so many, it would be fine.
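
As a rough, purely hypothetical illustration of that failure mode (the numbers below are made up for illustration, not measured from Ray or Levanter):

# Hypothetical back-of-the-envelope: retained memory scales with the number of worker processes.
num_worker_processes = 200     # assumption: one Ray worker process per CPU on a big host
retained_per_process_gb = 1.0  # assumption: RAM a tokenizer process keeps after a very long doc
print(f"~{num_worker_processes * retained_per_process_gb:.0f} GB retained across the pool")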

Could the following be the reason?

import os

from transformers import PreTrainedTokenizerBase


def _maybe_force_tokenizer_parallelism(tokenizer: PreTrainedTokenizerBase):
    if tokenizer.is_fast and os.getenv("TOKENIZERS_PARALLELISM") is None:
        # if we're using a fast tokenizer, we want to force parallelism
        # to be the number of CPUs
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

It seems that in most cases this will enable multithreading in the Rust tokenizers backend.

@dlwh
Member

dlwh commented May 13, 2024

@versae have you tried disabling it and seeing if that fixes it?

@versae
Contributor Author

versae commented May 13, 2024

Yes, I now set TOKENIZERS_PARALLELISM to false in my setup scripts. It seems to help, but I'm not sure it is the definitive fix.
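
For anyone applying the same workaround, a minimal sketch of setting this from Python rather than a shell setup script; the only requirement is that it runs before the tokenizer is first used (the environment variable is the one HF tokenizers reads; the checkpoint name is just an example):

import os

# Keep the HF tokenizers Rust backend single-threaded in this process
# (and in any child processes that inherit the environment).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint; any fast tokenizer works
ids = tokenizer("a very long document ...")["input_ids"]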

@dlwh
Member

dlwh commented May 13, 2024

Interesting, OK. I guess it's time to give up on that then. Do you also reduce the batch size?

@versae
Contributor Author

versae commented May 15, 2024

Yes, for processing very, very long documents (tens of millions of tokens) I had to set it to 1 and set TOKENIZERS_PARALLELISM to false. Slower, but at least it hasn't failed me yet. Is the batch size of the tokenizer something we can set in the training config?
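
Not the Levanter config, just a self-contained sketch of the batch-size-1 strategy described above: tokenizing one huge document per call bounds peak memory at the cost of throughput (the checkpoint name and helper function are illustrative):

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # as above, keep the Rust backend single-threaded

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

def tokenize_in_batches(docs, batch_size=1):
    # batch_size=1 means the tokenizer only ever holds one huge document at a time
    for start in range(0, len(docs), batch_size):
        batch = docs[start : start + batch_size]
        yield from tokenizer(batch)["input_ids"]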

@versae
Contributor Author

versae commented May 16, 2024

OK, I think I found a winning combination: setting SLURM_CPUS_ON_NODE=16 TOKENIZERS_PARALLELISM=false seems to work with the current batch size. On a TPU v4-32, 3 out of 4 nodes sometimes fail right after loading the weights, but the one that keeps running is able to finish the tokenization. So I just leave it running and, when it's done, I restart training without SLURM_CPUS_ON_NODE.
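
A related knob, separate from the SLURM_CPUS_ON_NODE workaround above: Ray itself can be told how many CPUs (and hence roughly how many worker processes) to use on a node. This is a generic Ray sketch, not what Levanter's launcher actually does:

import ray

# Cap the logical CPU count Ray sees so it creates (and round-robins over) roughly 16
# worker processes on this node, bounding the memory the tokenizer can retain across them.
ray.init(num_cpus=16)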
