Possible Ray OOM when processing long documents #568

Open

versae opened this issue Apr 30, 2024 · 5 comments

Comments

@versae
Contributor

versae commented Apr 30, 2024

The workers OOM a few times during tokenization of a dataset with very long documents (over 1M chars), but the job succeeds in the end by adjusting the batch size of BatchTokenizer and just retrying.

@dlwh:

Yeah, so I think what's happening is that Ray creates 1 process per CPU on the node (even though we always schedule ~16 or so CPUs per tokenization task) and reuses those processes to process batches; Ray seems to do some kind of round-robin scheduling across these processes.
This is fine and good, except HF tokenizers retains memory somehow in those processes (probably as an optimization?), and memory use seems to be directly related to the doc sizes. This means we're retaining num_processes * whatever RAM it is, and this OOMs on TPU for large enough books.
If Ray would reuse fewer processes, or just not allocate so many, it would be fine.
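
As a rough, purely hypothetical illustration of that failure mode (the numbers below are made up for illustration, not measured from Ray or Levanter):

# Hypothetical back-of-the-envelope: retained memory scales with the number of worker processes.
num_worker_processes = 200     # assumption: one Ray worker process per CPU on a big host
retained_per_process_gb = 1.0  # assumption: RAM a tokenizer process keeps after a very long doc
print(f"~{num_worker_processes * retained_per_process_gb:.0f} GB retained across the pool")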

Could the following be the reason?

import os

from transformers import PreTrainedTokenizerBase


def _maybe_force_tokenizer_parallelism(tokenizer: PreTrainedTokenizerBase):
    if tokenizer.is_fast and os.getenv("TOKENIZERS_PARALLELISM") is None:
        # if we're using a fast tokenizer, we want to force parallelism
        # to be the number of CPUs
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

It seems that in most cases this will enable multithreading in the Rust tokenizers backend.

@dlwh
Member

dlwh commented May 13, 2024

@versae have you tried disabling it and seeing if that fixes it?

@versae
Contributor Author

versae commented May 13, 2024

Yes, I now set TOKENIZERS_PARALLELISM to false in my setup scripts. It seems to help, but I'm not sure it is the definitive fix.
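
For anyone applying the same workaround, a minimal sketch of setting this from Python rather than a shell setup script; the only requirement is that it runs before the tokenizer is first used (the environment variable is the one HF tokenizers reads; the checkpoint name is just an example):

import os

# Keep the HF tokenizers Rust backend single-threaded in this process
# (and in any child processes that inherit the environment).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint; any fast tokenizer works
ids = tokenizer("a very long document ...")["input_ids"]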

@dlwh
Member

dlwh commented May 13, 2024

Interesting, OK. I guess it's time to give up on that then. Do you also reduce the batch size?

@versae
Contributor Author

versae commented May 15, 2024

Yes, for processing very, very long documents (tens of millions of tokens) I had to set it to 1 and set TOKENIZERS_PARALLELISM to false. Slower, but at least it hasn't failed me yet. Is the batch size of the tokenizer something we can set in the training config?
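
Not the Levanter config, just a self-contained sketch of the batch-size-1 strategy described above: tokenizing one huge document per call bounds peak memory at the cost of throughput (the checkpoint name and helper function are illustrative):

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # as above, keep the Rust backend single-threaded

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

def tokenize_in_batches(docs, batch_size=1):
    # batch_size=1 means the tokenizer only ever holds one huge document at a time
    for start in range(0, len(docs), batch_size):
        batch = docs[start : start + batch_size]
        yield from tokenizer(batch)["input_ids"]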

@versae
Contributor Author

versae commented May 16, 2024

OK, I think I found a winning combination: setting SLURM_CPUS_ON_NODE=16 TOKENIZERS_PARALLELISM=false seems to work with the current batch size. On a TPU v4-32, 3 out of 4 nodes sometimes fail right after loading the weights, but the one that keeps running is able to finish the tokenization. So I just leave it running and, when it's done, I restart training without SLURM_CPUS_ON_NODE.
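
A related knob, separate from the SLURM_CPUS_ON_NODE workaround above: Ray itself can be told how many CPUs (and hence roughly how many worker processes) to use on a node. This is a generic Ray sketch, not what Levanter's launcher actually does:

import ray

# Cap the logical CPU count Ray sees so it creates (and round-robins over) roughly 16
# worker processes on this node, bounding the memory the tokenizer can retain across them.
ray.init(num_cpus=16)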
