Tensorboard FileExistsError on SLURM when training multi-node with 1 process per GPU #570
Comments
Good to know. Probably we just need to set the paths more carefully? Our internal SLURM cluster does have wandb, so we don't test TB as thoroughly as we should there. Could you try just adding something like
Adding the process index to the path allows it to train without error: dir_to_write = os.path.join(dir_to_write, run_id, str(jax.process_index())). Not sure why the same error doesn't occur for multi-node training with 1 process per node. One solution could be something along the lines of the sketch below, though I'm not sure if it messes with the other way of training (1 process per node).
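A rough sketch of that path change, assuming a small helper that builds the per-process TensorBoard log directory (the helper name and surrounding code are illustrative, not Levanter's actual implementation):

```python
import os
import jax

def make_log_dir(base_dir: str, run_id: str) -> str:
    # Illustrative helper (not Levanter's actual code): append the JAX
    # process index so every process gets its own subdirectory and no two
    # processes race on the same TensorBoard log path.
    dir_to_write = os.path.join(base_dir, run_id, str(jax.process_index()))
    os.makedirs(dir_to_write, exist_ok=True)
    return dir_to_write
```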
@Lauler did you ever check on this?
I didn't manage to get it to work with the suggested change.
It only results in 1 process writing to the run_id dir, and the rest of the processes trying to write directly to the original dir. Moving more of the logic of the method into the if-statement, so that the other processes are prevented from writing logs, instead results in a different error. I switched over to wandb because I realized it could be run in offline mode. Didn't spend any more time troubleshooting tensorboard, and have no need for it anymore.
Did a quick test and it works when adding a process_index check to the CompositeTracker class, though I have a quite low degree of confidence that my quick hack doesn't break something else. There should be a cleaner solution. Changed the init of the tracker and of CompositeTracker, roughly along the lines of the sketch below.
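A minimal sketch of what such a process_index guard could look like (the class shape and method names here are simplified stand-ins, not Levanter's actual CompositeTracker):

```python
import jax

class CompositeTracker:
    """Simplified stand-in for a tracker that fans out to several backends."""

    def __init__(self, trackers):
        # Hypothetical guard: only the primary process keeps real trackers,
        # so non-zero processes never open TensorBoard event files and the
        # FileExistsError on the shared log directory cannot occur.
        if jax.process_index() == 0:
            self.trackers = list(trackers)
        else:
            self.trackers = []

    def log(self, metrics: dict, *, step: int | None = None):
        # Fan the metrics out to whatever trackers this process holds
        # (an empty list on non-primary processes, so this is a no-op there).
        for tracker in self.trackers:
            tracker.log(metrics, step=step)
```

With this kind of guard only process 0 logs anything, which is also why a cleaner upstream fix would be preferable.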
The documentation of Levanter recommends one process per GPU rather than one process per node when training on multiple GPUs. The untested script for launching a multi-node SLURM job also sets --ntasks-per-node to be equal to the number of GPUs per node. Training with 1 process per GPU and Tensorboard logging, however, leads to the run crashing with a FileExistsError because multiple processes try to write to the same logs (truncated traceback):
Not sure whether training with 1 process per GPU works with wandb, as our compute nodes do not have access to the internet (a fairly common scenario for compute nodes on HPC). For this reason we use tensorboard.
Training with 1 process per node, however, works fine.
Thought I'd open an issue in case anyone else wants to train with Levanter on GPUs and SLURM. Set your --ntasks-per-node=1 and it should work.
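For anyone checking which topology their SLURM launch actually produced, a quick sanity check at startup with plain JAX calls (nothing Levanter-specific) is sketched below; with --ntasks-per-node=1, local_device_count() should report all GPUs on the node, whereas with one process per GPU (each task pinned to a single GPU) it is typically 1.

```python
import jax

# Print the process/device topology this launch ended up with.
# With --ntasks-per-node=1: process_count() == number of nodes and
# local_device_count() == GPUs per node. With one process per GPU
# (each task bound to a single device): local_device_count() == 1.
print(
    f"process {jax.process_index()} of {jax.process_count()}, "
    f"local devices: {jax.local_device_count()}, "
    f"global devices: {jax.device_count()}"
)
```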