In #19955, an issue was raised that optimizer_to_device moved the "step" tensor in the optimizer state to the CUDA device, causing device-to-host syncs during optimizer.step() because the "step" tensor is expected to remain on CPU. #20019 fixed this by special-casing that key. However, good arguments were made in #19955 that optimizer_to_device shouldn't even be necessary in the first place (#19955 (comment)).
Pitch
Remove optimizer_to_device and show that it is redundant by running the tests. We will still need an optimizer_to_cpu for teardown.
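For context, here is a minimal sketch of what such a helper looks like against plain PyTorch (not Lightning's actual implementation), including the special treatment of the "step" key that #20019 added. The `skip_keys` parameter name is illustrative:

```python
import torch

def optimizer_to_device(optimizer, device, skip_keys=("step",)):
    """Illustrative sketch: move per-parameter optimizer state to `device`.

    Keys listed in `skip_keys` (e.g. Adam's scalar "step" counter) are left
    where they are, because moving "step" to CUDA forces a device-to-host
    sync every time optimizer.step() reads it.
    """
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor) and key not in skip_keys:
                state[key] = value.to(device)

# Usage: populate Adam state with one step, then "move" it (CPU-to-CPU here,
# so this runs without a GPU; on a CUDA machine pass torch.device("cuda")).
param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.Adam([param], lr=1e-3)
param.grad = torch.randn(4)
opt.step()

optimizer_to_device(opt, torch.device("cpu"))
state = opt.state[param]
print(sorted(state.keys()))
```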
Thanks @awaelchli.
My analysis of the code agrees that we can remove this function (see #19955 (comment)). The remaining points of note:
This function is also called by Strategy.teardown(), where the idea is to transfer the optimizer back from the GPU to the CPU. We may not really care about this, since once the fit is complete the optimizer will either be checkpointed or discarded. The question is whether users depend on this final transfer behavior, but I would guess not. This factors into whether we need an optimizer_to_cpu for teardown, as mentioned above.
I created additional tests in Test optimizer to device #20062 that I think help verify the function is not needed. Some of them may be useful in a PR that removes it.
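One property such tests can lean on, and a reason the explicit transfer looks redundant: torch.optim.Optimizer.load_state_dict already casts the loaded per-parameter state to the device and dtype of the matching parameters. A minimal sketch, CPU-only so it runs anywhere; on a GPU run the same loop would show the state landing on the parameter's CUDA device without any optimizer_to_device call:

```python
import torch

# Build an optimizer, take one step so it has state, and round-trip the
# state dict; load_state_dict re-homes state onto each param's device/dtype.
param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.Adam([param], lr=1e-3)
param.grad = torch.randn(4)
opt.step()
saved = opt.state_dict()

# A fresh optimizer over the same params (e.g. after a checkpoint restore).
opt2 = torch.optim.Adam([param], lr=1e-3)
opt2.load_state_dict(saved)

restored = opt2.state[param]
# Floating-point state tensors match the parameter's device and dtype.
for key, value in restored.items():
    if isinstance(value, torch.Tensor) and value.is_floating_point():
        assert value.device == param.device
        assert value.dtype == param.dtype
print("state restored on", param.device)
```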
I am likely not the right person to respond to this question, but my instinct agrees with you: moving the optimizer back to CPU doesn't need to be part of the teardown. I'd defer to others on this, though.
I reviewed the part of the tests that moves the state_dict back and forth. Left my comments on the PR!
Outline & Motivation
The trainer uses the function optimizer_to_device here:
pytorch-lightning/src/lightning/pytorch/strategies/strategy.py, lines 160 to 161 (at 631911c)
Additional context
No response
cc @justusschock @awaelchli @Borda