Refactor: Addressing Sources of User Error #73

thomasfortin1 · 2024-05-06T02:48:29Z

I made two changes which should help future users implement MuP correctly:

Previously, a user could mistakenly use mup.Adam, mup.AdamW, or mup.SGD (which are just the regular PyTorch optimizers) instead of the correct mup.MuAdam, mup.MuAdamW, or mup.MuSGD. The vanilla PyTorch optimizers can no longer be accessed through the mup package by accident.
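For reviewers who want to guard against the names creeping back in, a check along the following lines might be useful. This is only a sketch: `leaked_names` and the two stand-in modules are hypothetical and are not part of this PR or of the mup package; a real check would pass the imported `mup` module instead.

```python
import types

# Vanilla PyTorch optimizer names that should not be reachable through the package.
VANILLA_NAMES = ("Adam", "AdamW", "SGD")


def leaked_names(module, forbidden=VANILLA_NAMES):
    """Return the forbidden names that are still attributes of `module`."""
    return [name for name in forbidden if hasattr(module, name)]


# Stand-in modules for illustration only (assumption: they mimic the package
# namespace before and after this change; they are not the real mup package).
before = types.ModuleType("fake_mup_before")
before.Adam = before.AdamW = before.SGD = object()
before.MuAdam = before.MuAdamW = before.MuSGD = object()

after = types.ModuleType("fake_mup_after")
after.MuAdam = after.MuAdamW = after.MuSGD = object()

print(leaked_names(before))  # ['Adam', 'AdamW', 'SGD']
print(leaked_names(after))   # []
```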

If mup.MuAdam is used with weight decay, a warning now prompts the user to switch to mup.MuAdamW for correct weight decay scaling, as described in Appendix B.3 of the arXiv version of the paper. Note that a coord check will not flag MuAdam with weight decay as an incorrect implementation. In my experience, however, performance still eventually degrades as model size increases unless MuAdamW is used instead.
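The kind of check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `warn_if_weight_decay`, its arguments, and the message text are all assumptions, and it uses only the standard `warnings` module.

```python
import warnings


def warn_if_weight_decay(weight_decay=0.0, param_groups=()):
    """Hypothetical helper: warn if any weight decay value is nonzero.

    `weight_decay` is the optimizer-wide default and `param_groups` is an
    iterable of dicts that may carry their own "weight_decay" entry.
    Returns True if a warning was issued, False otherwise.
    """
    decays = [weight_decay]
    decays += [group.get("weight_decay", 0.0) for group in param_groups]
    if any(decay != 0 for decay in decays):
        warnings.warn(
            "MuAdam was given a nonzero weight decay; "
            "consider MuAdamW for correct weight decay scaling.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Usage: capture warnings to see when the check fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_weight_decay(weight_decay=0.0)                   # no warning
    warn_if_weight_decay(weight_decay=0.01)                  # warns
    warn_if_weight_decay(param_groups=[{"weight_decay": 0.1}])  # warns

print(len(caught))  # 2
```

Checking per-group values as well as the default matters because weight decay can be set on individual parameter groups rather than through the optimizer-wide keyword.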

thomasfortin1 · 2024-05-06T02:50:52Z

@microsoft-github-policy-service agree

thomasfortin1 added 2 commits May 5, 2024 22:15

removed mup.Adam, mup.AdamW, and mup.SGD from package

23245b8

added warning for using weight decay with MuAdam rather than MuAdamW

f169261
