Instability with HyperProp #2
I am not sure. I don't remember if I really tested the LaProp-based optimizer all that much. Is there any instability with other optimizers that use hypergradient descent? There could still be some bug in the repo too. Most of the optimizers are a mishmash of multiple techniques, so it may also be possible that some of these combinations simply don't work out, or even lead to some mathematical issues that I have missed; if you aren't using something standard, there may be issues with the combination idea itself. As for me, I think I mostly did some sanity runs with hypergrad for the first few epochs on a more basic optimizer. It didn't seem like there was anything obviously wrong from the changes in the lr, but it's hard to say just from that. My implementation of hypergrad is also a bit different from the original paper's. Not sure if that causes any instability.
I tested HyperProp on 4 tasks. The first two are fine (NVIDIA WaveGlow and ResNet image classification), one crashed when I used the weight decay param (if I remember the case right; maybe it just crashed anyway) (a fork of NVIDIA Tacotron 2), and the last task is a transformer image GAN. I also tested HDQHSGDW on the last task and it seems to be fine. What optimizer would you recommend as a default no-brainer drop-in replacement?
It depends on what you are trying to do. Research? Research on optimizers, specifically? Just applications? Or simply playing around with optimizers? I haven't personally tried the optimizers exhaustively, so I can't really tell which synergistic combination is good to go. The whole hyper-xxx series of optimizers may not have even been tried by anyone, especially given that my implementation has differences from the original hypergradient descent; plus, I think I also added some extra gradient computation to handle weight decay and other things. Generally, it's probably just better to stick with AdamW. This repo is mostly experimental and more for playing around. This should recover AdaMod + lookahead:
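Something along these lines, as a sketch: the flag names mirror the ones that come up later in this thread, the import path and the lr / weight decay values are placeholders, so check the README for the exact configuration.

```python
# Illustrative sketch only -- not the README's exact snippet.
from optimizers import DemonRanger  # assumed import path; check the repo

optimizer = DemonRanger(
    params=model.parameters(),   # `model` assumed from the surrounding training code
    lr=1e-3,                     # placeholder
    weight_decay=1e-2,           # placeholder
    betas=(0.9, 0.999, 0.999),   # restore the usual Adam/AdamW beta1
    nus=(1.0, 1.0),              # disables QH momentum
    use_demon=False,             # disables DEMON momentum decay
    use_gc=False,                # disables gradient centralization
    rectify=False,               # disables the RAdam rectifier
    amsgrad=False,               # disables AMSGrad
    # AdaMod is left at its default (enabled) and lookahead stays active
    # (k > 0, alpha < 1), which is what recovers "AdaMod + lookahead".
)
```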
Hypergradient descent is interesting conceptually. You can try it and compare. The original paper claims that even not-much-tuned hypergradient descent on the lr is better than tuning the lr in Adam and so on, but I don't think it was extensively tested on multiple tasks. The gradients for hypergradient descent get more complicated with AdaMod, I think, and there could be some bugs in my gradient computation in that case; it's probably slow too. If you want to try that, you should probably look at: HyperRangerMod
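For intuition, the core hypergradient-descent trick from the original paper is one extra scalar update per step: move the learning rate in the direction given by the dot product of the current and previous gradients. A toy SGD-HD sketch, not this repo's implementation:

```python
import torch

# Toy SGD with hypergradient descent on the learning rate (SGD-HD):
#   lr_t = lr_{t-1} + hypergrad_lr * <g_t, g_{t-1}>
# Bare-bones sketch; the repo's versions add weight decay, AdaMod, etc.
def sgd_hd_step(params, prev_grads, lr, hypergrad_lr=1e-8):
    grads = [p.grad.detach().clone() for p in params]
    if prev_grads is not None:
        h = sum(torch.sum(g * pg) for g, pg in zip(grads, prev_grads))
        lr = lr + hypergrad_lr * h.item()  # hypergradient update of the lr itself
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-lr)           # ordinary SGD step with the adapted lr
    return grads, lr
```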
Production/hobby.
Hmm...not sure.
I had a similar experience in my application (speech enhancement): all optimizers gave much worse performance except DemonRanger, which matched the performance of my default optimizers (Adam, RAdam, Ranger, AdamW), although I did not play around with the optimizers' parameters much.
@saurabh-kataria @hadaev8 Also, when you say all optimizers gave much worse performance, did you mean that even the recovered AdamW optimizer (from the README of this repo) gives worse performance than, say, your default AdamW?
For now, I only tried all optimizers with 2-3 features disabled.
@hadaev8 It depends on what you mean by "not same". On the surface, my implementation is a more general optimizer from which AdamW can be recovered with specific hyperparameters, so from that perspective it is of course different from vanilla PyTorch. However, if you notice some significant difference in the core AdamW-related operations within the source code, let me know. There shouldn't be any intended difference in the operations at the semantic level once the same hyperparameters are used. Though, I don't remember if I ever referred to the vanilla PyTorch AdamW; I think I looked at some other repo for reference.
You use add while vanilla uses mul.
@hadaev8
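If the "add" refers to subtracting lr * wd * p directly from the weights (rather than folding the decay into the gradient), the two forms are algebraically the same; a quick sketch:

```python
import torch

p = torch.randn(5)
lr, wd = 1e-3, 1e-2

decayed_add = p - lr * wd * p     # "add" form: p.add_(p, alpha=-lr * wd)
decayed_mul = p * (1 - lr * wd)   # "mul" form used by vanilla AdamW

assert torch.allclose(decayed_add, decayed_mul)
```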
I meant I compared with the PyTorch internal implementations of Adam, AdamW, etc. For Ranger, I meant comparison with: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
@JRC1995
Also
@hadaev8 I have actually used AdaMod and QHMomentum together in a project without such issues. The code was a bit different (though it was based on the code here, so it shouldn't really be different), with everything else in the config left at the defaults, and with grad norm clipping applied outside the optimizer.
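A sketch of that clipping pattern with the standard PyTorch utility (`model`, `loss`, and `optimizer` are assumed from the surrounding training loop, and the max_norm value is a placeholder):

```python
import torch

loss.backward()
# clip the global gradient norm before the optimizer step (placeholder max_norm)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```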
Though in this case grad norm clipping may be the "magic ingredient" (not sure whether I would have faced, or did face, any issues without it). It was a Transformer model with a few differences from the original; the most significant one (in terms of changing the training dynamics) was probably the use of ReZero.
Yes, the exact value of the grad norm probably wouldn't make or break things; I was just wondering whether the absence or presence of grad norm clipping itself might be a factor, but it seems like it isn't, since we were both using it. I think in my case I am using gc sort of indiscriminately, but originally it was used only for convnets... not sure, I forget. I also had a bad time with gc when using ReZero, because I was using a scalar parameter, and with gc the gradient of a scalar parameter just becomes zero, so scalar parameters are never learned. Thus the way you initialize the parameters also plays a role (for example, with the same setup as before for ReZero, if I instead initialize the scalar ReZero alpha parameters for multiple layers together in a batch (in a vector), gc would not turn the gradient into 0). I think at the very least I updated the code after that to disable gc for scalar parameters, but there can still be other things that the implementation is not too careful about.
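A tiny sketch of why gc zeroes a scalar ReZero alpha: centralization subtracts the gradient's own mean, which for a 0-dim parameter is the gradient itself (simplified here; real gc centralizes over the non-output dimensions, and the repo skips scalar parameters as mentioned above):

```python
import torch

def centralize(grad):
    # simplified gradient centralization: subtract the gradient's mean
    return grad - grad.mean()

scalar_alpha_grad = torch.tensor(0.3)                # one 0-dim ReZero alpha
batched_alpha_grad = torch.tensor([0.3, 0.1, -0.2])  # alphas for several layers in one vector

print(centralize(scalar_alpha_grad))   # tensor(0.) -- the scalar never gets a learning signal
print(centralize(batched_alpha_grad))  # non-zero entries survive
```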
The condition grad.dim() > 1 should cover ReZero, right?
QHMomentum + gc also raised NaNs.
Yes. I already have a condition for that, though it's uglier; I didn't know about dim.
Probably because that was the default for QHAdam. In the readme, you can see that I change the beta1 when using Adam without QH momentum.
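Roughly, the distinction being referred to; the exact numbers here are assumptions, so check the README for the real defaults:

```python
# Assumed values for illustration only.
betas_with_qh    = (0.999, 0.999, 0.999)  # high beta1, QHAdam-style default
betas_without_qh = (0.9,   0.999, 0.999)  # beta1 restored to the usual Adam/AdamW value
nus_no_qh        = (1.0,   1.0)           # nu = 1 recovers plain Adam-style momentum
```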
weight_decay + rectify + use_gc + amsgrad is ok
So, every modification to Adam seems to be OK except QH. I tried HyperProp again (all features disabled), and the lr for one parameter seems to be far too large. Also, with HDM=False and hypergrad_lr=1e-8 (that should be the default setup, right?), the learning rate does not change.
Yes, you can try to track which parameter is experiencing that. It could be that hypergradient lr is not that stable in a more complex setting. IIRC, the original code was for a simpler setting without weight decay (and without considering gradients for that), and without proper per-parameter individual gradient changes. Instead of HyperProp I would actually try out HDQHSGDW or HyperRangerMod to play around with hypergrad lr. Also, if you add diffGrad, I am not sure, but it can mean that the hypergradient will require different gradients (so you would have to manually change the gradient math inside the code); sorry about that.

I think HyperProp has the highest chance of having a bug, since I didn't try it out much IIRC, and its gradient computation has a higher chance of containing a mistake (at least for the other two, I think I could extend more easily from the pre-established hypergradient math in the paper or repo, though it had to be extended to add AdaMod and weight decay). Furthermore, new out-of-nowhere optimizers like LaProp also probably have a higher chance of not being "good" --- there's a reason why most people still use Adam/AdamW: it has been consistently decent and has sort of stood the test of time, even though there may even have been theoretical mistakes. HyperProp is a sort of wild combination of a rarely used and tested technique (hypergradient lr) and a new optimizer out of nowhere (LaProp), which can easily lead to unstable or weird stuff.

For a more grounded experiment with hypergradients, that's why I would recommend HyperRangerMod (which extends an Adam/DemonRanger base with a hypergradient option) or HDQHSGDW (which extends good ol' SGD with momentum). The latter is also interesting because even now almost nothing really beats a well-tuned SGD or SGD with momentum; Adams and so on are better for getting good results quickly, but simpler SGDs usually win out in the long run. The problem, I guess, is "well tuning" it, which may take more work (idk, I am not really an optimizer guy --- I just made this repo when I was in the mood); in this case, hypergradient lr may have the potential to remedy that. HDQHSGDW also adds QHMomentum, since that was shown in the paper to be a good addition to SGD with momentum too. It's also more minimalistic in this sense.
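One minimal way to track which parameter is blowing up, using only standard PyTorch (the helper name and the top-k choice are just illustrative):

```python
import torch

def report_extremes(model, top_k=5):
    # Log the largest absolute weight and gradient per named parameter,
    # to spot which tensor is receiving an unreasonably large effective update.
    stats = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        stats.append((p.detach().abs().max().item(),
                      p.grad.detach().abs().max().item(),
                      name))
    for w_max, g_max, name in sorted(stats, reverse=True)[:top_k]:
        print(f"{name}: |w|_max={w_max:.3e}  |grad|_max={g_max:.3e}")
```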
I don't really know what the "defaults" are... for LaProp there probably isn't any real official default. For Adam, I don't remember if there was a good recommended default. It was probably something like what you posted, but you would have to check the hypergrad repo and/or the paper to confirm.
Yes, so far only one experiment beat vanilla AdamW: RAdam + AdaMod + weight decay. I'm not using diffGrad for HyperProp, and I also turned off nostalgia, demon, gc, and nu. HDQHSGDW was very slow in the short term; also, it seems like nobody uses SGD for NLP tasks. Also, maybe it's important: I use a dropout scheduler for the first 10% of training. Maybe it adds instability.
Yes, it seems AdaMod and RAdam would both reduce the initial variance, and more. You can still probably try using them together, though there probably would not be much of a benefit. Short-term measurement can be tricky. I think lookahead would tend to regularize more, which can lead to slower fitting --- though I guess if you are plotting validation performance, I can't say that adding regularization would necessarily decrease it in the short term. I have also tried lowering p for PAdam, but it seemed to make training look worse in the short term; I suspect it makes things better in the long run, but I didn't have the patience for it, and I think I wasn't really finding much benefit from it even with longer runs.
Btw, here is an inconsistency: not all optimizers accept alpha=1.
If you set alpha=1 you essentially make lookahead redundant. A more principled way to disable lookahead is setting k=0, IIRC; that deactivates all the lookahead computation. I think the initial range checking is a bit inconsistent across optimizers (some accepting alpha=1, some not), but ultimately it doesn't matter much --- you can get away with k=0 whenever you want alpha=1.
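In code, the two ways of switching lookahead off that are being compared (the calls are illustrative; some optimizers' range checks reject alpha=1, which is the inconsistency mentioned above):

```python
# alpha=1.0 makes the lookahead "slow weights" identical to the fast weights,
# so lookahead has no effect but its bookkeeping still runs.
optimizer = DemonRanger(params=model.parameters(), lr=1e-3, alpha=1.0)

# k=0 skips the lookahead computation entirely (the more principled switch).
optimizer = DemonRanger(params=model.parameters(), lr=1e-3, k=0, alpha=1.0)
```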
I mean I disabled lookahead with k=0, alpha=1, like in the readme.
Fixed it.
I didn't get this part. Could you elaborate on what you mean by "lr change performed by sgd"?
https://openreview.net/forum?id=BkrsAzWAb
I think they just meant it loosely, or in a more general sense. It's still based on per-batch statistics. The original paper itself applies hypergradient descent to Adam, and HyperRanger also allows applying the hypergradient to Adam, if that's what you were asking for.
Just checked the code, and the hypergradient for Adam (and its variations in the repo) uses the adaptive moments for the lr too. Never mind then.
Btw, I wanna experiment with lr dropout.
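For reference, the idea in learning-rate dropout is to randomly zero a subset of the per-coordinate updates at each step; a toy sketch on top of plain SGD, not the code being discussed here:

```python
import torch

def sgd_step_with_lr_dropout(params, lr=1e-2, keep_prob=0.5):
    # Each coordinate's update is kept with probability keep_prob and
    # dropped (zeroed) otherwise, i.e. its learning rate is dropped to 0.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            mask = torch.bernoulli(torch.full_like(p, keep_prob))
            p.add_(mask * p.grad, alpha=-lr)
```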
Not sure. It doesn't look like cloning is necessary there, though it probably shouldn't hurt too much if it's there and unnecessary.
I tried it on two tasks but got NaNs during training. Any suggestions?