[Question] Issue in training VMAS simple spread with MAPPO and IPPO #144
Hello, thanks for opening this, as I think it is a point many users face and it will benefit from an answer! The short answer is that tuning RL algorithms is a bit like dark magic, and PPO in particular suffers from catastrophic forgetting (the second issue you mention) and exploding gradients (the first one you mention). These issues do not normally show up for me with the vmas fine_tuned hyperparameters, but they can occur with specific reward structures or scenarios. What normally happens with NaN actions is that the loss explodes, the gradients explode, and the weights go to NaN during backpropagation, which then produces NaN actions. Here is some information on how to tackle this that I have gathered over the years (you are not the first user to encounter this, but I never made a public list):
Maybe try out some of these, but if the issue keeps happening with the default vmas/simple_spread task and the default fine_tuned vmas parameters, I can have a further look. Also, here are some points from @vmoens from the Discord channel on the utility of the loss metrics (maybe not relevant to your case, but good to have).
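For the exploding-gradient / NaN-action failure mode described above, a typical first step is to lower the learning rate and clip gradient norms. A minimal sketch, assuming BenchMARL's `ExperimentConfig` exposes the `lr`, `clip_grad_norm`, and `clip_grad_val` fields found in its base experiment yaml (the values below are illustrative, not recommendations):

```python
from benchmarl.experiment import ExperimentConfig

# Load the default experiment configuration from yaml, then tighten the
# optimisation settings that most often drive the loss and gradients to NaN.
experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.lr = 3e-5              # smaller update steps (illustrative value)
experiment_config.clip_grad_norm = True  # clip the global gradient norm at each optimiser step
experiment_config.clip_grad_val = 1.0    # tighter clipping threshold (illustrative value)
```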
Hello, I'm trying to train a policy for the VMAS simple spread environment using MAPPO and IPPO in BenchMARL.
However, I'm running into some issues while training, and it would be great if I could get any insight from you.
First, I tried MAPPO using the basic configuration given in BenchMARL. During training, the rewards and loss objectives are well behaved.
However, at some point it starts to output NaN actions, and I think the reason is that the gradient of the loss objective and the clip fraction go beyond a certain threshold that MAPPO can handle.
On the other hand, when I train IPPO, the mean reward starts to decrease at some point during training.
While I think this question belongs to general algorithm questions, I would appreciate it if you could give any insight on why this happens and how to avoid it.
Thank you
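For reference, a minimal sketch of how such a run is typically set up with BenchMARL's Python API (assuming the yaml defaults and that `VmasTask.SIMPLE_SPREAD` is the enum entry for the vmas/simple_spread task; swap `MappoConfig` for `IppoConfig` to get the IPPO variant):

```python
from benchmarl.algorithms import MappoConfig  # or IppoConfig for the IPPO run
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

# Build an experiment from the default yaml configurations.
experiment = Experiment(
    task=VmasTask.SIMPLE_SPREAD.get_from_yaml(),
    algorithm_config=MappoConfig.get_from_yaml(),
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=ExperimentConfig.get_from_yaml(),
)
experiment.run()
```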