[Question] Issue in training VMAS simple spread with MAPPO and IPPO #144

Open
kh-ryu opened this issue Nov 18, 2024 · 1 comment

Comments


kh-ryu commented Nov 18, 2024

Hello, I'm trying to train a policy for the VMAS simple_spread environment using MAPPO and IPPO in BenchMARL.
However, I'm running into some issues during training, and it would be great if I could get some insight from you.

First, I tried MAPPO using the basic configuration provided in BenchMARL. At first, the rewards and loss objectives are well behaved during training.
[Screenshots: reward and loss-objective curves during MAPPO training]
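(For reference, a run like this can be launched roughly as in the sketch below, which follows BenchMARL's README-style Python API; the SIMPLE_SPREAD task member and the default configs are assumptions, and the actual script is not shown in this issue.)

```python
# Hedged sketch of a MAPPO run on VMAS simple_spread with BenchMARL.
# The SIMPLE_SPREAD enum member is an assumption; check benchmarl.environments.VmasTask.
from benchmarl.algorithms import MappoConfig
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

experiment = Experiment(
    task=VmasTask.SIMPLE_SPREAD.get_from_yaml(),    # VMAS simple_spread scenario
    algorithm_config=MappoConfig.get_from_yaml(),   # default MAPPO hyperparameters
    model_config=MlpConfig.get_from_yaml(),         # actor network
    critic_model_config=MlpConfig.get_from_yaml(),  # centralised critic network
    seed=0,
    config=ExperimentConfig.get_from_yaml(),        # default experiment settings
)
experiment.run()
```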

However, at some point it starts to output NaN actions. I think this is because the gradient of the loss objective and the clip fraction go beyond a certain threshold that MAPPO can handle.
[Screenshots: loss-gradient norm and clip fraction spiking during training]

On the other hand, when I train IPPO, the mean reward starts to decrease partway through training.
[Screenshot: IPPO mean reward decreasing partway through training]

While I think this falls under general algorithm questions, I would appreciate any insight on why this happens and how to avoid it.

Thank you

@matteobettini
Collaborator

Hello, thanks for opening this! I think this is a point that many users face and that will benefit from an answer.

The short answer is that tuning RL algorithms is a bit of a dark art, and with PPO in particular there are these issues of catastrophic forgetting (the second one you mention) and exploding gradients (the first one you mention).

These issues do not normally pop up for me with the VMAS fine_tuned hyperparameters, but they can occur with specific reward structures or scenarios.

What normally happens with NaN actions is that the loss explodes, the gradients explode, and the weights become NaN during backpropagation, which then produces NaN actions.
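As an illustration (plain PyTorch, not a BenchMARL API), a hypothetical debugging helper like the one below can catch the moment this chain starts, by checking the gradient norm and the weights right after `loss.backward()`:

```python
import torch

def check_for_explosion(module: torch.nn.Module, max_grad_norm: float = 1e4) -> None:
    """Hypothetical helper: call after loss.backward() and before optimizer.step()
    to spot exploding gradients or NaN weights as early as possible."""
    grads = [p.grad for p in module.parameters() if p.grad is not None]
    if grads:
        # Total gradient norm across all parameters of the model.
        total_norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads])
        )
        if not torch.isfinite(total_norm):
            raise RuntimeError("Gradient norm is NaN/Inf: the loss has likely exploded.")
        if total_norm > max_grad_norm:
            print(f"Warning: gradient norm {total_norm:.2e} exceeds {max_grad_norm:.1e}")
    for name, param in module.named_parameters():
        if not torch.isfinite(param).all():
            raise RuntimeError(f"Parameter {name} already contains NaN/Inf values.")
```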

Here is some information on how to tackle this that I have gathered over the years (you are not the first user to encounter it, but I never made a public list):

  • The entropy coefficient might be conflicting with optimality: maybe you are pushing for higher entropy than the task requires?
  • Have you tried the VMAS training parameters in the BenchMARL fine_tuned folder? Those work for me in the current VMAS scenarios.
  • Increasing the eps coefficient of Adam could help with stability.
  • Also keep an eye on the clip value for the gradient norm (a sketch of how to override these settings follows this list).
  • Logging the point at which the norm of your gradients starts to explode could be helpful; maybe it is related to your agents entering a novel reward regime.
  • The scale of the MPE rewards is tuned quite badly, so some reward tuning would probably help here (though I see this might not make sense if you don't want to touch the original MPE task).
  • Check whether it also happens in the pettingzoo/simple_spread task.
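To make these knobs concrete, here is a hedged sketch of where they typically live when using the BenchMARL Python API (field names such as `adam_eps`, `clip_grad_norm`, `clip_grad_val` and `entropy_coef` are assumed to match the current config dataclasses and yaml files; check your installed version):

```python
from benchmarl.algorithms import MappoConfig
from benchmarl.experiment import ExperimentConfig

# Start from the defaults (or from the fine_tuned/vmas yaml) and override the
# stability-related knobs from the list above. Field names are assumptions based
# on the BenchMARL config dataclasses -- verify them against your version's yaml.
experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.adam_eps = 1e-5        # larger Adam eps for numerical stability
experiment_config.clip_grad_norm = True  # clip by gradient norm rather than value
experiment_config.clip_grad_val = 5.0    # threshold for the gradient-norm clip
experiment_config.lr = 3e-4              # a smaller learning rate can also help

algorithm_config = MappoConfig.get_from_yaml()
algorithm_config.entropy_coef = 0.0      # lower entropy pressure if it conflicts with optimality
```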

Maybe try out some of these, but if the issue keeps happening with the default vmas/simple_spread and the default fine_tuned VMAS params, I can have a further look.

Also, here are some points from @vmoens from the Discord channel on the utility of the loss metrics (maybe not relevant to your case, but good to have):

In RL the concept of a loss that always decreases isn't that relevant.
Most of the time you don't have one loss but multiple losses (say one or more per model). These models are "competing" against each other. Let me take an example to make this clear:
Imagine you're training your value network to predict the long-term return of your policy.
Initially, your reward is null or close to 0. The value network predicts a small return, and the L2 distance between that prediction and the true return is small. (Similarly, if you're in a household with low income, you can predict with good accuracy that you will have low savings at the end of the year.)
Progressively, your policy becomes better; as a consequence the rewards have a higher magnitude, and so does the value that you're predicting. Now the error of the value network is much bigger in magnitude, not because it's doing a worse job but because the landscape has changed. (Taking the household example: in a rich household, the absolute value of the savings you can have at the end of the year has a much greater variance, not because you're worse at predicting it but because of the scale you're looking at.)
That means that the loss of your critic is actually getting worse over time (i.e. higher, not lower).
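A tiny numerical illustration of this point (purely for intuition, not from the original discussion): with the same 10% relative prediction error, the L2 loss of the value network grows quadratically with the scale of the returns.

```python
import torch
import torch.nn.functional as F

for scale in (1.0, 10.0, 100.0):
    returns = torch.full((64,), scale)   # true returns at this reward scale
    prediction = returns * 0.9           # value net is off by the same 10% everywhere
    mse = F.mse_loss(prediction, returns)
    print(f"return scale {scale:>6.1f} -> value loss {mse.item():.4f}")

# Prints losses of 0.01, 1.0 and 100.0: the relative error is identical,
# yet the loss grows purely because the return scale changed.
```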

TL;DR: in RL the absolute value of the loss doesn't mean much; you will never see a paper reporting it. The only thing people care about is whether your rewards/returns get better.
If I can add something here: in some cases (like PPO) the loss is even more pointless. In PPO, the loss we compute is actually a proxy whose gradient is, in expectation, equal to the gradient of an objective function that is intractable (an integral you can't solve analytically). In other words, for PPO the loss value is total garbage.
My fav metrics to look at in RL are:

  • Reward/return curve: this is your P0 value, the only thing that really matters. Track this with RewardSum (see the sketch after this list).
  • Episode length (when it matters, e.g. CartPole). Track this with StepCounter.
  • Gradient norms of the various models (very high or unexpectedly low gradients may indicate ill-behaved learning curves; drastic changes can be worrying too if not correlated with better performance).
  • Exploding loss values (per se the loss is meaningless, but a value loss that goes to 10^38 probably means you're doing something very wrong!).
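For the first two metrics, TorchRL ships transforms that do the bookkeeping. A minimal sketch (using a Gym CartPole env purely as an example, since the multi-agent setup differs; the key names are the transforms' defaults) might look like this:

```python
from torchrl.envs import Compose, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.transforms import RewardSum, StepCounter

# Example single-agent env only; in a BenchMARL/VMAS setup the env construction
# differs, but the same transforms can be appended to track returns and lengths.
env = TransformedEnv(
    GymEnv("CartPole-v1"),
    Compose(
        RewardSum(),    # accumulates the per-episode return (default key "episode_reward")
        StepCounter(),  # counts steps per episode (default key "step_count")
    ),
)

rollout = env.rollout(max_steps=200)
# The accumulated return and the step count are written into the rollout tensordict.
print(rollout["next", "episode_reward"][-1])
print(rollout["next", "step_count"][-1])
```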

In general, any sudden change and/or extreme value requires an explanation.
