
Mean instead of sum when computing the expected_reward by episode #48

sylvainma opened this issue Apr 23, 2020 · 1 comment
sylvainma commented Apr 23, 2020

Hi,
According to most PyTorch implementations of the REINFORCE algorithm, the policy gradient loss should sum the log_probs over the trajectory (sum over t=1...T) instead of computing their mean. In the paper this is correctly a sum in equations 8/9/10; the only mean is over the N sampled episodes. I believe this is a mistake in the code only.

expected_reward = log_probs.mean() * (reward - baselines[key])

Should be

expected_reward = log_probs.sum() * (reward - baselines[key]) 

My assumption is that the authors chose to average instead of sum because the videos have different lengths, so the summed log-probabilities would scale with the video length.
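
For concreteness, here is a minimal sketch of the update I have in mind (names like episodes, baselines and key mirror the snippet above, but the loop itself is only illustrative and not the repository's actual training code):

import torch

policy_losses = []
for key, log_probs, reward in episodes:  # one sampled episode per entry
    # Sum the log-probabilities over the trajectory (t = 1...T), as in eqs. 8-10.
    advantage = reward - baselines[key]
    policy_losses.append(-log_probs.sum() * advantage)

# Average over the N sampled episodes and backpropagate.
loss = torch.stack(policy_losses).mean()
loss.backward()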

Please, tell me if I am wrong. Thanks!

sylvainma commented Apr 23, 2020

To give a bit of context, REINFORCE implementations usually compute a surrogate loss L such that, once differentiated with autograd, its gradient matches the theoretical policy gradient of J(theta).

[image: equations relating the surrogate loss L to the policy gradient of J(theta)]
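
Concretely, the standard estimator I am referring to (written here from memory, not copied from the paper) is

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big) \Big) \big( R^{(n)} - b \big)

so the surrogate loss to minimize is

L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{t=1}^{T} \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big) \Big) \big( R^{(n)} - b \big),

i.e. a sum over the trajectory and a mean over the N episodes, which is what log_probs.sum() followed by a mean over episodes implements.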
