
Mean instead of sum when computing the expected_reward by episode #48

sylvainma opened this issue Apr 23, 2020 · 1 comment
sylvainma commented Apr 23, 2020

Hi,
According to most PyTorch implementations of the REINFORCE algorithm, the policy gradient loss should sum the log_probs over the trajectory (sum over t=1...T) instead of computing their mean. In the paper this is correctly a sum in equations 8/9/10; the only mean is over the N sampled episodes. I believe this is a mistake in the code only.

expected_reward = log_probs.mean() * (reward - baselines[key])

Should be

expected_reward = log_probs.sum() * (reward - baselines[key]) 

My assumption is that the authors chose to average instead of sum because the videos have different lengths, so the summed log-probabilities would scale with the video length.
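
For concreteness, here is a minimal sketch of the update I have in mind (names like episodes, baselines and key mirror the snippet above, but the loop itself is only illustrative and not the repository's actual training code):

import torch

policy_losses = []
for key, log_probs, reward in episodes:  # one sampled episode per entry
    # Sum the log-probabilities over the trajectory (t = 1...T), as in eqs. 8-10.
    advantage = reward - baselines[key]
    policy_losses.append(-log_probs.sum() * advantage)

# Average over the N sampled episodes and backpropagate.
loss = torch.stack(policy_losses).mean()
loss.backward()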

Please, tell me if I am wrong. Thanks!

sylvainma commented Apr 23, 2020

To give a bit of context, REINFORCE implementations usually compute a surrogate loss L such that, once differentiated with autograd, its gradient matches the theoretical policy gradient of J(theta).

[image: equations relating the surrogate loss L to the policy gradient of J(theta)]
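
Concretely, the standard estimator I am referring to (written here from memory, not copied from the paper) is

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big) \Big) \big( R^{(n)} - b \big)

so the surrogate loss to minimize is

L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{t=1}^{T} \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big) \Big) \big( R^{(n)} - b \big),

i.e. a sum over the trajectory and a mean over the N episodes, which is what log_probs.sum() followed by a mean over episodes implements.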
