Hi,
According to most PyTorch REINFORCE implementations, the policy gradient loss should sum the log_probs over the trajectory (sum over t = 1...T) instead of computing their mean. In the paper this is correctly summed in equations 8/9/10; the only mean is over the N episodes. I believe this is a mistake in the code only.
To set a bit of context, REINFORCE implementations usually compute a loss L whose gradient, once computed with autograd, matches the theoretical policy gradient of J(theta).
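For reference, here is a minimal sketch (my own illustration, not the repository's code) of such a surrogate loss: the log_probs are summed over each trajectory, and only the N episodes are averaged.

```python
import torch

def reinforce_loss(log_probs, rewards, baselines):
    """Surrogate loss whose autograd gradient is the REINFORCE policy gradient."""
    per_episode = []
    for lp, r, b in zip(log_probs, rewards, baselines):
        # Sum the log-probabilities over the trajectory t = 1..T_n (no mean here).
        per_episode.append(-(r - b) * lp.sum())
    # The only mean is over the N episodes.
    return torch.stack(per_episode).mean()

# Toy usage with episodes of different lengths:
log_probs = [torch.randn(t, requires_grad=True) for t in (5, 8, 3)]
loss = reinforce_loss(log_probs, rewards=[1.0, 0.3, 0.7], baselines=[0.5, 0.5, 0.5])
loss.backward()
```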
pytorch-vsumm-reinforce/main.py, line 131 in fdd03be:
It should be a sum over the trajectory instead of a mean, as in the equations.
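As a minimal sketch of the proposed change, with hypothetical variable names rather than the actual code at line 131:

```python
import torch

# Hypothetical names for illustration only; not the repository's actual code.
log_probs = torch.log(torch.rand(10))  # per-step log-probabilities of one episode
reward, baseline = 0.8, 0.5

# Current behaviour: the mean over t rescales the gradient by 1/T,
# so its magnitude depends on the video length T.
loss_mean = -(reward - baseline) * log_probs.mean()

# Proposed behaviour: sum over t = 1..T, as in equations 8/9/10.
loss_sum = -(reward - baseline) * log_probs.sum()
```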
My assumption is that the authors averaged instead of summing because the videos have different lengths.
Please tell me if I am wrong. Thanks!