1 intermediate2 advancedAdapts to your performance
1 / 3
intermediate (6/10)conceptual
In the REINFORCE algorithm, the policy gradient is ∇θJ=E[∑t∇θlogπθ(at∣st)Gt]. Why is subtracting a baseline b(st) from Gt useful even though it does not change the expected gradient?