Actor-Critic Methods
Question 1 of 2 · intermediate (5/10) · conceptual
Actor-critic methods use the advantage function $A(s, a) = Q(s, a) - V(s)$ in the policy gradient $\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot A(s, a)\right]$. Why subtract $V(s)$ from $Q(s, a)$?
A. $V(s)$ is a state-dependent baseline; subtracting it preserves the gradient's expectation while reducing variance
B. Subtracting $V(s)$ converts on-policy estimation to off-policy estimation; this is what makes actor-critic compatible with replay buffers
C. $Q(s, a) - V(s)$ is the *exact* policy gradient for soft actor-critic; without it the SAC update is biased
D. The advantage corrects for the time-discount factor $\gamma$, which would otherwise bias the policy gradient toward short-horizon rewards
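For concreteness, here is a minimal PyTorch sketch of the advantage-weighted update from the question. It is not part of the quiz source: the network sizes, batch shapes, and the use of sampled returns as a stand-in for $Q(s, a)$ are all illustrative assumptions.

```python
# Minimal sketch of the advantage-weighted policy gradient
#   grad_theta J = E[grad_theta log pi_theta(a|s) * A(s, a)],
# with A(s, a) = Q(s, a) - V(s). All names and shapes here are
# illustrative assumptions, not a specific library's API.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # actor pi_theta
value = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))           # critic V(s)

states = torch.randn(8, obs_dim)               # dummy batch of states
actions = torch.randint(0, n_actions, (8,))    # actions sampled from pi_theta
returns = torch.randn(8)                       # sampled returns standing in for Q(s, a)

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
values = value(states).squeeze(-1)
advantages = returns - values.detach()         # advantage estimate: A(s, a) ~ return - V(s)

actor_loss = -(log_probs * advantages).mean()  # gradient matches E[grad log pi * A]
critic_loss = (returns - values).pow(2).mean() # fit V(s) by regression to returns
(actor_loss + critic_loss).backward()
```

Note the `values.detach()` in the advantage: the baseline enters the actor's loss as a constant, so only the critic's regression loss trains $V(s)$.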