If Expected SARSA is off-policy, and SARSA is just an MC estimate of Expected SARSA, why is it on-policy?

Question

So, Expected SARSA defines the update as: $$ Q(s,a) = Q(s,a) +\alpha (R+ \mathbb{E}_{a\sim\pi(s')}[Q(s', a)] - Q(s,a)) $$ whereas SARSA samples $a'\sim\pi(s')$ and defines the update as: $$ Q(s,a) = Q(s,a) +\alpha (R+ Q(s', a') - Q(s,a)) $$
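To make the two updates concrete, here is a minimal Python sketch of both targets. The names (`expected_sarsa_target`, `sarsa_target`, `td_update`), the tabular layout `Q[s][a]`, and the policy table `pi[s]` are assumptions for illustration only. The `gamma` argument is also an addition: the equations above have no discount factor, which corresponds to the default `gamma=1.0`.

```python
import random

def expected_sarsa_target(Q, r, s_next, pi, gamma=1.0):
    """R + gamma * E_{a ~ pi(s')}[Q(s', a)], where pi[s] lists action probabilities."""
    return r + gamma * sum(p * q for p, q in zip(pi[s_next], Q[s_next]))

def sarsa_target(Q, r, s_next, pi, rng, gamma=1.0):
    """R + gamma * Q(s', a') for a single action a' sampled from pi(s')."""
    actions = list(range(len(Q[s_next])))
    a_next = rng.choices(actions, weights=pi[s_next])[0]
    return r + gamma * Q[s_next][a_next], a_next

def td_update(Q, s, a, target, alpha):
    """Shared update rule: move Q(s, a) a step of size alpha toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical two-state, two-action table.
Q = [[0.0, 0.0], [1.0, 3.0]]
pi = [[0.5, 0.5], [0.25, 0.75]]
exp_target = expected_sarsa_target(Q, r=1.0, s_next=1, pi=pi)
sampled_target, a_next = sarsa_target(Q, r=1.0, s_next=1, pi=pi, rng=random.Random(0))
```

The only difference between the two is whether the value of $s'$ is averaged over $\pi(s')$ or read off at one sampled action.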

So how is SARSA not just an MC estimate of Expected SARSA? And since an MC estimate is unbiased, why shouldn't SARSA also be an off-policy algorithm?

Edit: since it seems that, for some reason, it's not clear which MC estimate I'm referring to, my question is:

The two updates differ only by an expected value, and $Q(s', a'\sim\pi(s')) \approx \mathbb{E}_{a\sim\pi(s')}[Q(s', a)]$, because the left-hand side is a one-sample Monte Carlo estimate of the right-hand side, which is an unbiased estimate. So I see no reason why one should be on-policy and the other off-policy, since in expectation they lead to the same update.
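Spelled out, holding $(s, a, R, s')$ fixed and assuming $a'$ is drawn from $\pi(s')$, the expectation of the SARSA error term over $a'$ is:

$$ \mathbb{E}_{a'\sim\pi(s')}\big[R + Q(s', a') - Q(s,a)\big] = R + \sum_{a}\pi(a\mid s')\,Q(s', a) - Q(s,a) = R + \mathbb{E}_{a\sim\pi(s')}[Q(s', a)] - Q(s,a) $$

which is exactly the Expected SARSA error term for the same $\pi$.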

cinch · Accepted Answer · 2024-11-08 05:37:50Z


Even if you view SARSA as effectively approximating Expected SARSA by sampling a single action at each update step, which makes it a one-sample MC estimate of the expected Q-value within an episode, SARSA remains on-policy. Its later updates in the same episode depend on the specific action sampled from its current $\epsilon$-greedy policy, and that policy in turn reflects the cumulative effect of all the actions chosen before. Furthermore, the behavior policy in later episodes is influenced by earlier episodes through the feedback loop between the policy and the Q-values.
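As a rough illustration of that feedback loop, here is a minimal sketch of one tabular SARSA episode. The environment function `chain_step`, the helper names, and the undiscounted update (matching the equations in the question) are all assumptions made for this example, not part of any library.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon, rng):
    """Draw one action from the epsilon-greedy policy implied by q_values."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(list(range(len(q_values))), weights=probs)[0]

def chain_step(s, a):
    """Hypothetical 2-state chain: action 1 moves right, action 0 stays.
    Moving right from state 1 ends the episode with reward 1."""
    if a == 1:
        return (1.0, None, True) if s == 1 else (0.0, s + 1, False)
    return 0.0, s, False

def sarsa_episode(Q, step, start_state, alpha, epsilon, rng, max_steps=100):
    s = start_state
    a = sample_action(Q[s], epsilon, rng)
    for _ in range(max_steps):
        r, s_next, done = step(s, a)
        if done:
            Q[s][a] += alpha * (r - Q[s][a])
            break
        # a_next is sampled from the policy implied by the current Q ...
        a_next = sample_action(Q[s_next], epsilon, rng)
        Q[s][a] += alpha * (r + Q[s_next][a_next] - Q[s][a])
        # ... and the same a_next is the action actually executed next.
        s, a = s_next, a_next
    return Q

rng = random.Random(0)
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    sarsa_episode(Q, chain_step, start_state=0, alpha=0.1, epsilon=0.1, rng=rng)
```

In this sketch the sampled `a_next` is used twice: once inside the update target and once as the action executed next, while each change to `Q` alters the policy that later samples come from.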

In contrast, in MC methods the policy is typically held fixed within each episode, and the update waits until the entire episode has finished: the Q-value of a given state-action pair is then updated from the empirical return observed from that pair to the end of the episode. Therefore SARSA cannot simply be viewed as a one-sample MC estimate of Expected SARSA.
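For comparison, the MC procedure described above can be sketched as follows. This is an every-visit, constant-step-size variant chosen for brevity; the function name, the episode format, and the `gamma` parameter are assumptions for illustration.

```python
def mc_every_visit_update(Q, episode, alpha, gamma=1.0):
    """Update Q from one finished episode given as [(state, action, reward), ...].

    No bootstrapping: each Q(s, a) moves toward the full empirical return G
    observed from that step to the end of the episode.
    """
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G  # return from this step onward
        Q[s][a] += alpha * (G - Q[s][a])
    return Q

# Hypothetical finished episode: (state 0, action 1, reward 0), then (state 1, action 1, reward 1).
Q = [[0.0, 0.0], [0.0, 0.0]]
mc_every_visit_update(Q, [(0, 1, 0.0), (1, 1, 1.0)], alpha=0.5)
```

Nothing is updated until the whole episode is available, and the target is the observed return rather than another Q-value.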

edited Nov 8, 2024 at 5:37

answered Nov 7, 2024 at 6:48


I'm not talking about MC methods; I'm saying that SARSA is a one-sample MC estimate of the Expected SARSA update.

– Alberto
Commented Nov 7, 2024 at 7:21

@Alberto OK, but please note that your question is literally "So how is SARSA not just a MC estimate of ExpSARSA?", not something like a one-sample MC estimate. If that's the case, do you imagine an MC method where each independent episode is Expected SARSA, or do you view Expected SARSA as an MC estimate of SARSA, or something else? Without more clarification this is very confusing, because several combinations are possible.

– cinch
Commented Nov 7, 2024 at 7:40

"One-sample MC" and "an MC estimate" are the same thing: you are estimating an expected value with a single point drawn from the distribution.

– Alberto
Commented Nov 7, 2024 at 9:04

@Alberto I've updated my answer per your update; I hope it clarifies things.

– cinch
Commented Nov 8, 2024 at 5:38

Yes, but my point is: it's not the update that makes SARSA on-policy, but rather the fact that you use the same policy for exploring.

– Alberto
Commented Nov 8, 2024 at 11:58
