$\begingroup$

So, Expected SARSA defines the update as:

$$ Q(s,a) = Q(s,a) +\alpha (R+ \mathbb{E}_{a\sim\pi(s')}[Q(s', a)] - Q(s,a)) $$

whereas SARSA samples $a'\sim\pi(s')$ and defines the update as:

$$ Q(s,a) = Q(s,a) +\alpha (R+ Q(s', a') - Q(s,a)) $$

So how is SARSA not just an MC estimate of ExpSARSA? And since MC is unbiased, why shouldn't SARSA also be an off-policy algorithm?


Edit: since it seems it isn't clear which MC estimate I'm referring to, my question is:

The two updates differ only by an expected value, and $Q(s', a') \approx \mathbb{E}_{a\sim\pi(s')}[Q(s', a)]$ for $a'\sim\pi(s')$, since it is a one-sample Monte Carlo estimate, which is unbiased. So there is no reason why one should be on-policy and the other off-policy, since in expectation they lead to the same update.
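The unbiasedness claim can be checked numerically. Below is a minimal sketch, assuming made-up values for $Q(s', \cdot)$, $\pi(s')$, and $R$ (none of these numbers come from a real task, and the function names are hypothetical). It averages many sampled SARSA targets and compares the result with the Expected SARSA target, with no discount factor, as in the updates above.

```python
import random

def expected_sarsa_target(reward, q_next, policy):
    """R + E_{a ~ pi(s')}[Q(s', a)], computed exactly from the policy probabilities."""
    return reward + sum(p * q for p, q in zip(policy, q_next))

def sarsa_target(reward, q_next, policy, rng):
    """R + Q(s', a') for a single action a' sampled from pi(s')."""
    a_next = rng.choices(range(len(q_next)), weights=policy, k=1)[0]
    return reward + q_next[a_next]

rng = random.Random(0)          # fixed seed so the run is repeatable
q_next = [1.0, 3.0, -2.0]       # assumed Q(s', a) for three actions
policy = [0.2, 0.5, 0.3]        # assumed pi(a | s')
reward = 0.5                    # assumed R

exact = expected_sarsa_target(reward, q_next, policy)

# Average many independent one-sample SARSA targets.
n = 100_000
mc_mean = sum(sarsa_target(reward, q_next, policy, rng) for _ in range(n)) / n

print(f"Expected SARSA target: {exact:.3f}")
print(f"Mean of {n} SARSA targets: {mc_mean:.3f}")
```

The two printed numbers should agree up to sampling noise, which only says that the targets match in expectation when $a'$ is drawn from the same $\pi(s')$ that appears in the expectation.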

$\endgroup$

1 Answer

$\begingroup$

Even if you view SARSA as "effectively" approximating ExpSARSA by sampling a single action at each update step, which makes it a one-sample MC estimate of the expected Q-value within a single episode, SARSA remains on-policy. Its later updates in the same episode depend on the specific action sampled from its current $\epsilon$-greedy policy, and that policy in turn reflects the cumulative contributions of all previously chosen actions. Furthermore, the behavior policy in later episodes is influenced by earlier episodes through the feedback loop between the policy and the Q-values.

In contrast, in MC methods the policy is typically held fixed within each episode, and the Q-value of a given state-action pair is updated only after the episode has finished, using the empirical return from that state-action pair to the end of the episode. Therefore SARSA cannot simply be viewed as a one-sample MC estimate of ExpSARSA.
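To make concrete how SARSA's target depends on the policy that actually samples the next action, here is a rough sketch under assumed numbers (the Q-values, the reward, the greedy target policy, and the $\epsilon$-greedy behavior policy with $\epsilon = 0.3$ are all hypothetical, and there is no discount factor). It is only an illustration of the sampled target, not a full SARSA implementation.

```python
import random

def expected_target(reward, q_next, target_policy):
    """R + E_{a ~ target_policy}[Q(s', a)], computed exactly."""
    return reward + sum(p * q for p, q in zip(target_policy, q_next))

def mean_sampled_target(reward, q_next, sampling_policy, n, rng):
    """Average of R + Q(s', a') over n actions a' drawn from sampling_policy."""
    actions = rng.choices(range(len(q_next)), weights=sampling_policy, k=n)
    return sum(reward + q_next[a] for a in actions) / n

rng = random.Random(1)           # fixed seed so the run is repeatable
q_next = [1.0, 3.0, -2.0]        # assumed Q(s', a) for three actions
reward = 0.5                     # assumed R

greedy = [0.0, 1.0, 0.0]         # assumed target policy: greedy w.r.t. q_next
eps_greedy = [0.1, 0.8, 0.1]     # assumed behavior: epsilon-greedy, epsilon = 0.3

# Expectation under the greedy policy, computed without sampling any action.
target_under_greedy = expected_target(reward, q_next, greedy)

# Sampled targets follow whichever policy picked a': here, epsilon-greedy.
sampled_mean = mean_sampled_target(reward, q_next, eps_greedy, 100_000, rng)
target_under_behavior = expected_target(reward, q_next, eps_greedy)

print(f"Expectation under greedy policy:   {target_under_greedy:.3f}")
print(f"Expectation under behavior policy: {target_under_behavior:.3f}")
print(f"Mean of sampled targets:           {sampled_mean:.3f}")
```

In this sketch the sampled targets average to the expectation under the policy that drew $a'$, not under the greedy policy, which is the sense in which the sampled update stays tied to the policy being followed.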

$\endgroup$
  • $\begingroup$ I'm not talking about the MC method; I'm saying that SARSA is a one-sample MC estimate of the ExpSARSA update. $\endgroup$ Commented Nov 7, 2024 at 7:21
  • $\begingroup$ @Alberto OK, but please note that your question is exactly "So how is SARSA not just a MC estimate of ExpSARSA?", not something like a one-sample MC estimate. If that's the case, do you imagine an MC method where each independent episode is ExpSARSA, or do you view ExpSARSA as an MC estimate of SARSA, or something else? It could be very confusing here without more clarification, because there are several possible combinations. $\endgroup$ Commented Nov 7, 2024 at 7:40
  • $\begingroup$ "One-sample MC" and "an MC estimate" are the same thing: you are estimating an expected value with a single sample from that distribution. $\endgroup$ Commented Nov 7, 2024 at 9:04
  • $\begingroup$ @Alberto I've updated my answer per your update, hope it clarifies. $\endgroup$ Commented Nov 8, 2024 at 5:38
  • $\begingroup$ Yes, but my point is: it's not the update that makes SARSA on-policy, but rather the fact that you use the same policy for exploring. $\endgroup$ Commented Nov 8, 2024 at 11:58
