
I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've understood so far and where I'm stuck:

In the Multiarmed Bandit Problem, we have a set of slot machines (arms), each with an unknown probability distribution of yielding rewards. The goal is to devise a strategy to maximize the total reward accumulated over a series of pulls.

One common approach is the UCB algorithm, where the idea is to balance between exploiting arms that have yielded high rewards so far and exploring arms that haven't been explored much yet.

What I've Done:

  • I've initialized each arm's parameters, such as the number of times it has been pulled and the total reward obtained from it.

  • I understand that the UCB formula typically involves a term for exploitation (mean reward) and a term for exploration (confidence interval), usually represented as
UCB = mean_reward + exploration_term
where exploration_term is typically proportional to sqrt(log(total_pulls) / arm_pulls).

My Issue: Despite understanding the concept, I'm struggling with the implementation of the UCB formula, particularly with determining the appropriate values for mean_reward and exploration_term. Here are my specific questions:

  •   How do I correctly calculate the mean reward for each arm?
    
  •   How should I properly compute the exploration term, and what should I use for total_pulls and arm_pulls in the formula?
    
  •   Are there any common pitfalls or mistakes to avoid when implementing the UCB algorithm?
    

1 Answer


As you said, each arm of the bandit problem follows a probability distribution with an unknown mean. The goal is to get a good estimate of the mean of every arm, because with good estimates you can identify the best arm.

Let's suppose we have $K$ arms and let's call $\mu_k$ the mean of arm $k \in \{1, \dots, K\}$.

Let's say that you know in advance how many pulls you will make in total (i.e. the number of rounds), and let's call this number $n$. This is what you called total_pulls. (You didn't mention whether $n$ is known in your case, but there are variants of UCB that work when you don't know $n$.)

During each round $s \in \{1, \dots, n\}$, you choose an arm $A_s \in \{1, \dots, K\}$, which gives you a random reward $X_s$ drawn from the distribution of arm $A_s$, whose mean is $\mu_{A_s}$.

A way to estimate the mean of a distribution is to take the empirical mean. So if you want to estimate the mean of arm $k$ at time $t$, you take the empirical mean of the rewards you obtained from arm $k$ up to time $t$. In other words, your estimate is $\hat{\mu}_k(t) = \frac{1}{N_k(t)} \sum\limits_{s=1}^t X_s \mathbb{I}_{A_s = k}$, where $N_k(t)$ is the number of times you pulled arm $k$ up to time $t$ (this is what you called arm_pulls), and $\mathbb{I}$ is the indicator function. Thus, $\hat{\mu}_k(t)$ is what you called mean_reward.
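
As a minimal sketch of these two quantities in Python, assuming you log the pair (arm pulled, reward received) at every round; all names here are illustrative, not part of any standard API:

    K = 5           # number of arms (illustrative)
    history = []    # list of (A_s, X_s) pairs, one per round played so far

    def N_k(k):
        """Number of times arm k has been pulled so far (your arm_pulls)."""
        return sum(1 for arm, _ in history if arm == k)

    def mu_hat(k):
        """Empirical mean of arm k (your mean_reward); only defined once the arm has been pulled."""
        pulls = N_k(k)
        return sum(reward for arm, reward in history if arm == k) / pulls if pulls > 0 else float("nan")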

But the bandit problem lies in the exploration-exploitation dilemma. If you keep in mind that you want to maximize the cumulative reward over a finite number of rounds $n$, you don't want to pull an arm that gives you a low reward too often. On the other hand, you still want to explore, because rewards are random, so you can never be sure you simply had bad luck with the rewards you received. Also, to get a good estimate of the true mean $\mu_k$ of an arm, you have to collect a lot of data from that arm.

The idea behind the classical UCB (Upper Confidence Bound) algorithm is to bound, with high probability, the true mean of each arm by the mean_reward plus the exploration_term you mentioned (for that you need a hypothesis on your rewards: for example independent and sub-Gaussian, or independent and bounded). The more you pull an arm, the closer this bound gets to its true mean. You then choose the arm that maximizes this bound, because you are optimistic about it. In other words, your upper confidence bound for arm $k$ at time $t$ is: $$UCB_k(t) = \begin{cases}+\infty & \text{if } N_k(t) = 0 \\ \hat{\mu}_k(t) + 2\sqrt{\log(n) / N_k(t)} & \text{otherwise.}\end{cases}$$

And then you choose the action $A_{t+1} = \arg\max_{k \in \{1, \dots, K\}} UCB_k(t)$.
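
For instance, a minimal sketch of this selection rule in Python, assuming you maintain one list of pull counts and one list of empirical means per arm and that the horizon $n$ is known (names are illustrative):

    import math

    def ucb_index(k, n, counts, means):
        """UCB_k(t) as defined above: +infinity for an arm never pulled, otherwise
        the empirical mean plus the exploration term 2*sqrt(log(n) / N_k(t)).
        counts[k] is N_k(t), means[k] is the empirical mean of arm k, n is the horizon."""
        if counts[k] == 0:
            return math.inf
        return means[k] + 2.0 * math.sqrt(math.log(n) / counts[k])

    def choose_arm(n, counts, means):
        """A_{t+1} = argmax over arms k of UCB_k(t)."""
        return max(range(len(counts)), key=lambda k: ucb_index(k, n, counts, means))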

Concerning your last question, I think the main difficulty in implementing the algorithm is to keep in mind that you do not have a single empirical mean, but one empirical mean per arm. Thus, you have to keep in memory:

  • the number of times you pulled each arm
  • the empirical mean for each arm.

Also, you can implement the algorithm without keeping all the rewards up to round $t$ in memory by using this little trick:

$$\hat{\mu}_k(t+1) = \begin{cases} \hat{\mu}_k(t) & \text{if $A_{t+1} \ne k$} \\ \frac{1}{N_k(t) + 1} (N_k(t) \hat{\mu}_k(t) + X_{t+1}) & \text{otherwise} \end{cases}$$

Because if you suppose that $A_{t+1} = k$, you have: \begin{align} \hat{\mu}_k(t+1) &:= \frac{1}{N_k(t+1)} \sum\limits_{s=1}^{t+1} X_s \mathbb{I}_{A_s = k} \\ &= \frac{1}{N_k(t) + 1} \left(\sum\limits_{s=1}^{t} X_s \mathbb{I}_{A_s = k} + X_{t+1}\right) \quad \text{by hypothesis} \\ &= \frac{1}{N_k(t) + 1} (N_k(t) \hat{\mu}_k(t) + X_{t+1}) \end{align}
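
A minimal sketch of this memory-light update, again with illustrative names (counts[k] stands for $N_k(t)$ and means[k] for $\hat{\mu}_k(t)$):

    K = 5                 # number of arms (illustrative)
    counts = [0] * K      # counts[k] = N_k(t), the number of pulls of arm k
    means = [0.0] * K     # means[k] = empirical mean of arm k (0.0 until the first pull)

    def update(k, reward):
        """Apply the recursion above after pulling arm k and observing X_{t+1} = reward:
        mu_hat_k(t+1) = (N_k(t) * means[k] + reward) / (N_k(t) + 1)."""
        means[k] = (counts[k] * means[k] + reward) / (counts[k] + 1)
        counts[k] += 1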
