As you said, each arm of the bandit problem follows a probability distribution with an unknown mean. The goal is to get a good estimate of the means of all arms, because then you would be able to determine the best arm.
Let's suppose we have $K$ arms and let's call $\mu_k$ the mean of arm $k \in \{1, \dots, K\}$.
Let's say that you know how many pulls you will make in total (i.e. the number of rounds), and let's call $n$ this number. This is what you called total_pulls. (You didn't mention whether $n$ was known in your case, but there are some variants of UCB that work when you don't know $n$.)
During each round $s \in \{1, \dots, n\}$, you choose an arm $A_s \in \{1, \dots, K\}$, which gives you a random reward $X_s$ drawn from the distribution of arm $A_s$, with mean $\mu_{A_s}$.
A way to estimate the mean of a distribution is to take the empirical mean. So if you want to estimate the mean of arm $k$ at time $t$, you take the empirical mean of the rewards you obtained from arm $k$ up to time $t$. In other words, your estimate would be $\hat{\mu}_k(t) = \frac{1}{N_k(t)} \sum\limits_{s=1}^t X_s \mathbb{I}_{A_s = k}$, where $N_k(t)$ is the number of times you pulled arm $k$ up to time $t$ (this is what you called arm_pulls), and $\mathbb{I}$ is the indicator function. Thus, $\hat{\mu}_k(t)$ is what you called mean_reward.
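Just to make the notation concrete, here is a naive sketch (in Python, with placeholder names `history` and `K` of mine) that computes these empirical means from a stored list of (arm, reward) pairs; further down there is a way to avoid storing the whole history.

```python
# Naive computation of the empirical means from a stored history of
# (arm, reward) pairs. `history` and `K` are placeholder names for this sketch.
def empirical_means(history, K):
    sums = [0.0] * K
    counts = [0] * K
    for arm, reward in history:
        sums[arm] += reward
        counts[arm] += 1
    # the mean is undefined (None here) for an arm that was never pulled
    return [sums[k] / counts[k] if counts[k] > 0 else None for k in range(K)]
```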
But the bandit problem lies in the exploration-exploitation dilemma. Since you want to maximize the cumulative reward in a finite number of rounds $n$, you don't want to pull an arm that gives low rewards too often. On the other hand, you still want to explore, because rewards are random and you can never be sure you simply had bad luck with the rewards you received. Also, to get a good estimate of the true mean $\mu_k$ of an arm, you need a lot of data from that arm.
The idea behind the classical UCB (Upper Confidence Bound) algorithm is to bound, for each arm and with high probability, the true mean by the mean_reward plus the exploration_term you mentioned (for that you need a hypothesis on your rewards: for example independent and sub-Gaussian, or independent and bounded). The more you pull an arm, the closer this bound gets to the true mean of that arm. You then choose the action that maximizes this bound, because you are optimistic about it (optimism in the face of uncertainty).
In other words, your upper confidence bound for arm $k$ at time $t$ will be:
$$UCB_k(t) = \begin{cases} +\infty & \text{if } N_k(t) = 0 \\
\hat{\mu}_k(t) + 2\sqrt{\log(n) / N_k(t)} & \text{otherwise.} \end{cases}$$
And then you choose the action $A_{t+1} = \arg\max_{k \in \{1, \dots, K\}} UCB_k(t)$.
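As a minimal sketch of this selection rule (in Python, assuming 0-indexed arms, with `means[k]` and `counts[k]` standing for $\hat{\mu}_k(t)$ and $N_k(t)$; these names are mine, not yours):

```python
import math

def choose_arm(means, counts, n):
    """Return the arm maximizing the UCB index at the current round."""
    best_arm, best_ucb = 0, -math.inf
    for k in range(len(means)):
        if counts[k] == 0:
            return k  # UCB_k(t) = +infinity: pull every arm at least once
        ucb = means[k] + 2.0 * math.sqrt(math.log(n) / counts[k])
        if ucb > best_ucb:
            best_arm, best_ucb = k, ucb
    return best_arm
```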
Concerning your last question, I think the main difficulty in implementing the algorithm is to keep in mind that you don't have a single empirical mean, but one empirical mean per arm. Thus, you have to keep in memory (as sketched right after this list):
- the number of times you pulled each arm
- the empirical mean for each arm.
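In Python, assuming $K$ arms, that state can be as small as two arrays initialised before the first round:

```python
counts = [0] * K    # number of times each arm was pulled (N_k)
means = [0.0] * K   # empirical mean of each arm (estimate of mu_k)
```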
Also, you can implement the algorithm without keeping in memory all the rewards received up to round $t$, by using this little trick:
$$\hat{\mu}_k(t+1) = \begin{cases} \hat{\mu}_k(t) & \text{if $A_{t+1} \ne k$} \\ \frac{1}{N_k(t) + 1} (N_k(t) \hat{\mu}_k(t) + X_{t+1}) & \text{otherwise} \end{cases}$$
Indeed, if you suppose that $A_{t+1} = k$, you have:
\begin{align}
\hat{\mu}_k(t+1) &:= \frac{1}{N_k(t+1)} \sum\limits_{s=1}^{t+1} X_s \mathbb{I}_{A_s = k} \\
&= \frac{1}{N_k(t) + 1} \left(\sum\limits_{s=1}^{t} X_s \mathbb{I}_{A_s = k} + X_{t+1}\right) \quad \text{by hypothesis} \\
&= \frac{1}{N_k(t) + 1} (N_k(t) \hat{\mu}_k(t) + X_{t+1})
\end{align}
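Putting everything together, here is a minimal self-contained sketch of the whole algorithm (the environment function `pull` and the Bernoulli toy example at the end are placeholders of mine, not something from your setup):

```python
import math
import random

def ucb(pull, K, n):
    """Run UCB for n rounds on a K-armed bandit given by pull(k) -> reward."""
    counts = [0] * K     # N_k: number of pulls of each arm
    means = [0.0] * K    # empirical mean of each arm
    total_reward = 0.0
    for _ in range(n):
        # pick the arm with the largest upper confidence bound
        # (an arm never pulled has an infinite bound, so it is chosen first)
        arm = max(range(K), key=lambda k: math.inf if counts[k] == 0
                  else means[k] + 2.0 * math.sqrt(math.log(n) / counts[k]))
        reward = pull(arm)
        # incremental update of the empirical mean (the little trick above)
        means[arm] = (counts[arm] * means[arm] + reward) / (counts[arm] + 1)
        counts[arm] += 1
        total_reward += reward
    return total_reward, means, counts

# Toy example with three Bernoulli arms of unknown means 0.3, 0.5, 0.7:
true_means = [0.3, 0.5, 0.7]
pull = lambda k: 1.0 if random.random() < true_means[k] else 0.0
print(ucb(pull, K=3, n=10000))
```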