Questions tagged [multiarmed-bandit]
A problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation.
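As a minimal illustration of the setting (the arm probabilities below are made up), a Bernoulli bandit environment can be sketched as follows; the learner only observes the reward of the arm it actually pulls, which is what forces the exploration/exploitation trade-off:

```python
import random

class BernoulliBandit:
    """Toy k-armed Bernoulli bandit; the true means are hidden from the learner."""

    def __init__(self, means):
        self.means = means          # e.g. [0.2, 0.5, 0.35] -- unknown to the learner
        self.k = len(means)

    def pull(self, arm):
        """Return reward 1 with the arm's success probability, else 0."""
        return 1 if random.random() < self.means[arm] else 0

env = BernoulliBandit([0.2, 0.5, 0.35])
reward = env.pull(1)                # the learner sees only this realised reward
```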
187 questions
0 votes · 0 answers · 58 views
Binary Multi-armed bandit with unbounded, time-varying, choice-independent noise
I have $k = |K|$ arms with unknown distribution $\nu_\alpha$ over $[0,1]$ and unknown mean $\mu_\alpha \in [0,1]$ where $\alpha \in K$.
The action $A_t \in \{1, \dots, K\}$ is chosen at time $t$ ...
1 vote · 0 answers · 51 views
Why is regret defined relative to the best fixed arm in hindsight rather than the best arm at each round in adversarial bandits?
In Chapter 11 (page 148) of Bandit Algorithms by Lattimore and Szepesvári, the expected regret for a policy $\pi$ in a $k$-armed adversarial bandit is defined as the expected difference between ...
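For context, the definition being asked about measures performance against the single best arm in hindsight. For an oblivious adversary choosing rewards $x_{t,i} \in [0,1]$, it is commonly written (up to notational differences from the book) as
$$
R_n = \max_{i \in [k]} \sum_{t=1}^{n} x_{t,i} - \mathbb{E}\Bigg[\sum_{t=1}^{n} x_{t,A_t}\Bigg],
$$
with the expectation over the learner's randomisation; comparing to the best arm at each round would instead replace the first term by $\sum_{t=1}^{n} \max_{i} x_{t,i}$, a much stronger benchmark.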
2 votes · 1 answer · 163 views
How to statistically determine the best summoner spell pairing in League of Legends?
In League of Legends, players can choose two summoner spells per game. These spells affect the winrate of the champion being played. With thousands of matches played every day, not all spell ...
1 vote · 1 answer · 103 views
Bayesian learning, sampling normal distributions with unknown mean and variance: how to estimate our confidence in the new mean?
I'm looking at this notebook:
https://github.com/WhatIThinkAbout/BabyRobot/blob/master/Multi_Armed_Bandits/Part%205b%20-%20Thompson%20Sampling%20using%20Conjugate%20Priors.ipynb
It describes methods ...
0 votes · 0 answers · 42 views
Bias-reduced estimator for play-the-winner clinical trial
I am trying to make a bias-reduced effect size estimator for a play-the-winner clinical trial.
The problem is this: Suppose we have outcomes from two experimental arms in phase 2, let's call them mA ...
0 votes · 0 answers · 47 views
Lower bound for stochastic bandits with short horizons
I want to show that if the horizon $n$ is strictly less than the number of arms $k$, then every algorithm incurs a regret of at least
$$
\frac{n(2k-n-1)}{2k}
$$
Now, Lattimore and Szepesvári start from ...
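One standard construction that yields exactly this bound (a sketch, which may differ from the book's argument): let a uniformly random arm pay reward 1 deterministically and all other arms pay 0. The best any algorithm can do is try unseen arms without repetition and stick once the good arm is found, so the arm played at round $t$ is suboptimal with probability at least $1 - t/k$, giving
$$
R_n \ge \sum_{t=1}^{n}\Big(1 - \frac{t}{k}\Big) = n - \frac{n(n+1)}{2k} = \frac{n(2k-n-1)}{2k}.
$$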
3 votes · 1 answer · 203 views
Exact regret of $\epsilon$-greedy algorithm for $k$-armed bandit
The $\epsilon$-greedy algorithm for the $k$-armed bandit tosses a coin with success probability $\epsilon$ at each round and does the following:
If not successful, it chooses the best arm so far, and
if ...
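For reference, a minimal sketch of that scheme (one common variant; the question's exact tie-breaking and exploration rule may differ, and `pull(arm)` is an assumed callable returning the arm's stochastic reward):

```python
import random

def epsilon_greedy(k, n_rounds, epsilon, pull):
    """Sketch of epsilon-greedy for a k-armed bandit."""
    counts = [0] * k            # pulls per arm
    means = [0.0] * k           # empirical mean reward per arm
    total = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:        # coin toss successful: explore
            arm = random.randrange(k)
        else:                                # otherwise exploit the best arm so far
            arm = max(range(k), key=lambda a: means[a])
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total += r
    return total
```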
1 vote · 1 answer · 90 views
Adding a baseline parameter in the derivation of the Gradient Bandit Algorithm
In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$, and I can't seem to figure ...
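For readers without the book open, the update under discussion adjusts action preferences $H_t(a)$ as
$$
H_{t+1}(a) = H_t(a) + \alpha\,(R_t - B_t)\big(\mathbb{1}\{a = A_t\} - \pi_t(a)\big),
$$
where the book instantiates the baseline $B_t$ as the average reward $\bar{R}_t$. Any baseline that does not depend on $A_t$ leaves the expected update unchanged, because $\mathbb{E}_{A_t \sim \pi_t}\big[\mathbb{1}\{a = A_t\} - \pi_t(a)\big] = \pi_t(a) - \pi_t(a) = 0$; the baseline only affects the variance of the update.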
3 votes · 1 answer · 197 views
Multi-armed bandit with 2 coins: What strategy maximises reward?
What strategy maximises the total reward, on average, after $n$ trials, in this multi-armed bandit:
two coins A and B, with probability of success $p_A$ and $p_B$
reward is $1$ on success, $0$ on ...
0 votes · 1 answer · 883 views
How to compute the Upper Confidence Bound properly in the Multi-armed Bandit Problem
I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've ...
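For comparison, here is a minimal UCB1-style sketch (the constant inside the exploration bonus varies between references; this uses the common $\sqrt{2\ln t / n_a}$ form, and `pull(arm)` is an assumed callable returning a reward in $[0, 1]$):

```python
import math

def ucb1(k, n_rounds, pull):
    """Sketch of UCB1 for a k-armed bandit with rewards in [0, 1]."""
    counts = [0] * k
    means = [0.0] * k
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                      # play every arm once first
        else:
            # empirical mean plus an exploration bonus that shrinks with counts[a]
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total
```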
2 votes · 1 answer · 135 views
Do Bernoulli bandits need a different treatment if the rewards are sparse?
I have a problem where, effectively, my slot machines have very low payout probability (on the order of 1% for the "best" slot machines) and my goal is to minimize the number of actions to ...
1 vote · 1 answer · 209 views
Extending Bernoulli thompson sampling for slate bandit problems to the contextual setting
I am trying to implement the extension to Marginal Posterior Sampling for Slate Bandits, which is a context-free slate bandit algorithm that uses Thompson sampling with a Bernoulli prior.
I want to ...
1 vote · 0 answers · 52 views
Equivalent Formulations of Thompson Sampling
I am studying Chapter 36 (Thompson Sampling) of the book Bandit Algorithms by Lattimore and Szepesvári. The authors present two equivalent formulations of Thompson Sampling on page 460, and I am having ...
0 votes · 1 answer · 288 views
Thompson Sampling with Two objectives - Cost and Success Rate
I have implemented a Thompson sampling algorithm with beta distribution that chooses between two processors to process the payments for each transaction such that it maximizes the success rate. For ...
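For reference, a minimal Beta-Bernoulli Thompson sampling sketch for the success-rate objective alone (handling the cost objective as well would need a second model or a scalarised reward; `process(arm)` is an assumed callable returning 1 on success and 0 on failure):

```python
import random

def thompson_bernoulli(k, n_rounds, process):
    """Beta-Bernoulli Thompson sampling sketch: each arm keeps a
    Beta(successes + 1, failures + 1) posterior over its success rate."""
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_rounds):
        # draw one success-rate sample per arm from its posterior,
        # then route the transaction to the arm with the largest sample
        samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        if process(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```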
1 vote · 1 answer · 446 views
Understanding the regret bound of stochastic bandit vs. adversarial bandit
I am a beginner at MAB. One thing that puzzles me these days:
The regret of the UCB policy (and Thompson Sampling with no prior) for stochastic bandit is $\sqrt{KT\ln T}$, but the regret of the EXP3 ...
2 votes · 2 answers · 151 views
Trying to reproduce proof of Bandit Gradient Algorithm as SGD
I'm trying to make sense of the "The Bandit Gradient Algorithm as Stochastic Gradient Ascent" proof in Sutton and Barto's intro to RL textbook. I'm stuck on the line
$E[(q_*(A_t)-B_t)\frac{\...
2 votes · 1 answer · 1k views
Difference between Bayesian optimization and multi-armed bandit optimization
What are the differences between Bayesian optimization and multi-armed bandit optimization? Are the problems equivalent when multi-armed bandit's action space is infinite?
2 votes · 1 answer · 133 views
Multi-armed bandit with max instead of mean
Is there a term or name (or better yet, strategies) for the following problem?
Take a 'standard' $k$-armed multi-armed bandit problem (stochastic real rewards, IID pulls for a given arm), but instead ...
0 votes · 1 answer · 70 views
Context vector with norm 1
Very often in the literature authors state something like: "We consider a contextual linear bandit problem where at each round t, the learner receives a context vector $x_t \in R^d$ with norm 1" ...
0 votes · 1 answer · 130 views
Bandit learning with biased and unbiased data
I have an online experimentation setup with incoming customers split into 3 groups:
Random (all arms are applied equally) 20%
Model-based (an existing, optimal strategy is run) 40%
MAB (Multi-armed ...
0 votes · 1 answer · 126 views
Big-O of the Upper Bound on the Regret of Exp3
I'm having difficulty understanding how to compute Big-O for the upper bound on the regret in Exp3 algorithm.
I think the actual algorithm isn't quite important for my question but since I couldn't ...
2 votes · 0 answers · 70 views
Estimating probability of superiority in an A/B test with multinomial outcomes
Let's say I have an A/B (/C etc.) test, where the outcome of each trial is drawn from a multinomial distribution with unknown frequencies. Each possible outcome value $x_i$ has a specified utility, $...
1 vote · 1 answer · 460 views
Difference between Epoch-greedy and Epsilon-Greedy algorithm for contextual bandits
I am trying to compare Epoch-Greedy in Langford & Zhang's paper and the epsilon-greedy approach for contextual bandits as in Chen et al., 2020. My question is: are these the same algorithms? -- ...
1 vote · 1 answer · 164 views
Minimum sampling for maximising the prediction accuracy [closed]
Suppose that I'm training a machine learning model to predict people's age from a picture of their faces. Let's say that I have a dataset of people from 1-year-olds to 100-year-olds. But I want to choose ...
1 vote · 0 answers · 70 views
How to solve this type of multi-task Bayesian optimization problem?
Let us consider a collection of local Bayesian optimization tasks, each of which employs a Gaussian Process model to find its local optimum (i.e. the global optimum of that task). The goal is to design a ...
1 vote · 0 answers · 132 views
Data Imbalance in Contextual Bandit with Thompson Sampling
I'm working with the Online Logistic Regression Algorithm (Algorithm 3) of Chapelle and Li in their paper, "An Empirical Evaluation of Thompson Sampling" (https://papers.nips.cc/paper/2011/...
5 votes · 1 answer · 886 views
How do you find the best arm in a multi-armed bandit when exploitation is unimportant?
I have a problem similar to the 'Bernoulli bandit' problem in the exploration-exploitation paradigm, but without the exploitation element.
In particular, I have many levers that I can pull and each ...
2 votes · 1 answer · 571 views
Understanding percentage of optimal action in Reinforcement Learning
I'm new to reinforcement learning and currently reading Sutton & Barto's book "Reinforcement Learning: An Introduction". In Chapter 2, they compare greedy and non-greedy methods on 10-armed ...
2 votes · 1 answer · 93 views
Strategy when introducing a new arm
Let's say we have a bandit with two arms, and we know that one arm has a reward probability 0.5 and the other is unknown. How do we create a strategy to maximise the reward?
2 votes · 1 answer · 580 views
Learning payoffs from variable number of armed bandits
Does there exist a technique, such that while computing the returns of multi-armed bandits, we have the possibility of introducing an extra bandit? If the number of bandits was fixed, we could ...
0 votes · 1 answer · 60 views
Best method for grouping rows with a Multi-Armed Bandit
I have a dataframe; here is a sample:
...
4 votes · 1 answer · 393 views
Multi-armed bandit algorithm for finding the best-performing bandit in the fewest trials
I'm wondering if there's an algorithm that minimizes the expected posterior loss for the best performing bandit where regret is calculated as the number of trials to achieve a threshold for posterior ...
2 votes · 1 answer · 166 views
Multi-armed bandit - how does the gambler choose what's the best strategy?
In the multi-armed bandit problem, I would like to clarify exactly what happens from time step $t=1$ in the context of the epsilon greedy strategy for $\epsilon=0$ and $0<\epsilon \leq 1$. By what ...
3 votes · 1 answer · 843 views
Difference between regret and pseudo-regret definitions
I am following the book Bandit Algorithms. On page 48, they introduce the regret after $n$ rounds as
$$
\mathbf{R} = n\mu^\star - \mathbb{E}\Bigg[\sum_{t=1}^n \mathbf{X}_t\Bigg] \tag{1}
$$
On page 55, ...
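For orientation, the distinction usually drawn (the book's exact naming may differ) is between the expected regret $\mathbf{R}$ in (1) and the random pseudo-regret built from arm means rather than realised rewards,
$$
\bar{R}_n = \sum_{t=1}^{n}\big(\mu^\star - \mu_{A_t}\big) = \sum_{i=1}^{k} \Delta_i\, T_i(n),
$$
where $\Delta_i = \mu^\star - \mu_i$ and $T_i(n)$ counts the pulls of arm $i$; in the stochastic setting both quantities have the same expectation, namely $\mathbf{R}$.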
1 vote · 1 answer · 149 views
In reinforcement learning/multi-armed bandits, why do we look at expected reward and not the most likely reward? [duplicate]
This is the dilemma that I have faced in applied probability in general. Say you have the choice to put your savings of $\$10$ in a deposit account with a guaranteed return of $\$100$ or buy a lottery ...
1 vote · 0 answers · 166 views
Why do linear bandits use ridge regression to estimate parameters?
I’m implementing an adaptive experimental design where arms are assigned according to the posterior probability that they are the best arm. I’ve noticed in several articles that people use ridge ...
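A brief note on the usual motivation: regularised least squares stays well defined even before the design matrix has full rank, which matters in the first rounds of a bandit when observations are scarce. In the common notation (not specific to any one article),
$$
\hat{\theta}_t = \big(X_t^\top X_t + \lambda I\big)^{-1} X_t^\top y_t,
$$
where $X_t$ stacks the observed contexts and $y_t$ the rewards; with $\lambda > 0$ the matrix $X_t^\top X_t + \lambda I$ is always invertible, and the same matrix defines the confidence ellipsoids used by LinUCB/OFUL-style algorithms.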
4 votes · 0 answers · 572 views
Thompson sampling when the reward is not simply one
I am trying to implement a simple simulation of Thompson sampling for pricing, inspired by Python code from here. Another very similar/related post can be found here.
The idea is that I have different ...
1 vote · 0 answers · 175 views
Can Thompson sampling be used for better results in a 1-player MCTS?
I made a Monte Carlo tree search (MCTS) algorithm for the travelling salesman problem inspired by this paper which uses UCB1.
When I was digging to see where the UCB1 formula comes from, I read ...
2 votes · 1 answer · 152 views
Does Thompson sampling for price optimisation require discriminative pricing?
I get the gist of Thompson sampling for price optimisation (I think - see this video around minute 31). I wonder, would Thompson sampling require discriminative pricing, or can prices be changed ...
1 vote · 0 answers · 72 views
Binomial riddle [closed]
I have a riddle that I cannot solve:
I'm a recruiter searching for the best basketball player in a town. There are 100 candidates in the town. 99 of them have a probability of making a basket of 0.501, ...
0 votes · 1 answer · 302 views
Risk-averse multi-armed bandits
We want to pose one problem as a multi-armed bandit setting. The issue is that some of the arms are very risky with potentially undesirable effects (or not). Is there a way to do a risk-aware ...
4 votes · 1 answer · 289 views
Bandit-like setup but taking max reward over sequential choices
Similar to my other question Bandit-like setup but taking max reward over multiple heads?, I'm interested in situations like the Multi-Armed Bandit setup, except where the reward is aggregated a ...
3 votes · 1 answer · 302 views
Bandit-like setting with maximum reward over multiple arms?
If I have a process where I can evaluate one of a number of options per 'round', with variable reward, and I want to maximise reward over time, the multi-armed bandit literature has lots of useful ...
4 votes · 2 answers · 2k views
Real-World, Operationalized Applications of Multi-Arm Bandits
Multi-armed bandits are wonderful and have lots of potential applications. However, I don't know many companies or real-world practitioners who have implemented bandit algorithms.
What are some ...
1 vote · 0 answers · 39 views
Nonstationary and stationary problem [closed]
What is the difference between these formulas? I am confused about the difference between them. From what I understand, the first equation is for the stationary situation, while the second ...
0 votes · 1 answer · 115 views
Multi-armed Bandits
Could someone explain the notation of this function to me? I understand that we take the average of the sum of the rewards for some particular action; however, the notation seems strange to me for ...
6 votes · 0 answers · 342 views
Confidence Interval for least squares estimator
There was a paper by Yasin Abbasi-Yadkori (https://arxiv.org/pdf/1102.2670.pdf) titled Online Least Squares Estimation with Self-Normalized Processes. I am trying to give a brief context before asking ...
2 votes · 1 answer · 168 views
Batches of Bayesian updates for a Gaussian with unknown variance differ from computation with all data
I'm working on a project where I continuously (in batches) update the pdf estimate for a normally distributed event. My variance is unknown, so I'm using the equations given in section 4.1.2 of ...
1 vote · 0 answers · 27 views
Finding the winner of a competition with a given minimum probability, by giving a method that can carry out each game of the competition
I came across the following problem:
Consider a competition in which a game is played between two participants. There are $n$ participants in total. Let $p_{ij}$ represent the probability that participant $i$ will beat ...
5 votes · 1 answer · 253 views
How many samples are needed to distinguish the means of two distributions in multi-armed bandits?
In a paper on Multi Armed Bandits, I came across the following statement:
This generalizes the well-known fact that one needs of order $\frac{1}{\Delta^2}$ samples to differentiate the means of two ...
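A quick way to see where the $1/\Delta^2$ scaling comes from (a standard concentration sketch, not necessarily the paper's argument): with $n$ independent samples of a 1-sub-Gaussian arm,
$$
\Pr\big(|\hat{\mu}_n - \mu| \ge \Delta/2\big) \le 2\exp\!\big(-n\Delta^2/8\big),
$$
so pushing this below a fixed confidence level $\delta$ requires $n \ge \frac{8}{\Delta^2}\ln\frac{2}{\delta}$ samples, i.e. of order $1/\Delta^2$; that fewer samples cannot suffice follows from standard KL-divergence-based lower bounds.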