
Questions tagged [multiarmed-bandit]

A problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation.

0 votes
0 answers
58 views

I have $k = |K|$ arms with unknown distribution $\nu_\alpha$ over $[0,1]$ and unknown mean $\mu_\alpha \in [0,1]$ where $\alpha \in K$. The action $A_t \in \{1, \dots, K\}$ is chosen at time $t$ ...
worldsmithhelper
1 vote
0 answers
51 views

In Chapter 11 (page 148) of Bandit Algorithms by Lattimore and Szepesvári, the expected regret for a policy $\pi$ in a $k$-armed adversarial bandit is defined as the expected difference between ...
entropy07
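For reference, the definition being asked about is presumably the adversarial regret against the best fixed arm in hindsight, stated here in standard notation that may differ slightly from the book's:

$$ R_n(\pi, x) = \max_{i \in [k]} \sum_{t=1}^{n} x_{ti} - \mathbb{E}\Bigg[\sum_{t=1}^{n} x_{t A_t}\Bigg], $$

where $x \in [0,1]^{n \times k}$ is the adversary's reward table and $A_t$ is the arm played at round $t$.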
2 votes
1 answer
163 views

In League of Legends, players can choose two summoner spells per game. These spells affect the winrate of the champion being played. With thousands of matches played every day, not all spell ...
Slei • 23
1 vote
1 answer
103 views

I'm looking at this notebook: https://github.com/WhatIThinkAbout/BabyRobot/blob/master/Multi_Armed_Bandits/Part%205b%20-%20Thompson%20Sampling%20using%20Conjugate%20Priors.ipynb It describes methods ...
Florin Andrei
0 votes
0 answers
42 views

I am trying to make a bias-reduced effect size estimator for a play-the-winner clinical trial. The problem is this: Suppose we have outcomes from two experimental arms in phase 2, let's call them mA ...
Helene Hoegsbro Thygesen
0 votes
0 answers
47 views

I want to show that if the horizon $n$ is strictly less than the number of arms $k$ then every algorithm enjoys a regret of at least $$ \frac{n(2k-n-1)}{2k} $$ Now, Lattimore and Szepesvári start from ...
Navid Rashidian
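A sketch of the usual construction behind this bound (assuming, as is standard, one arm chosen uniformly at random pays $1$ deterministically while the rest pay $0$): all feedback is $0$ until the good arm is found, so no policy does better in expectation than trying unseen arms one by one. If the good arm is the $j$-th distinct arm tried with $j \le n$, the total reward is $n - j + 1$, and each position is hit with probability at most $1/k$, hence

$$ \mathbb{E}[\text{reward}] \le \frac{1}{k}\sum_{j=1}^{n}(n - j + 1) = \frac{n(n+1)}{2k}, \qquad R_n \ge n - \frac{n(n+1)}{2k} = \frac{n(2k - n - 1)}{2k}. $$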
3 votes
1 answer
203 views

The $\epsilon$-greedy algorithm for the $k$-armed bandit tosses a coin with success probability $\epsilon$ at each round and does the following: if not successful, it chooses the best arm so far, and if ...
Navid Rashidian
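A minimal Python sketch of the rule described in this question, assuming the successful coin toss triggers a uniformly random exploratory pull (the part cut off in the excerpt); the reward function and parameters are illustrative:

```python
import random

def epsilon_greedy(k, n_rounds, epsilon, pull):
    """epsilon-greedy: with probability epsilon pull a uniformly random arm,
    otherwise pull the arm with the best empirical mean so far."""
    counts = [0] * k           # pulls per arm
    means = [0.0] * k          # empirical mean reward per arm
    total = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(k)                     # explore
        else:
            arm = max(range(k), key=lambda a: means[a])   # exploit
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]      # incremental mean
        total += r
    return total

# Example: two Bernoulli arms with hypothetical success probabilities 0.4 and 0.6.
print(epsilon_greedy(2, 1000, 0.1, lambda a: float(random.random() < (0.4, 0.6)[a])))
```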
1 vote
1 answer
90 views

In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$ and I can't seem to figure ...
Rafay Khan
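For context, the key property of the baseline (a standard identity, stated here in notation that may differ from the book's): as long as $B_t$ does not depend on the action $A_t$, it contributes nothing to the expected update, because the action probabilities sum to one:

$$ \mathbb{E}_{A_t \sim \pi_t}\Bigg[ B_t \, \frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)} \Bigg] = B_t \sum_{x} \pi_t(x)\,\frac{\partial \pi_t(x)/\partial H_t(a)}{\pi_t(x)} = B_t\,\frac{\partial}{\partial H_t(a)} \sum_{x}\pi_t(x) = 0 . $$

Subtracting a baseline such as the average reward $\bar R_t$ therefore changes the variance of the gradient estimate but not its expectation.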
3 votes
1 answer
197 views

What strategy maximises the total reward, on average, after $n$ trials, in this multi-armed bandit: two coins A and B, with probabilities of success $p_A$ and $p_B$; the reward is $1$ on success, $0$ on ...
elemolotiv • 1,250
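One way to make "maximises the total reward, on average" precise is Bayesian: place independent Beta priors on $p_A, p_B$ and compute the Bayes-optimal policy by backward induction over posterior counts. A small Python sketch under that assumption (the uniform Beta(1,1) prior and the horizon are illustrative choices):

```python
from functools import lru_cache

def bayes_optimal_value(n, prior=(1, 1, 1, 1)):
    """Expected total reward of the Bayes-optimal policy for a 2-armed
    Bernoulli bandit over n trials, with Beta(a,b) / Beta(c,d) priors,
    computed by backward induction over posterior success/failure counts."""
    a, b, c, d = prior

    @lru_cache(maxsize=None)
    def value(sa, fa, sb, fb):
        t = sa + fa + sb + fb
        if t == n:
            return 0.0
        pa = (a + sa) / (a + b + sa + fa)   # posterior mean of coin A
        pb = (c + sb) / (c + d + sb + fb)   # posterior mean of coin B
        va = pa * (1 + value(sa + 1, fa, sb, fb)) + (1 - pa) * value(sa, fa + 1, sb, fb)
        vb = pb * (1 + value(sa, fa, sb + 1, fb)) + (1 - pb) * value(sa, fa, sb, fb + 1)
        return max(va, vb)                  # play whichever coin has the larger value

    return value(0, 0, 0, 0)

print(bayes_optimal_value(10))  # optimal expected reward over 10 trials, uniform prior
```

The state space grows polynomially in $n$, so this exact approach only scales to moderate horizons; index or sampling heuristics are the usual substitutes beyond that.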
0 votes
1 answer
883 views

I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've ...
mehruddin
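For comparison, a minimal sketch of the classic UCB1 index (Auer et al.), which may differ in constants from the variant being implemented in the question; the Bernoulli arms below are illustrative:

```python
import math
import random

def ucb1(k, n_rounds, pull):
    """UCB1: pull each arm once, then always pull the arm maximising
    empirical mean + sqrt(2 * ln(t) / pulls)."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1   # initialisation round: try every arm once
        else:
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means, counts

# Example with three hypothetical Bernoulli arms.
probs = [0.2, 0.5, 0.7]
print(ucb1(3, 2000, lambda a: float(random.random() < probs[a])))
```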
2 votes
1 answer
135 views

I have a problem where, effectively, my slot machines have very low payout probability (on the order of 1% for the "best" slot machines) and my goal is to minimize the number of actions to ...
Alexander Soare
1 vote
1 answer
209 views

I am trying to implement the extension to Marginal Posterior Sampling for Slate Bandits, which is a context-free slate bandit algorithm that uses Thompson sampling with a Bernoulli prior. I want to ...
Lucidnonsense
1 vote
0 answers
52 views

I am studying Chapter 36 (Thompson Sampling) of the book Bandit Algorithms by Lattimore and Szepesvári. The authors present two equivalent formulations of Thompson Sampling on page 460, and I am having ...
Extrava • 123
0 votes
1 answer
288 views

I have implemented a Thompson sampling algorithm with beta distribution that chooses between two processors to process the payments for each transaction such that it maximizes the success rate. For ...
Aayush Gupta
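For reference, a minimal Beta-Bernoulli Thompson sampling loop for choosing between two processors (the success probabilities below are made-up placeholders, not taken from the question):

```python
import random

alpha = [1, 1]                 # Beta(1,1) priors: successes + 1
beta = [1, 1]                  # failures + 1
true_rates = [0.90, 0.93]      # hypothetical per-processor success rates

for _ in range(5000):
    # Sample a plausible success rate for each processor and pick the larger one.
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    i = samples.index(max(samples))
    success = random.random() < true_rates[i]
    alpha[i] += success        # posterior update on success
    beta[i] += 1 - success     # posterior update on failure

print(alpha, beta)             # counts concentrate on the better processor
```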
1 vote
1 answer
446 views

I am a beginner at MAB. One thing that puzzles me these days: The regret of the UCB policy (and Thompson Sampling with no prior) for stochastic bandit is $\sqrt{KT\ln T}$, but the regret of the EXP3 ...
zxzx179 • 93
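For reference, the two worst-case bounds being compared, as usually stated (constants and exact logarithmic factors vary by source):

$$ R_T^{\text{UCB}} = O\big(\sqrt{KT \ln T}\big) \ \text{(stochastic rewards)}, \qquad R_T^{\text{Exp3}} = O\big(\sqrt{KT \ln K}\big) \ \text{(adversarial rewards)} . $$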
2 votes
2 answers
151 views

I'm trying to make sense of the "The Bandit Gradient Algorithm as Stochastic Gradient Ascent" proof in Sutton and Barto's intro to RL textbook. I'm stuck on the line $E[(q_*(A_t)-B_t)\frac{\...
fyzr • 21
2 votes
1 answer
1k views

What are the differences between Bayesian optimization and multi-armed bandit optimization? Are the problems equivalent when multi-armed bandit's action space is infinite?
noob • 2,620
2 votes
1 answer
133 views

Is there a term or name (or better yet, strategies) for the following problem? Take a 'standard' $k$-armed multi-armed bandit problem (stochastic real rewards, IID pulls for a given arm), but instead ...
TLW • 303
0 votes
1 answer
70 views

Very often in the literature authors state something like: "We consider a contextual linear bandit problem where at each round t, the learner receives a context vector $x_t \in R^d$ with norm 1 ...
amarchin • 223
0 votes
1 answer
130 views

I have an online experimentation setup with incoming customers split into 3 groups: Random (all arms are applied equally) 20%; Model-based (an existing, optimal strategy is run) 40%; MAB (Multi-armed ...
Tuan Minh Nguyen Hoang
0 votes
1 answer
126 views

I'm having difficulty understanding how to compute Big-O for the upper bound on the regret in Exp3 algorithm. I think the actual algorithm isn't quite important for my question but since I couldn't ...
Rowing0914
2 votes
0 answers
70 views

Let's say I have an A/B (/C etc.) test, where the outcome of each trial is drawn from a multinomial distribution with unknown frequencies. Each possible outcome value $x_i$ has a specified utility, $...
user1502040
1 vote
1 answer
460 views

I am trying to compare Epoch Greedy in Langford & Zhang's paper and the epsilon-greedy approach for contextual bandits as in Chen et al., 2020. My question is: are these the same algorithms? ...
user111092
1 vote
1 answer
164 views

Suppose that I'm training a machine learning model to predict people's age from a picture of their faces. Let's say that I have a dataset of people from 1 year olds to 100 year olds. But I want to choose ...
noone • 73
1 vote
0 answers
70 views

Let us consider a collection of local Bayesian optimization tasks, each of which employs a Gaussian Process model to find the local optimum (i.e. global optimum of that task). The goal is to design a ...
Shaun Han • 183
1 vote
0 answers
132 views

I'm working with the Online Logistic Regression Algorithm (Algorithm 3) of Chapelle and Li in their paper, "An Empirical Evaluation of Thompson Sampling" (https://papers.nips.cc/paper/2011/...
MABQ • 11
5 votes
1 answer
886 views

I have a problem similar to the 'Bernoulli bandit' problem in the exploration-exploitation paradigm, but without the exploitation element. In particular, I have many levers that I can pull and each ...
Oscar Cunningham
2 votes
1 answer
571 views

I'm new to reinforcement learning and currently reading Sutton & Barto's book "Reinforcement Learning: An Introduction". In Chapter 2, they compare greedy and non-greedy methods on 10-armed ...
xabush • 151
2 votes
1 answer
93 views

Let's say we have a bandit with two arms, and we know that one arm has a reward probability 0.5 and the other is unknown. How do we create a strategy to maximise the reward?
Zuz • 21
2 votes
1 answer
580 views

Does there exist a technique, such that while computing the returns of multi-armed bandits, we have the possibility of introducing an extra bandit? If the number of bandits was fixed, we could ...
desert_ranger
0 votes
1 answer
60 views

I have a dataframe; here is a sample: ...
user17241 • 249
4 votes
1 answer
393 views

I'm wondering if there's an algorithm that minimizes the expected posterior loss for the best performing bandit where regret is calculated as the number of trials to achieve a threshold for posterior ...
mihagazvoda
2 votes
1 answer
166 views

In the multi-armed bandit problem, I would like to clarify exactly what happens from time step $t=1$ in the context of the epsilon greedy strategy for $\epsilon=0$ and $0<\epsilon \leq 1$. By what ...
Slim Shady
3 votes
1 answer
843 views

I am following the book Bandit Algorithms. On page 48, they introduce the regret after $n$ rounds as $$ \mathbf{R} = n\mu^\star - \mathbb{E}\Bigg[\sum_{t=1}^n \mathbf{X}_t\Bigg] \tag{1} $$ On page 55, ...
Shew • 297
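Presumably the page-55 expression is the regret decomposition; writing $\Delta_i = \mu^\star - \mu_i$ and letting $T_i(n)$ be the number of plays of arm $i$, the two forms agree because $\mathbb{E}[\mathbf{X}_t \mid A_t] = \mu_{A_t}$:

$$ \mathbf{R} = n\mu^\star - \mathbb{E}\Bigg[\sum_{t=1}^n \mathbf{X}_t\Bigg] = \sum_{t=1}^{n} \mathbb{E}\big[\mu^\star - \mu_{A_t}\big] = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}\big[T_i(n)\big] . $$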
1 vote
1 answer
149 views

This is the dilemma that I have faced in applied probability in general. Say you have the choice to put your savings of $\$10$ in a deposit account with a guaranteed return of $\$100$ or buy a lottery ...
Abhay Gupta
1 vote
0 answers
166 views

I’m implementing an adaptive experimental design where arms are assigned according to the posterior probability that they are the best arm. I’ve noticed in several articles that people use ridge ...
Yrv88 • 11
4 votes
0 answers
572 views

I am trying to implement a simple simulation of Thompson sampling for pricing inspired by Python code from here. Another very similar/related post can be found here. The idea is that I have different ...
cs0815 • 2,255
1 vote
0 answers
175 views

I made a Monte Carlo tree search (MCTS) algorithm for the travelling salesman problem, inspired by this paper, which uses UCB1. When I was digging to see where the UCB1 formula comes from, I read ...
Butanium • 111
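A common sketch of where the UCB1 bonus comes from (independent of the MCTS paper referenced): Hoeffding's inequality for the mean $\hat\mu_s$ of $s$ i.i.d. rewards in $[0,1]$ gives

$$ \Pr\big(\hat\mu_s \ge \mu + \varepsilon\big) \le e^{-2 s \varepsilon^2}, $$

and choosing $\varepsilon = \sqrt{2\ln t / s}$ makes this probability $t^{-4}$, so the index $\hat\mu_s + \sqrt{2\ln t / s}$ is an upper confidence bound on the true mean that fails only with polynomially small probability in the round count $t$.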
2 votes
1 answer
152 views

I get the gist of Thompson sampling for price optimisation (I think - see this video around minute 31). I wonder, would Thompson sampling require discriminative pricing, or can prices be changed ...
cs0815 • 2,255
1 vote
0 answers
72 views

I have a riddle that I cannot solve: I'm a recruiter searching for the best basketball player in a town. There are 100 candidates in the town. 99 of them have a probability of making a basket of 0.501, ...
wanttoknow
0 votes
1 answer
302 views

We want to pose one problem as a multi-armed bandit setting. The issue is that some of the arms are very risky with potentially undesirable effects (or not). Is there a way to do a risk-aware ...
d56 • 101
4 votes
1 answer
289 views

Similar to my other question Bandit-like setup but taking max reward over multiple heads?, I'm interested in situations like the Multi-Armed Bandit setup, except where the reward is aggregated a ...
Oly • 180
3 votes
1 answer
302 views

If I have a process where I can evaluate one of a number of options per 'round', with variable reward, and I want to maximise reward over time, the multi-armed bandit literature has lots of useful ...
Oly • 180
4 votes
2 answers
2k views

Multi-armed bandits are wonderful and have lots of potential applications. However, I don't know many companies or real-world practitioners who have implemented bandit algorithms. What are some ...
ABC • 499
1 vote
0 answers
39 views

What is the difference between these formulas? I am confused about the difference between them. From what I understand, the first equation is for the stationary situation, while the second ...
Mohammed AL-Nashriy
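If these are the usual action-value update rules from Sutton & Barto (an assumption, since the formulas themselves are not visible in the excerpt), the contrast is a sample-average step size versus a constant one:

$$ Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big) \qquad \text{vs.} \qquad Q_{n+1} = Q_n + \alpha\big(R_n - Q_n\big), \quad \alpha \in (0, 1] . $$

The first weights all past rewards equally (appropriate when the arm means are stationary); the second is an exponential recency-weighted average that keeps adapting when they drift.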
0 votes
1 answer
115 views

Could someone explain to me the notation of this function? I understand that we take the average of the sum of rewards for some particular action; however, the notation seems strange to me for ...
Mohammed AL-Nashriy
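The function in question is presumably the sample-average action-value estimate from Sutton & Barto, Chapter 2 (stated here as an assumption):

$$ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbb{1}\{A_i = a\}} , $$

i.e. the average of the rewards obtained on exactly those steps where action $a$ was taken, with the indicator $\mathbb{1}\{A_i = a\}$ doing the selection.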
6 votes
0 answers
342 views

There is a paper by Yasin Abbasi-Yadkori (https://arxiv.org/pdf/1102.2670.pdf) titled Online Least Squares Estimation with Self-Normalized Processes. I am trying to give a brief context before asking ...
rostader • 183
2 votes
1 answer
168 views

I'm working on a project where I continuously (in batches) update the pdf estimate for a normally distributed event. My variance is unknown, so I'm using the equations given in section 4.1.2 of ...
jcp • 551
1 vote
0 answers
27 views

I came across the following problem: Consider a competition in which a game is played between two participants. There are $n$ participants in total. Let $p_{ij}$ represent the probability that participant $i$ will beat ...
Rnj • 225
5 votes
1 answer
253 views

In a paper on Multi Armed Bandits, I came across the following statement: This generalizes the well-known fact that one needs of order $\frac{1}{\Delta^2}$ samples to differentiate the means of two ...
D. B. • 59
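A sketch of that well-known fact: for rewards in $[0,1]$, Hoeffding's inequality gives

$$ \Pr\big(|\hat\mu_n - \mu| \ge \tfrac{\Delta}{2}\big) \le 2 e^{-n\Delta^2/2}, $$

so $n$ of order $1/\Delta^2$ samples per arm suffice to separate two means that differ by $\Delta$ with constant confidence, and information-theoretic lower bounds show this order is also necessary up to constants.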