Questions tagged [reinforcement-learning]
A set of dynamic strategies by which an algorithm can learn the structure of an environment online by adaptively taking actions associated with different rewards so as to maximize the rewards earned.
1,083 questions
1
vote
0
answers
44
views
Running statistics standardization in reinforcement learning
So I'm training a DDPG agent on a 6-axis arm robot to move an object from A to B. The inputs are the coordinates of the joints along with the coordinates of the object that needs to be moved.
So, I'm kinda ...
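For context, a minimal sketch of the kind of running standardization usually meant here, using Welford's online mean/variance update; the RunningNorm class and the fake observations are my own illustration, not the asker's DDPG code:

```python
import numpy as np

class RunningNorm:
    """Welford-style running mean/variance for standardizing observations online."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # fold one new observation into the running statistics
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

norm = RunningNorm(shape=(3,))
for obs in np.random.randn(1000, 3) * 5.0 + 2.0:   # fake 3-dimensional observations
    norm.update(obs)
print(norm.normalize(np.array([2.0, 2.0, 2.0])))    # roughly centred near 0
```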
0
votes
0
answers
65
views
Why is one-hot encoding used in RL instead of binary encoding?
Basically, the question above: in RL, people typically encode the state as a tensor consisting of a plane with "channels", e.g. the original AlphaZero paper. These channels are typically one-...
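For context, a minimal NumPy sketch of the one-hot "plane per piece type" encoding the excerpt alludes to; the function name and the toy 2x2 board are illustrative, not taken from the AlphaZero paper:

```python
import numpy as np

def one_hot_planes(board, n_types):
    """Encode an integer-labelled board of shape (H, W) as n_types binary planes
    of shape (n_types, H, W): plane k is 1 exactly where the board holds type k."""
    return (np.arange(n_types)[:, None, None] == board[None, :, :]).astype(np.float32)

board = np.array([[0, 1],
                  [2, 1]])
print(one_hot_planes(board, n_types=3).shape)  # (3, 2, 2)
```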
1
vote
0
answers
32
views
PPO-like approach but with search algorithm
I'm developing an AI for a 1v1 game. I have already programmed a system for generating these rewards.
Currently, I have some heuristics and am using linear weights tuned with a genetic algorithm to ...
0
votes
0
answers
26
views
Optimal policy under RL as inference framework
In the RL-as-inference framework (Levine, 2018), applied to language modeling, in particular seq2seq modeling, we care about learning a policy $p_\theta(y \mid x)$, where $x$ is the input sequence ...
1
vote
0
answers
50
views
Optimal hybrid search for LLM reasoning
Given a finite computational budget $C$, suppose we want to learn an optimal policy $\Pi$ for large language models that dynamically decides between two inference-time reasoning strategies: deepening ...
2
votes
1
answer
209
views
Why does PPO loss clip advantage to be more negative?
I'm trying to understand the effect of clipping the policy loss in PPO.
PPO maximizes:
$$\mathbb{E}[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\,,$$
where $r_t(\...
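A small NumPy sketch of the clipped surrogate, showing why the min() keeps the more negative term when the advantage is negative; the function name and toy numbers are my own:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate (to be maximized).
    ratio = pi_new(a|s) / pi_old(a|s), advantage = estimated A_t."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With A_t < 0, the min() keeps whichever term is MORE negative:
ratios = np.array([0.5, 1.0, 2.0])
print(ppo_clip_objective(ratios, np.full(3, -1.0)))  # [-0.8 -1.  -2. ]
```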
0
votes
0
answers
75
views
How to make my RL model actually learn?
I hope this is the right place to ask this question, but here goes. I'm building a PPO agent to play Ticket to Ride (Northern Lights edition) for fun, but my loss plots aren't showing a lot of promise ...
6
votes
1
answer
287
views
Problem in derivation of Bellman Equation
I was watching the Stanford CS229 lectures by Andrew Ng (exact timestamp link), and while trying to derive the Bellman equation for a policy $\pi$, I ran into a conceptual doubt.
Here's the ...
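For reference, the standard form of the Bellman expectation equation for a policy $\pi$ (written in Sutton & Barto-style notation rather than the lecture's) is:

$$v_\pi(s) \;=\; \sum_a \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr].$$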
1
vote
1
answer
70
views
Credible evaluation of trained model needs far more episodes than heuristic policy
While training and evaluating a network model on an air-combat environment with the PPO RL algorithm, I was surprised to find that credible evaluation of the trained model needs far more episodes than ...
0
votes
0
answers
42
views
Categorical Distribution Likelihood with Weighted Dirichlet Priors
I am doing a project where I want to estimate the likelihood of an agent's mixed strategy for a current state given a list of previous state action pairs. I want to find a prior distribution on the ...
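For context, a minimal Dirichlet-categorical sketch, assuming (my assumption, not necessarily the asker's setup) that past actions in a state are i.i.d. draws from the agent's mixed strategy:

```python
import numpy as np

alpha_prior = np.ones(3)             # symmetric Dirichlet prior over 3 actions
counts = np.array([5, 1, 0])         # observed action counts for the state in question
alpha_post = alpha_prior + counts    # conjugacy: the posterior is Dirichlet again

# posterior-predictive probability of each action under the agent's mixed strategy
print(alpha_post / alpha_post.sum())
```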
0
votes
0
answers
42
views
PPO with GNN Actor-Critic Ignores Optimal Action Sequence with Delayed Reward
I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space ...
4
votes
1
answer
202
views
Action Independent Transition Probability in Reinforcement Learning
I am doing a finance-related project, where we take the 'market' into account, represented by covariance matrices and economic indicators.
As market participants are price takers, as we cannot ...
1
vote
0
answers
107
views
How can I select a representative subset of starting states to efficiently estimate RL agent performance?
I am evaluating the performance of a reinforcement learning (RL) agent trained on a custom environment using DQN (based on Gym). The current evaluation process involves running the agent on the same ...
1
vote
1
answer
103
views
Bayesian learning, sampling normal distributions with unknown mean and variance, how to estimate our confidence in the new mean?
I'm looking at this notebook:
https://github.com/WhatIThinkAbout/BabyRobot/blob/master/Multi_Armed_Bandits/Part%205b%20-%20Thompson%20Sampling%20using%20Conjugate%20Priors.ipynb
It describes methods ...
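A minimal NumPy sketch of Thompson sampling with a Normal-Gamma conjugate prior for unknown mean and variance; the single-observation update formulas are standard, but the function names and toy rewards are my own rather than the notebook's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(mu0, kappa, alpha, beta):
    """Draw (mu, sigma^2) from a Normal-Gamma posterior over an arm's reward distribution."""
    tau = rng.gamma(shape=alpha, scale=1.0 / beta)    # precision ~ Gamma(alpha, rate=beta)
    sigma2 = 1.0 / tau
    mu = rng.normal(mu0, np.sqrt(sigma2 / kappa))     # mean ~ N(mu0, sigma^2 / kappa)
    return mu, sigma2

def update(mu0, kappa, alpha, beta, x):
    """Conjugate update of the Normal-Gamma parameters after observing one reward x."""
    mu0_new = (kappa * mu0 + x) / (kappa + 1)
    beta_new = beta + kappa * (x - mu0) ** 2 / (2 * (kappa + 1))
    return mu0_new, kappa + 1, alpha + 0.5, beta_new

params = (0.0, 1.0, 1.0, 1.0)
for x in rng.normal(3.0, 2.0, size=200):              # fake rewards from one arm
    params = update(*params, x)
print(sample_posterior(*params))                      # sampled mean should be near 3
```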
0
votes
0
answers
60
views
RL algorithm fails to converge for a simple trading bot
I am implementing a reinforcement learning (RL) trading bot using a custom Gym environment. For simplicity, I assume I know the future prices of an asset. Here is how my setup works:
Position: ...
2
votes
1
answer
142
views
Off Policy TD(0) Derivation
In Sutton and Barto's book on RL, it says
The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value:
$$\mathbf{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi (s).\tag{5.4}$$
The book ...
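A small Monte Carlo check of what the ratio does, on a toy one-step example of my own (not from the book): sampling under the behaviour policy and reweighting by $\rho$ recovers the target policy's value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: target policy pi, behaviour policy b, deterministic rewards.
pi = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])
rewards = np.array([1.0, 0.0])

v_pi = float(pi @ rewards)                       # true value under the target policy

actions = rng.choice(2, size=100_000, p=b)       # data generated by the behaviour policy
rho = pi[actions] / b[actions]                   # per-sample importance ratio
print(v_pi, np.mean(rho * rewards[actions]))     # the reweighted average matches v_pi
```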
1
vote
1
answer
86
views
Policy Improvement Theorem usage issues
On pages 101-102 of Sutton and Barto's book on RL (2018 edition), where it is desired to prove that $\epsilon$-greedy is an improvement over $\epsilon$-soft policies, it is written:
Thus, by the policy ...
0
votes
0
answers
70
views
Number of runs needed for Probability of Improvement metric in Deep RL
I'm working with the Probability of Improvement (POI) metric described in [1], Section 4.3.
The paper introduces various aggregate metrics in Section 4.3, and for most of these metrics (IQM, mean, ...
0
votes
0
answers
28
views
Statistical Testing with Minimal Samples for Reinforcement Learning Algorithms
I'm working on comparing two reinforcement learning algorithms where:
Running experiments is extremely computationally expensive
Based on preliminary results, Algorithm B consistently and
...
3
votes
0
answers
95
views
Reward and Penalty Design in reinforcement learning
I hope you're all doing well.
I am currently working on a reinforcement learning problem to solve an optimization problem in wireless networks, and I'm having trouble designing the reward and ...
4
votes
1
answer
157
views
Deriving REINFORCE Algorithm with state-action marginals
(Related question is here: Deriving 'State-Action marginal' in Reinforcement Learning)
The lecture of CS 285 (Berkeley) https://www.youtube.com/watch?v=GKoKNYaBvM0&list=...
1
vote
1
answer
58
views
action-value function in terms of state value function
I am reading Sutton & Barto's book. I am stuck at exercise 3.13. The question is to write $q_\pi$ in terms of $v_\pi$ and $p(s',r \mid s,a)$. I traced these steps:
$q_\pi(s,a) = \sum_g g \text{ Pr}\{G_t=g|S_t=s, A_t=a\}$...
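For comparison, the identity this exercise is driving toward is usually written by conditioning on the next state and reward (this is the standard textbook form, not the asker's derivation):

$$q_\pi(s,a) \;=\; \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr].$$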
2
votes
1
answer
140
views
If Expected SARSA is off-policy, and SARSA is just an MC estimate of Expected SARSA, why is it on-policy?
So, expected SARSA defines the update as:
$$
Q(s,a) = Q(s,a) +\alpha (R+ \mathbb{E}_{a\sim\pi(s')}[Q(s', a)] - Q(s,a))
$$
whereas SARSA, with $a'\sim\pi(s')$, defines the update as:
$$
Q(s,a) = Q(s,a) +\alpha (...
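For reference, the two updates side by side as a tabular sketch; Q and pi are assumed to be arrays indexed by state and action, which is my framing rather than the asker's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sarsa_update(Q, s, a, r, s_next, pi, alpha, gamma):
    """SARSA: bootstrap on a single a' sampled from the current policy."""
    a_next = rng.choice(Q.shape[1], p=pi[s_next])
    return Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, pi, alpha, gamma):
    """Expected SARSA: replace the sampled a' with the expectation over a' under the policy."""
    return Q[s, a] + alpha * (r + gamma * float(pi[s_next] @ Q[s_next]) - Q[s, a])
```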
1
vote
0
answers
37
views
Why do all these adaptive methods in neural network training require a $g_t^2$ term?
All the adaptive learning methods, AdaGrad, AdaDelta, RMSprop, Adam, and later variants, require $g_t^2$, which is the gradient multiplied by itself elementwise.
Why is this needed? I ...
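A minimal RMSprop-style sketch of where the elementwise $g_t^2$ goes: it feeds a running estimate of per-coordinate gradient magnitude that rescales the step. Generic illustration, not any particular library's implementation:

```python
import numpy as np

def rmsprop_step(w, g, v, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop-style step: v tracks an EMA of g**2, which scales the step per coordinate."""
    v = rho * v + (1 - rho) * g**2          # running estimate of squared-gradient magnitude
    w = w - lr * g / (np.sqrt(v) + eps)     # coordinates with large recent gradients take smaller steps
    return w, v
```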
1
vote
1
answer
90
views
Adding a baseline parameter in the derivation of the Gradient Bandit Algorithm
In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$, and I can't seem to figure ...
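For reference, the step that usually resolves this: assuming the baseline $B_t$ does not depend on the action, it drops out of the expected gradient because the action probabilities sum to one:

$$\mathbb{E}\!\left[B_t \,\frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)}\right]
= \sum_b \pi_t(b)\, B_t\, \frac{\partial \pi_t(b)/\partial H_t(a)}{\pi_t(b)}
= B_t\,\frac{\partial}{\partial H_t(a)} \sum_b \pi_t(b)
= B_t\,\frac{\partial\, 1}{\partial H_t(a)} = 0.$$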
2
votes
1
answer
157
views
What's the relation between Generalized Policy Iteration (GPI), Actor-Critic, and Q-learning methods?
It seems to me that Generalized Policy Iteration (GPI) and Actor-Critic are the same, and Q-learning methods are a separate family of algorithms. I think both GPI and Actor-Critic describe the ...
2
votes
1
answer
78
views
Is the objective function in policy gradient methods exactly the expected value function?
I was reading Spinning Up in DRL. I was wondering if the objective in policy gradient algorithms, the $J_\theta$, is exactly the expected value function $E_{S_0}[V^\pi(S_0)]$. I've never seen people ...
3
votes
1
answer
142
views
How do I develop prediction intervals for Reinforcement Learning?
I recently learned about the concept of prediction intervals (for regression) and I would like to apply them to my Deep Reinforcement Learning algorithm. I am working with a Model-Free RL algorithm ...
1
vote
0
answers
40
views
Bafflement on lemma 2 in TRPO paper
In the original paper, the proof of $|\bar{A}(s)|\leq 2\alpha\max_{s,a}|A_\pi(s,a)|$, where $\bar{A}(s)=\mathbb{E}_{\tilde{a}\sim\tilde{\pi}}[A_{\pi}(s,\tilde{a})]$, goes like this:
$$\begin{equation}\begin{...
0
votes
1
answer
883
views
How to compute Upper Confidence Bound Properly In Multiarmed Bandit Problem
I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've ...
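A minimal UCB1 sketch (my own function names and toy counts, not the asker's code): each arm's index is its empirical mean plus an exploration bonus, and unpulled arms are forced to be tried first.

```python
import numpy as np

def ucb1_scores(counts, sums, t, c=2.0):
    """UCB1 index per arm: empirical mean + exploration bonus sqrt(c * ln t / n).
    Arms never pulled get an infinite score so they are tried first."""
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    means = np.divide(sums, counts, out=np.zeros_like(counts), where=counts > 0)
    bonus = np.sqrt(c * np.log(t) / np.maximum(counts, 1.0))
    return np.where(counts == 0, np.inf, means + bonus)

# after t = 10 pulls: arm 1 has the better mean, but unexplored arm 2 is chosen next
print(np.argmax(ucb1_scores(counts=[5, 5, 0], sums=[2.0, 4.0, 0.0], t=10)))  # 2
```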
1
vote
0
answers
118
views
Reward function definition in MRP/MDP, reinforcement learning different notations
I started teaching myself reinforcement learning a few weeks ago. These days I've encountered a problem with the definition of the reward function.
The reward function defines and quantifies the ...
2
votes
3
answers
174
views
Are reinforcement learning and deep learning equivalent?
Can a deep learning classifier, trained on a dataset derived from a reinforcement learning (RL) agent's interactions with an environment, achieve the same performance as the RL agent itself? Assuming ...
2
votes
1
answer
182
views
What's the loss that is optimized in InstructGPT RL stage?
In the InstructGPT paper, they define the objective of the RL stage as:
They try to maximize this objective using PPO.
I have trouble understanding how they plug this objective into PPO, though. Do they ...
1
vote
2
answers
160
views
In this RL problem, why is the substitution $q_*(A_t)=\mathbb{E}[R_t | A_t] \to R_t $ valid within this expectation (over actions)?
The question that follows is from a machine learning textbook (Reinforcement Learning, Sutton and Barto, page 39, link).
Given:
a probability distribution over actions $x$ (a policy) at time $t$ ...
1
vote
1
answer
129
views
What studied statistical model (if any) fits this application?
I'm having trouble identifying what statistical model or methodology is suited for my application.
My situation is as follows:
I want to create a stock trading agent that trades a single stock-cash ...
1
vote
0
answers
44
views
Scalable unordered category encoders
I am trying to design a neural network for a scalable target assignment problem and use RL to train it by reward feedback. My major concern is making the neural network somehow adaptable to different ...
1
vote
0
answers
136
views
Batches in policy gradient methods – theory vs practice
I am currently trying to understand the implementation of batching in policy gradient / actor-critic methods. My understanding is that these methods in principle work as follows: collect a batch of $N$...
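A sketch of the textbook version of the batch, under the assumption that each batch element is one full trajectory (in practice, implementations often mix timesteps from truncated rollouts instead); the helper below is illustrative only:

```python
import numpy as np

def batch_policy_gradient(grad_logp, returns):
    """Monte Carlo policy-gradient estimate from a batch of N full trajectories.
    grad_logp : array (N, n_params), sum over t of grad log pi(a_t|s_t) per trajectory
    returns   : array (N,), the return of each trajectory"""
    grad_logp = np.asarray(grad_logp)
    returns = np.asarray(returns)
    return (grad_logp * returns[:, None]).mean(axis=0)
```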
5
votes
1
answer
200
views
Is reinforcement learning conceptually equivalent to time-series with a latent dependent variable?
In reinforcement learning, there is a state $s_t$, an action $a_t$, and a policy $\pi(a|s)$ that maps states to the Probability Distribution Function (PDF) of actions. The goal is to choose the ...
2
votes
1
answer
135
views
Do Bernoulli bandits need a different treatment if the rewards are sparse?
I have a problem where, effectively, my slot machines have very low payout probability (on the order of 1% for the "best" slot machines) and my goal is to minimize the number of actions to ...
0
votes
0
answers
57
views
Double Q-learning
Can we expect the two Q tables to converge together, meaning that abs(Q1-Q2).max() converges to zero? Can we say that?
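For context, a tabular double Q-learning sketch (my own illustration, not the asker's code): each table is updated toward the other's evaluation of its own greedy action, so the two tables need not be equal at any finite time, although under the usual tabular convergence conditions both are driven toward the same fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular double Q-learning step: with probability 1/2 update Q1 using Q2's
    evaluation of Q1's greedy action, otherwise the symmetric update for Q2."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```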
2
votes
1
answer
571
views
Why do SAC and TD3 use multiple critic networks as opposed to single network with multiple outputs?
Q-function approximators based on neural networks tend to overestimate the Q-function. Accordingly, reinforcement learning algorithms such as Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) use ...
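For context, the piece TD3 and SAC share is the clipped double-Q bootstrap target sketched below (my own minimal version, not library code); one common argument for two separate networks rather than one network with two heads is that fully separate parameters keep the two estimates' errors less correlated, which is what the min() relies on.

```python
import numpy as np

def twin_critic_target(r, not_done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q bootstrap target: take the elementwise minimum of the two
    target critics' estimates before discounting, to damp overestimation."""
    return r + gamma * not_done * np.minimum(q1_next, q2_next)

# toy batch of 3 transitions
print(twin_critic_target(np.array([1.0, 0.0, 1.0]),
                         np.array([1.0, 1.0, 0.0]),
                         np.array([2.0, 1.5, 3.0]),
                         np.array([1.8, 2.0, 2.5])))
```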
0
votes
0
answers
77
views
How is this possible? Test loss is below train loss
I got this graph for my loss.
As you can see, the distance between the two curves is quite large!
Can we say it shows the bias is large and it's underfitting?
Is what I just said true or not?...
0
votes
0
answers
31
views
Can MDP or reinforcement learning method be used on an And-Or graph?
I want to perform a pathfinding task in a graph using RL. I consider each node of the graph as a state, but there are and/or relations between these states. So I'm wondering if I can still use the ...
1
vote
0
answers
72
views
Convergence of the SARSA algorithm
I'm trying to figure out the convergence of the SARSA algorithm, but I need help. In the article "On the Convergence of Stochastic Iterative Dynamic Programming" by Jaakkola, Jordan, and ...
1
vote
1
answer
209
views
Extending Bernoulli thompson sampling for slate bandit problems to the contextual setting
I am trying to implement the extension to Marginal Posterior Sampling for Slate Bandits, which is a context-free slate bandit algorithm that uses Thompson sampling with a Bernoulli prior.
I want to ...
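For context, the context-free Beta-Bernoulli Thompson sampling loop that the extension would build on; the arm count, horizon, and true success probabilities below are toy values of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.ones(K)                       # Beta(1, 1) prior per arm
beta = np.ones(K)
true_p = np.array([0.2, 0.5, 0.7])       # unknown to the agent

for _ in range(2000):
    theta = rng.beta(alpha, beta)        # one posterior sample per arm
    a = int(np.argmax(theta))            # play the arm with the largest sampled mean
    reward = rng.random() < true_p[a]
    alpha[a] += reward                   # conjugate Beta update
    beta[a] += 1 - reward

print(alpha / (alpha + beta))            # posterior means; arm 2 should dominate the pulls
```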
0
votes
1
answer
288
views
Thompson Sampling with Two objectives - Cost and Success Rate
I have implemented a Thompson sampling algorithm with beta distribution that chooses between two processors to process the payments for each transaction such that it maximizes the success rate. For ...
1
vote
1
answer
446
views
Understanding the regret bound of stochastic bandit vs. adversarial bandit
I am a beginner at MAB. One thing that puzzles me these days:
The regret of the UCB policy (and Thompson Sampling with no prior) for stochastic bandit is $\sqrt{KT\ln T}$, but the regret of the EXP3 ...
2
votes
1
answer
455
views
Design an algorithm to improve the hangman game for letter prediction [closed]
I'm working on an algorithm which is permitted to use a training set of approximately 250,000 dictionary words.
I have built, and am providing here, a basic, working algorithm. This algorithm will ...
2
votes
2
answers
151
views
Trying to reproduce proof of Bandit Gradient Algorithm as SGD
I'm trying to make sense of the "Bandit Gradient Algorithm as Stochastic Gradient Ascent" proof in Sutton and Barto's intro to RL textbook. I'm stuck on the line
$E[(q_*(A_t)-B_t)\frac{\...