Questions tagged [reinforcement-learning]
A set of dynamic strategies by which an algorithm can learn the structure of an environment online by adaptively taking actions associated with different rewards so as to maximize the rewards earned.
1,083 questions
1
vote
0
answers
44
views
Running statistics standardization in reinforcement learning
So I'm training a DDPG agent on a 6-axis arm robot to move an object from A to B. The inputs are the coordinates of the joints along with the coordinates of the object that needs to be moved.
So, I'm kinda ...
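For context, a minimal sketch of the kind of running standardization usually meant here, using Welford's online mean/variance update; the RunningNorm class and the fake observations are my own illustration, not the asker's DDPG code:

```python
import numpy as np

class RunningNorm:
    """Welford-style running mean/variance for standardizing observations online."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # fold one new observation into the running statistics
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

norm = RunningNorm(shape=(3,))
for obs in np.random.randn(1000, 3) * 5.0 + 2.0:   # fake 3-dimensional observations
    norm.update(obs)
print(norm.normalize(np.array([2.0, 2.0, 2.0])))    # roughly centred near 0
```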
0
votes
0
answers
65
views
Why is one-hot encoding used in RL instead of binary encoding?
Basically, the question above: in RL, people typically encode the state as a tensor consisting of a plane with "channels", e.g. the original AlphaZero paper. These channels are typically one-...
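For context, a minimal NumPy sketch of the one-hot "plane per piece type" encoding the excerpt alludes to; the function name and the toy 2x2 board are illustrative, not taken from the AlphaZero paper:

```python
import numpy as np

def one_hot_planes(board, n_types):
    """Encode an integer-labelled board of shape (H, W) as n_types binary planes
    of shape (n_types, H, W): plane k is 1 exactly where the board holds type k."""
    return (np.arange(n_types)[:, None, None] == board[None, :, :]).astype(np.float32)

board = np.array([[0, 1],
                  [2, 1]])
print(one_hot_planes(board, n_types=3).shape)  # (3, 2, 2)
```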
1
vote
0
answers
32
views
PPO-like approach but with search algorithm
I'm developing an AI for a 1v1 game. I have already programmed a system for generating these rewards.
Currently, I have some heuristics and am using linear weights tuned with a genetic algorithm to ...
0
votes
0
answers
26
views
Optimal policy under RL as inference framework
In the RL-as-inference framework (Levine, 2018), applied to language modeling, in particular seq2seq modeling, we care about learning a policy $p_\theta(y \mid x)$, where $x$ is the input sequence ...
1
vote
0
answers
50
views
Optimal hybrid search for LLM reasoning
Given a finite computational budget $C$, suppose we want to learn an optimal policy $\Pi$ for large language models that dynamically decides between two inference-time reasoning strategies: deepening ...
2
votes
1
answer
209
views
Why does PPO loss clip advantage to be more negative?
I'm trying to understand the effect of clipping the policy loss in PPO.
PPO maximizes:
$$\mathbb{E}[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\,,$$
where $r_t(\...
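A small NumPy sketch of the clipped surrogate, showing why the min() keeps the more negative term when the advantage is negative; the function name and toy numbers are my own:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate (to be maximized).
    ratio = pi_new(a|s) / pi_old(a|s), advantage = estimated A_t."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With A_t < 0, the min() keeps whichever term is MORE negative:
ratios = np.array([0.5, 1.0, 2.0])
print(ppo_clip_objective(ratios, np.full(3, -1.0)))  # [-0.8 -1.  -2. ]
```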
0
votes
0
answers
75
views
How to make my RL model actually learn?
I hope this is the right place to ask this question, but here goes. I'm building a PPO agent to play Ticket to Ride (Northern Lights edition) for fun, but my loss plots aren't showing a lot of promise ...
6
votes
1
answer
287
views
Problem in derivation of Bellman Equation
I was watching the Stanford CS229 lectures by Andrew Ng (exact timestamp link), and while trying to derive the Bellman equation for a policy $\pi$, I ran into a conceptual doubt.
Here's the ...
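For reference, the standard form of the Bellman expectation equation for a policy $\pi$ (written in Sutton & Barto-style notation rather than the lecture's) is:

$$v_\pi(s) \;=\; \sum_a \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr].$$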
1
vote
1
answer
70
views
Credible evaluation of trained model needs far more episodes than heuristic policy
While training and evaluating a network model on an air-combat environment with the PPO RL algorithm, I was surprised to find that credible evaluation of the trained model needs far more episodes than ...
0
votes
0
answers
42
views
Categorical Distribution Likelihood with Weighted Dirichlet Priors
I am doing a project where I want to estimate the likelihood of an agent's mixed strategy for a current state given a list of previous state action pairs. I want to find a prior distribution on the ...
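For context, a minimal Dirichlet-categorical sketch, assuming (my assumption, not necessarily the asker's setup) that past actions in a state are i.i.d. draws from the agent's mixed strategy:

```python
import numpy as np

alpha_prior = np.ones(3)             # symmetric Dirichlet prior over 3 actions
counts = np.array([5, 1, 0])         # observed action counts for the state in question
alpha_post = alpha_prior + counts    # conjugacy: the posterior is Dirichlet again

# posterior-predictive probability of each action under the agent's mixed strategy
print(alpha_post / alpha_post.sum())
```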
0
votes
0
answers
42
views
PPO with GNN Actor-Critic Ignores Optimal Action Sequence with Delayed Reward
I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space ...
4
votes
1
answer
202
views
Action Independent Transition Probability in Reinforcement Learning
I am doing a finance-related project, where we take the 'market' into account, represented by covariance matrices and economic indicators.
As market participants are price takers, as we cannot ...
1
vote
0
answers
107
views
How can I select a representative subset of starting states to efficiently estimate RL agent performance?
I am evaluating the performance of a reinforcement learning (RL) agent trained on a custom environment using DQN (based on Gym). The current evaluation process involves running the agent on the same ...
1
vote
1
answer
103
views
Bayesian learning, sampling normal distributions with unknown mean and variance, how to estimate our confidence in the new mean?
I'm looking at this notebook:
https://github.com/WhatIThinkAbout/BabyRobot/blob/master/Multi_Armed_Bandits/Part%205b%20-%20Thompson%20Sampling%20using%20Conjugate%20Priors.ipynb
It describes methods ...
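A minimal NumPy sketch of Thompson sampling with a Normal-Gamma conjugate prior for unknown mean and variance; the single-observation update formulas are standard, but the function names and toy rewards are my own rather than the notebook's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(mu0, kappa, alpha, beta):
    """Draw (mu, sigma^2) from a Normal-Gamma posterior over an arm's reward distribution."""
    tau = rng.gamma(shape=alpha, scale=1.0 / beta)    # precision ~ Gamma(alpha, rate=beta)
    sigma2 = 1.0 / tau
    mu = rng.normal(mu0, np.sqrt(sigma2 / kappa))     # mean ~ N(mu0, sigma^2 / kappa)
    return mu, sigma2

def update(mu0, kappa, alpha, beta, x):
    """Conjugate update of the Normal-Gamma parameters after observing one reward x."""
    mu0_new = (kappa * mu0 + x) / (kappa + 1)
    beta_new = beta + kappa * (x - mu0) ** 2 / (2 * (kappa + 1))
    return mu0_new, kappa + 1, alpha + 0.5, beta_new

params = (0.0, 1.0, 1.0, 1.0)
for x in rng.normal(3.0, 2.0, size=200):              # fake rewards from one arm
    params = update(*params, x)
print(sample_posterior(*params))                      # sampled mean should be near 3
```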
0
votes
0
answers
60
views
RL algorithm fails to converge for a simple trading bot
I am implementing a reinforcement learning (RL) trading bot using a custom Gym environment. For simplicity, I assume I know the future prices of an asset. Here is how my setup works:
Position: ...
2
votes
1
answer
142
views
Off Policy TD(0) Derivation
In Sutton and Barto's book on RL, it says
The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value:
$$\mathbf{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi (s).\tag{5.4}$$
The book ...
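A small Monte Carlo check of what the ratio does, on a toy one-step example of my own (not from the book): sampling under the behaviour policy and reweighting by $\rho$ recovers the target policy's value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: target policy pi, behaviour policy b, deterministic rewards.
pi = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])
rewards = np.array([1.0, 0.0])

v_pi = float(pi @ rewards)                       # true value under the target policy

actions = rng.choice(2, size=100_000, p=b)       # data generated by the behaviour policy
rho = pi[actions] / b[actions]                   # per-sample importance ratio
print(v_pi, np.mean(rho * rewards[actions]))     # the reweighted average matches v_pi
```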
1
vote
1
answer
86
views
Policy Improvement Theorem usage issues
On pages 101-102 of Sutton and Barto's book on RL (2018 edition), where it is desired to prove that $\epsilon$-greedy is an improvement over $\epsilon$-soft policies, it is written:
Thus, by the policy ...
0
votes
0
answers
70
views
Number of runs needed for Probability of Improvement metric in Deep RL
I'm working with the Probability of Improvement (POI) metric described in [1], Section 4.3.
The paper introduces various aggregate metrics in Section 4.3, and for most of these metrics (IQM, mean, ...
0
votes
0
answers
28
views
Statistical Testing with Minimal Samples for Reinforcement Learning Algorithms
I'm working on comparing two reinforcement learning algorithms where:
Running experiments is extremely computationally expensive
Based on preliminary results, Algorithm B consistently and
...
3
votes
0
answers
95
views
Reward and Penalty Design in reinforcement learning
I hope you're all doing well.
I am currently working on a reinforcement learning problem to solve an optimization problem in wireless networks, and I'm having trouble designing the reward and ...
4
votes
1
answer
157
views
Deriving REINFORCE Algorithm with state-action marginals
(Related question is here: Deriving 'State-Action marginal' in Reinforcement Learning)
The lecture of CS 285 (Berkeley) https://www.youtube.com/watch?v=GKoKNYaBvM0&list=...
1
vote
1
answer
58
views
action-value function in terms of state value function
I am reading Sutton & Barto's book. I am stuck at exercise 3.13. The question is to write $q_\pi$ in terms of $v_\pi$ and $p(s',r \mid s,a)$. I traced these steps:
$q_\pi(s,a) = \sum_g g \text{ Pr}\{G_t=g|S_t=s, A_t=a\}$...
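For comparison, the identity this exercise is driving toward is usually written by conditioning on the next state and reward (this is the standard textbook form, not the asker's derivation):

$$q_\pi(s,a) \;=\; \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr].$$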
2
votes
1
answer
140
views
If Expected SARSA is off-policy, and SARSA is just an MC estimate of Expected SARSA, why is it on-policy?
So, expected SARSA defines the update as:
$$
Q(s,a) = Q(s,a) +\alpha (R+ \mathbb{E}_{a\sim\pi(s')}[Q(s', a)] - Q(s,a))
$$
whereas SARSA, with $a'\sim\pi(s')$, defines the update as:
$$
Q(s,a) = Q(s,a) +\alpha (...
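For reference, the two updates side by side as a tabular sketch; Q and pi are assumed to be arrays indexed by state and action, which is my framing rather than the asker's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sarsa_update(Q, s, a, r, s_next, pi, alpha, gamma):
    """SARSA: bootstrap on a single a' sampled from the current policy."""
    a_next = rng.choice(Q.shape[1], p=pi[s_next])
    return Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, pi, alpha, gamma):
    """Expected SARSA: replace the sampled a' with the expectation over a' under the policy."""
    return Q[s, a] + alpha * (r + gamma * float(pi[s_next] @ Q[s_next]) - Q[s, a])
```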
1
vote
0
answers
37
views
Why do all these adaptive methods in neural network training require a $g_t^2$ term?
All the adaptive learning methods, AdaGrad, AdaDelta, RMSprop, Adam, and later variants, require $g_t^2$, which is the gradient multiplied by itself elementwise.
Why is this needed? I ...
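A minimal RMSprop-style sketch of where the elementwise $g_t^2$ goes: it feeds a running estimate of per-coordinate gradient magnitude that rescales the step. Generic illustration, not any particular library's implementation:

```python
import numpy as np

def rmsprop_step(w, g, v, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop-style step: v tracks an EMA of g**2, which scales the step per coordinate."""
    v = rho * v + (1 - rho) * g**2          # running estimate of squared-gradient magnitude
    w = w - lr * g / (np.sqrt(v) + eps)     # coordinates with large recent gradients take smaller steps
    return w, v
```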
1
vote
1
answer
90
views
Adding a baseline parameter in the derivation of the Gradient Bandit Algorithm
In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$, and I can't seem to figure ...
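For reference, the step that usually resolves this: assuming the baseline $B_t$ does not depend on the action, it drops out of the expected gradient because the action probabilities sum to one:

$$\mathbb{E}\!\left[B_t \,\frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)}\right]
= \sum_b \pi_t(b)\, B_t\, \frac{\partial \pi_t(b)/\partial H_t(a)}{\pi_t(b)}
= B_t\,\frac{\partial}{\partial H_t(a)} \sum_b \pi_t(b)
= B_t\,\frac{\partial\, 1}{\partial H_t(a)} = 0.$$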
2
votes
1
answer
157
views
What's the relation between Generalized Policy Iteration (GPI), Actor-Critic, and Q-learning methods?
It seems to me that Generalized Policy Iteration (GPI) and Actor-Critic are the same, and Q-learning methods are a separate family of algorithms. I think both GPI and Actor-Critic describe the ...
2
votes
1
answer
78
views
Is the objective function in policy gradient methods exactly the expected value function?
I was reading Spinning Up in DRL. I was wondering if the objective in policy gradient algorithms, the $J_\theta$, is exactly the expected value function $E_{S_0}[V^\pi(S_0)]$. I've never seen people ...
3
votes
1
answer
142
views
How do I develop prediction intervals for Reinforcement Learning?
I recently learned about the concept of prediction intervals (for regression) and I would like to apply them to my Deep Reinforcement Learning algorithm. I am working with a Model-Free RL algorithm ...
1
vote
0
answers
40
views
Bafflement on lemma 2 in TRPO paper
In the original paper, the proof of $|\bar{A}(s)|\leq 2\alpha\max_{s,a}|A_\pi(s,a)|$, where $\bar{A}(s)=\mathbb{E}_{\tilde{a}\sim\tilde{\pi}}[A_{\pi}(s,\tilde{a})]$, goes like this:
$$\begin{equation}\begin{...
0
votes
1
answer
883
views
How to compute Upper Confidence Bound Properly In Multiarmed Bandit Problem
I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've ...
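A minimal UCB1 sketch (my own function names and toy counts, not the asker's code): each arm's index is its empirical mean plus an exploration bonus, and unpulled arms are forced to be tried first.

```python
import numpy as np

def ucb1_scores(counts, sums, t, c=2.0):
    """UCB1 index per arm: empirical mean + exploration bonus sqrt(c * ln t / n).
    Arms never pulled get an infinite score so they are tried first."""
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    means = np.divide(sums, counts, out=np.zeros_like(counts), where=counts > 0)
    bonus = np.sqrt(c * np.log(t) / np.maximum(counts, 1.0))
    return np.where(counts == 0, np.inf, means + bonus)

# after t = 10 pulls: arm 1 has the better mean, but unexplored arm 2 is chosen next
print(np.argmax(ucb1_scores(counts=[5, 5, 0], sums=[2.0, 4.0, 0.0], t=10)))  # 2
```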
1
vote
0
answers
118
views
Reward function definition in MRP/MDP, reinforcement learning different notations
I started teaching myself reinforcement learning a few weeks ago. These days I've encountered a problem with the definition of the reward function.
The reward function defines and quantifies the ...
2
votes
3
answers
174
views
Are reinforcement learning and deep learning equivalent?
Can a deep learning classifier, trained on a dataset derived from a reinforcement learning (RL) agent's interactions with an environment, achieve the same performance as the RL agent itself? Assuming ...
2
votes
1
answer
182
views
What's the loss that is optimized in InstructGPT RL stage?
In the InstructGPT paper, they define the objective of the RL stage as:
They try to maximize this objective using PPO.
I have trouble understanding how they plug this objective into PPO, though. Do they ...
1
vote
2
answers
160
views
In this RL problem, why is the substitution $q_*(A_t)=\mathbb{E}[R_t | A_t] \to R_t $ valid within this expectation (over actions)?
The question that follows is from a machine learning textbook (Reinforcement Learning, Sutton and Barto, page 39, link).
Given:
a probability distribution over actions $x$ (a policy) at time $t$ ...
1
vote
1
answer
129
views
What studied statistical model (if any) fits this application?
I'm having trouble identifying what statistical model or methodology is suited for my application.
My situation is as follows:
I want to create a stock trading agent that trades a single stock-cash ...
1
vote
0
answers
44
views
Scalable unordered category encoders
I am trying to design a neural network for a scalable target assignment problem and use RL to train it by reward feedback. My major concern is making the neural network somehow adaptable to different ...
1
vote
0
answers
136
views
Batches in policy gradient methods – theory vs practice
I am currently trying to understand the implementation of batching in policy gradient / actor-critic methods. My understanding is that these methods in principle work as follows: collect a batch of $N$...
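A sketch of the textbook version of the batch, under the assumption that each batch element is one full trajectory (in practice, implementations often mix timesteps from truncated rollouts instead); the helper below is illustrative only:

```python
import numpy as np

def batch_policy_gradient(grad_logp, returns):
    """Monte Carlo policy-gradient estimate from a batch of N full trajectories.
    grad_logp : array (N, n_params), sum over t of grad log pi(a_t|s_t) per trajectory
    returns   : array (N,), the return of each trajectory"""
    grad_logp = np.asarray(grad_logp)
    returns = np.asarray(returns)
    return (grad_logp * returns[:, None]).mean(axis=0)
```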
5
votes
1
answer
200
views
Is reinforcement learning conceptually equivalent to time-series with a latent dependent variable?
In reinforcement learning, there is a state $s_t$, an action $a_t$, and a policy $\pi(a|s)$ that maps states to the Probability Distribution Function (PDF) of actions. The goal is to choose the ...
2
votes
1
answer
135
views
Do Bernoulli bandits need a different treatment if the rewards are sparse?
I have a problem where, effectively, my slot machines have very low payout probability (on the order of 1% for the "best" slot machines) and my goal is to minimize the number of actions to ...
0
votes
0
answers
57
views
Double Q-learning
Can we expect the two Q tables to converge together, meaning that abs(Q1-Q2).max() converges to zero? Can we say that?
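For context, a tabular double Q-learning sketch (my own illustration, not the asker's code): each table is updated toward the other's evaluation of its own greedy action, so the two tables need not be equal at any finite time, although under the usual tabular convergence conditions both are driven toward the same fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular double Q-learning step: with probability 1/2 update Q1 using Q2's
    evaluation of Q1's greedy action, otherwise the symmetric update for Q2."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```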
2
votes
1
answer
571
views
Why do SAC and TD3 use multiple critic networks as opposed to single network with multiple outputs?
Q-function approximators based on neural networks tend to overestimate the Q-function. Accordingly, reinforcement learning algorithms such as Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) use ...
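For context, the piece TD3 and SAC share is the clipped double-Q bootstrap target sketched below (my own minimal version, not library code); one common argument for two separate networks rather than one network with two heads is that fully separate parameters keep the two estimates' errors less correlated, which is what the min() relies on.

```python
import numpy as np

def twin_critic_target(r, not_done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q bootstrap target: take the elementwise minimum of the two
    target critics' estimates before discounting, to damp overestimation."""
    return r + gamma * not_done * np.minimum(q1_next, q2_next)

# toy batch of 3 transitions
print(twin_critic_target(np.array([1.0, 0.0, 1.0]),
                         np.array([1.0, 1.0, 0.0]),
                         np.array([2.0, 1.5, 3.0]),
                         np.array([1.8, 2.0, 2.5])))
```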
0
votes
0
answers
77
views
How is this possible? Test loss is below train loss
I got this graph for my loss.
As you can see, the distance between the two curves is quite large!
Can we say it shows the bias is large and it's underfitting?
Is what I just said true or not?...
0
votes
0
answers
31
views
Can MDP or reinforcement learning method be used on an And-Or graph?
I want to perform a pathfinding task in a graph using RL. I consider each node of the graph as a state, but there are and/or relations between these states. So I'm wondering if I can still use the ...
1
vote
0
answers
72
views
Convergence of the SARSA algorithm
I'm trying to figure out the convergence of the SARSA algorithm, but I need help. In the article "On the Convergence of Stochastic Iterative Dynamic Programming" by Jaakkola, Jordan, and ...
1
vote
1
answer
209
views
Extending Bernoulli thompson sampling for slate bandit problems to the contextual setting
I am trying to implement the extension to Marginal Posterior Sampling for Slate Bandits, which is a context-free slate bandit algorithm that uses Thompson sampling with a Bernoulli prior.
I want to ...
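For context, the context-free Beta-Bernoulli Thompson sampling loop that the extension would build on; the arm count, horizon, and true success probabilities below are toy values of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.ones(K)                       # Beta(1, 1) prior per arm
beta = np.ones(K)
true_p = np.array([0.2, 0.5, 0.7])       # unknown to the agent

for _ in range(2000):
    theta = rng.beta(alpha, beta)        # one posterior sample per arm
    a = int(np.argmax(theta))            # play the arm with the largest sampled mean
    reward = rng.random() < true_p[a]
    alpha[a] += reward                   # conjugate Beta update
    beta[a] += 1 - reward

print(alpha / (alpha + beta))            # posterior means; arm 2 should dominate the pulls
```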
0
votes
1
answer
288
views
Thompson Sampling with Two objectives - Cost and Success Rate
I have implemented a Thompson sampling algorithm with beta distribution that chooses between two processors to process the payments for each transaction such that it maximizes the success rate. For ...
1
vote
1
answer
446
views
Understanding the regret bound of stochastic bandit vs. adversarial bandit
I am a beginner at MAB. One thing that puzzles me these days:
The regret of the UCB policy (and Thompson Sampling with no prior) for stochastic bandit is $\sqrt{KT\ln T}$, but the regret of the EXP3 ...
2
votes
1
answer
455
views
Design an algorithm to improve the hangman game for letter prediction [closed]
I'm working on an algorithm which is permitted to use a training set of approximately 250,000 dictionary words.
I have built, and am providing here, a basic, working algorithm. This algorithm will ...
2
votes
2
answers
151
views
Trying to reproduce proof of Bandit Gradient Algorithm as SGD
I'm trying to make sense of the "Bandit Gradient Algorithm as Stochastic Gradient Ascent" proof in Sutton and Barto's intro to RL textbook. I'm stuck on the line
$E[(q_*(A_t)-B_t)\frac{\...