
Questions tagged [reinforcement-learning]

A set of dynamic strategies by which an algorithm can learn the structure of an environment online by adaptively taking actions associated with different rewards so as to maximize the rewards earned.

1 vote
0 answers
44 views

I'm training a DDPG agent on a 6-axis robot arm to move an object from A to B. The inputs are the coordinates of the joints along with the coordinates of the object that needs to be moved. So, I'm kinda ...
asked by Bejo
0 votes
0 answers
65 views

Basically, the question above: in RL, people typically encode the state as a tensor consisting of planes with "channels", e.g. the original AlphaZero paper. These channels are typically one-...
asked by FriendlyLagrangian
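For context, a minimal NumPy sketch of the one-hot "plane" encoding referred to above, with an assumed 3x3 board and two piece types:

    import numpy as np

    # Assumed toy 3x3 board: 0 = empty, 1 = own piece, 2 = opponent piece.
    board = np.array([[1, 0, 2],
                      [0, 1, 0],
                      [2, 0, 0]])

    # One binary channel (plane) per piece type, stacked into a C x H x W tensor.
    planes = np.stack([(board == 1).astype(np.float32),
                       (board == 2).astype(np.float32)])
    print(planes.shape)  # (2, 3, 3)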
1 vote
0 answers
32 views

I'm developing an AI for a 1v1 game. I have already programmed a system for generating these rewards. Currently, I have some heuristics and am using linear weights tuned with a genetic algorithm to ...
asked by vbxr
0 votes
0 answers
26 views

In the RL-as-inference framework (Levine, 2018) and its application to language modeling, in particular seq2seq modeling, we care about learning a policy $p_\theta(y \mid x)$, where $x$ is the input sequence ...
asked by Kaiwen
1 vote
0 answers
50 views

Given a finite computational budget $C$, suppose we want to learn an optimal policy $\Pi$ for large language models that dynamically decides between two inference-time reasoning strategies: deepening ...
asked by user1666769
2 votes
1 answer
209 views

I'm trying to understand the effect of clipping the policy loss in PPO. PPO maximizes: $$\mathbb{E}[\text{min}(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t)]\,,$$ where $r_t(\...
asked by ludog
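For reference, a minimal PyTorch-style sketch of that clipped surrogate (tensor names are assumed; this is not the asker's code):

    import torch

    def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
        # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
        ratio = torch.exp(log_prob_new - log_prob_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
        # PPO maximizes E[min(...)], so the loss to minimize is the negative mean.
        return -torch.min(unclipped, clipped).mean()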
0 votes
0 answers
75 views

I hope this is the right place to ask this question, but here goes. I'm building a PPO agent to play Ticket to Ride (Northern Lights edition) for fun, but my loss plots aren't showing a lot of promise ...
asked by MKJ
6 votes
1 answer
287 views

I was watching this Stanford CS229 lecture by Andrew Ng (exact time stamp link), and while trying to derive the Bellman equation for a policy $ \pi $, I ran into a conceptual doubt. Here's the ...
asked by JustCurious
1 vote
1 answer
70 views

While training and evaluating a network model on an air-combat environment with the PPO RL algorithm, I was surprised to find that a credible evaluation of the trained model needs far more episodes than ...
asked by zhixin
0 votes
0 answers
42 views

I am doing a project where I want to estimate the likelihood of an agent's mixed strategy for a current state given a list of previous state action pairs. I want to find a prior distribution on the ...
asked by MrFlapjack
0 votes
0 answers
42 views

I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space ...
asked by Pronitron
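A rough sketch of how a custom feature extractor is usually plugged into Stable Baselines3's PPO via `policy_kwargs`; the flat MLP below is only a stand-in for the GNN described above, and the environment name is an assumption:

    import gymnasium as gym
    import torch.nn as nn
    from stable_baselines3 import PPO
    from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

    class FlatExtractor(BaseFeaturesExtractor):
        """Stand-in for a GNN: any module mapping observations to a feature vector."""
        def __init__(self, observation_space, features_dim=64):
            super().__init__(observation_space, features_dim)
            n_in = int(observation_space.shape[0])
            self.net = nn.Sequential(nn.Linear(n_in, features_dim), nn.ReLU())

        def forward(self, obs):
            return self.net(obs)

    env = gym.make("CartPole-v1")  # placeholder environment, assumed
    model = PPO("MlpPolicy", env,
                policy_kwargs=dict(features_extractor_class=FlatExtractor,
                                   features_extractor_kwargs=dict(features_dim=64)))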
4 votes
1 answer
202 views

I am doing a finance-related project, where we take the 'market' into account, represented by covariance matrices and economic indicators. As market participants are price takers, we cannot ...
asked by dragonforce
1 vote
0 answers
107 views

I am evaluating the performance of a reinforcement learning (RL) agent trained on a custom environment using DQN (based on Gym). The current evaluation process involves running the agent on the same ...
asked by desert_ranger
1 vote
1 answer
103 views

I'm looking at this notebook: https://github.com/WhatIThinkAbout/BabyRobot/blob/master/Multi_Armed_Bandits/Part%205b%20-%20Thompson%20Sampling%20using%20Conjugate%20Priors.ipynb It describes methods ...
asked by Florin Andrei
0 votes
0 answers
60 views

I am implementing a reinforcement learning (RL) trading bot using a custom Gym environment. For simplicity, I assume I know the future prices of an asset. Here is how my setup works: Position: ...
asked by vladkkkkk
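For setups like this, a bare-bones custom Gymnasium environment skeleton (the long/flat position logic and the reward are placeholder assumptions, not the asker's design):

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class ToyTradingEnv(gym.Env):
        """Toy long/flat trading environment over a known price series (assumed)."""
        def __init__(self, prices):
            super().__init__()
            self.prices = np.asarray(prices, dtype=np.float32)
            self.action_space = spaces.Discrete(2)  # 0 = flat, 1 = long
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.t = 0
            return self.prices[[self.t]], {}

        def step(self, action):
            # Reward: next-step price change if long, zero if flat.
            reward = float(action) * float(self.prices[self.t + 1] - self.prices[self.t])
            self.t += 1
            terminated = self.t >= len(self.prices) - 1
            return self.prices[[self.t]], reward, terminated, False, {}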
2 votes
1 answer
142 views

In Sutton and Barto's book on RL, it says: The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value: $$\mathbf{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi (s).\tag{5.4}$$ The book ...
asked by Lazy Guy
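For context, the standard change-of-measure argument behind (5.4), with $b$ the behaviour policy that generates the data and $\rho_{t:T-1}=\prod_{k=t}^{T-1}\pi(A_k\mid S_k)/b(A_k\mid S_k)$ (the usual textbook reasoning, not a quote from the book):
$$\mathbf{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = \sum_{\tau}\Pr\nolimits_b(\tau\mid S_t=s)\,\frac{\Pr_\pi(\tau\mid S_t=s)}{\Pr_b(\tau\mid S_t=s)}\,G_t(\tau) = \sum_{\tau}\Pr\nolimits_\pi(\tau\mid S_t=s)\,G_t(\tau) = v_\pi(s),$$
where the sum runs over trajectories from $t$ to $T$; the environment's transition probabilities cancel in the ratio, leaving exactly $\rho_{t:T-1}$.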
1 vote
1 answer
86 views

On pages 101-102 of Sutton and Barto's book on RL (2018 edition), where it is desired to prove that $\epsilon$-greedy is an improvement over $\epsilon$-soft policies, it is written: Thus, by the policy ...
asked by Lazy Guy
0 votes
0 answers
70 views

I'm working with the Probability of Improvement (POI) metric described in [1], Section 4.3. The paper introduces various aggregate metrics in Section 4.3, and for most of these metrics (IQM, mean, ...
asked by desert_ranger
0 votes
0 answers
28 views

I'm working on comparing two reinforcement learning algorithms where: (1) running experiments is extremely computationally expensive, and (2) based on preliminary results, Algorithm B consistently and ...
asked by desert_ranger
3 votes
0 answers
95 views

I hope you're all doing well. I am currently working on a reinforcement learning problem to solve an optimization problem in wireless networks, and I'm having trouble designing the reward and ...
asked by Mehran Varshosaz
4 votes
1 answer
157 views

(Related question: Deriving 'State-Action marginal' in Reinforcement Learning.) The CS 285 (Berkeley) lecture https://www.youtube.com/watch?v=GKoKNYaBvM0&list=...
asked by Jing
1 vote
1 answer
58 views

I am reading Sutton & Barto's book. I am stuck at exercise 3.13. The question is to write $q_\pi$ in terms of $v_\pi$ and $p(s',r \mid s,a)$. I traced these steps: $q_\pi(s,a) = \sum_g g \text{ Pr}\{G_t=g|S_t=s, A_t=a\}$...
asked by Huseyin Okan Demir
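For reference, the identity that exercise asks for (the standard textbook relation, stated here rather than derived):
$$q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr].$$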
2 votes
1 answer
140 views

So, expected SARSA defines the update as: $$ Q(s,a) = Q(s,a) + \alpha (R + \mathbb{E}_{a'\sim\pi(s')}[Q(s', a')] - Q(s,a)) $$ whereas SARSA, with $a'\sim\pi(s')$, defines the update as: $$ Q(s,a) = Q(s,a) +\alpha (...
asked by Alberto
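A minimal tabular sketch of the two updates side by side (array names assumed; the discount factor is written explicitly):

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # SARSA bootstraps on the single sampled next action a' ~ pi(.|s').
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.99):
        # Expected SARSA bootstraps on the expectation over pi(.|s') instead of one sample.
        target = r + gamma * np.dot(pi_next, Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])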
1 vote
0 answers
37 views

The adaptive learning methods AdaGrad, AdaDelta, RMSprop, Adam, and later variants all require $g_t^2$, which is the gradient multiplied by itself elementwise. Why is this needed? I ...
asked by Your neighbor Todorovich
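A toy sketch of where $g_t^2$ enters, using RMSprop-style second-moment tracking (plain NumPy, assumed hyperparameters):

    import numpy as np

    def rmsprop_step(w, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
        # v is an exponential moving average of the elementwise squared gradient g_t^2.
        v = beta * v + (1.0 - beta) * grad**2
        # Each coordinate's step is scaled by an estimate of its own gradient magnitude.
        w = w - lr * grad / (np.sqrt(v) + eps)
        return w, v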
1 vote
1 answer
90 views

In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$ and I can't seem to figure ...
asked by Rafay Khan
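A short sketch of the gradient bandit update from that chapter, with the running average reward playing the role of the baseline $B_t$ (names assumed):

    import numpy as np

    def gradient_bandit_step(H, avg_reward, action, reward, alpha=0.1):
        # Softmax policy over the preferences H.
        pi = np.exp(H - H.max()); pi /= pi.sum()
        one_hot = np.zeros_like(H); one_hot[action] = 1.0
        # Preference update; avg_reward is the baseline B_t.
        H += alpha * (reward - avg_reward) * (one_hot - pi)
        return H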
2 votes
1 answer
157 views

It seems to me that Generalized Policy Iteration (GPI) and Actor-Critic are the same, and Q-learning methods are a separate family of algorithms. I think both GPI and Actor-Critic describe the ...
asked by Daniel Mendoza
2 votes
1 answer
78 views

I was reading Spinning Up in DRL. I was wondering if the objective in policy gradient algorithms, the $J_\theta$, is exactly the expected value function $E_{S_0}[V^\pi(S_0)]$. I've never seen people ...
asked by Daniel Mendoza
3 votes
1 answer
142 views

I recently learned about the concept of prediction intervals (for regression) and I would like to apply them to my Deep Reinforcement Learning algorithm. I am working with a Model-Free RL algorithm ...
asked by desert_ranger
1 vote
0 answers
40 views

In the original paper, the proof of $|\bar{A}(s)|\leq 2\alpha\max_{s,a}|A_\pi(s,a)|$, where $\bar{A}(s)=\mathbb{E}_{\tilde{a}\sim\tilde{\pi}}[A_{\pi}(s,\tilde{a})]$, goes like this: $$\begin{equation}\begin{...
asked by joey
0 votes
1 answer
883 views

I'm currently working on implementing the Upper Confidence Bound (UCB) algorithm for the Multiarmed Bandit Problem, but I'm encountering some difficulties with the computation. Here's what I've ...
asked by mehruddin
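For comparison, a compact UCB-style selection rule in the Sutton & Barto form (the constant $c$ and the tie-breaking are assumptions):

    import numpy as np

    def ucb_select(counts, values, t, c=2.0):
        # Play each arm once before applying the confidence bound.
        if np.any(counts == 0):
            return int(np.argmin(counts))
        bonus = c * np.sqrt(np.log(t) / counts)
        return int(np.argmax(values + bonus))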
1 vote
0 answers
118 views

I started teaching myself reinforcement learning a few weeks ago. These days I've encountered a problem with the definition of the reward function. The reward function defines and quantifies the ...
asked by SuperSlow
2 votes
3 answers
174 views

Can a deep learning classifier, trained on a dataset derived from a reinforcement learning (RL) agent's interactions with an environment, achieve the same performance as the RL agent itself? Assuming ...
asked by yang
2 votes
1 answer
182 views

In the InstructGPT paper they define the objective of the RL stage (given there as an equation), and they try to maximize this objective using PPO. I have trouble understanding how they plug this objective into PPO, though. Do they ...
asked by Druudik
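One common way to wire such an objective into PPO in practice (a sketch under my own assumptions, not a quote from the paper) is to fold the KL term into the reward that PPO then maximizes:

    def rl_reward(reward_model_score, logprob_policy, logprob_ref, beta=0.02):
        # reward_model_score: r(x, y) from the learned reward model for a sampled completion y.
        # logprob_*: summed token log-probabilities of y under the RL policy / SFT reference.
        kl_penalty = beta * (logprob_policy - logprob_ref)
        return reward_model_score - kl_penalty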
1 vote
2 answers
160 views

The question that follows is from a machine learning textbook (Reinforcement Learning, Sutton and Barto, page 39, link). Given: a probability distribution over actions $x$ (a policy) at time $t$ ...
asked by stochasticmrfox
1 vote
1 answer
129 views

I'm having trouble identifying what statistical model or methodology is suited for my application. My situation is as follows: I want to create a stock trading agent that trades a single stock-cash ...
asked by QMath
1 vote
0 answers
44 views

I am trying to design a neural network for a scalable target assignment problem and use RL to train it by reward feedback. My major concern is making the neural network somehow adaptable to different ...
asked by zhixin
1 vote
0 answers
136 views

I am currently trying to understand the implementation of batching in policy gradient / actor-critic methods. My understanding is that these methods in principle work as follows: collect a batch of $N$...
asked by mathiasj
5 votes
1 answer
200 views

In reinforcement learning, there is a state $s_t$, an action $a_t$, and a policy $\pi(a|s)$ that maps states to the Probability Distribution Function (PDF) of actions. The goal is to choose the ...
asked by Colin T Bowers
2 votes
1 answer
135 views

I have a problem where, effectively, my slot machines have very low payout probability (on the order of 1% for the "best" slot machines) and my goal is to minimize the number of actions to ...
asked by Alexander Soare
0 votes
0 answers
57 views

Can we expect the two Q-tables to converge together? That is, does abs(Q1-Q2).max() converge to zero? Can we say that?
asked by user396307
2 votes
1 answer
571 views

Q-function approximators based on neural networks tend to overestimate the Q-function. Accordingly, reinforcement learning algorithms such as Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) use ...
asked by yuri kilochek
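The mechanism those algorithms share, in one line: the TD target bootstraps on the minimum of two target critics (clipped double-Q). A sketch with assumed tensor names; SAC's entropy term is omitted:

    import torch

    def twin_critic_target(reward, not_done, q1_next, q2_next, gamma=0.99):
        # Pessimistic bootstrap: take the elementwise minimum of the two target critics.
        q_next = torch.min(q1_next, q2_next)
        return reward + gamma * not_done * q_next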
0 votes
0 answers
77 views

I got this graph for my loss. As you can see, the gap between the two curves is large. Can we say it shows that the bias is large and the model is underfitting? Is what I just said true or not?...
asked by argo
0 votes
0 answers
31 views

I want to perform a pathfinding task on a graph using RL. I consider each node of the graph as a state, but there are and/or relations between these states. So I'm wondering if I can still use the ...
asked by Rafa
1 vote
0 answers
72 views

I'm trying to figure out the convergence of the SARSA algorithm, but I need help. In the article "On the Convergence of Stochastic Iterative Dynamic Programming" by Jaakkola, Jordan, and ...
asked by user
1 vote
1 answer
209 views

I am trying to implement the extension to Marginal Posterior Sampling for Slate Bandits, which is a context-free slate bandit algorithm that uses Thompson sampling with a Bernoulli prior. I want to ...
asked by Lucidnonsense
0 votes
1 answer
288 views

I have implemented a Thompson sampling algorithm with beta distribution that chooses between two processors to process the payments for each transaction such that it maximizes the success rate. For ...
asked by Aayush Gupta
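A minimal sketch of Beta-Bernoulli Thompson sampling over two processors (the counts and the uniform prior are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    successes = np.array([3.0, 5.0])  # successful payments per processor (assumed)
    failures = np.array([7.0, 5.0])   # failed payments per processor (assumed)

    # Sample a success rate for each processor from its Beta(1+s, 1+f) posterior,
    # route the transaction to the argmax, then update that processor's counts
    # with the observed outcome.
    samples = rng.beta(1.0 + successes, 1.0 + failures)
    chosen = int(np.argmax(samples))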
1 vote
1 answer
446 views

I am a beginner at MAB. One thing that puzzles me these days: The regret of the UCB policy (and Thompson Sampling with no prior) for stochastic bandit is $\sqrt{KT\ln T}$, but the regret of the EXP3 ...
asked by zxzx179
2 votes
1 answer
455 views

I'm working on an algorithm which is permitted to use a training set of approximately 250,000 dictionary words. I have built, and am providing here, a basic, working algorithm. This algorithm will ...
asked by driver
2 votes
2 answers
151 views

I'm trying to make sense of the "The Bandit Gradient Algorithm as Stochastic Gradient Ascent" proof in Sutton and Barto's intro to RL textbook. I'm stuck on the line $E[(q_*(A_t)-B_t)\frac{\...
asked by fyzr
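The step that usually causes trouble there is that the baseline contributes nothing in expectation because $\sum_a \partial\pi_t(a)/\partial H_t(x)=0$; written out (the standard argument, not a quote from the book):
$$\mathbb{E}\!\left[(q_*(A_t)-B_t)\,\frac{\partial\pi_t(A_t)/\partial H_t(x)}{\pi_t(A_t)}\right] = \sum_a \pi_t(a)\,(q_*(a)-B_t)\,\frac{\partial\pi_t(a)/\partial H_t(x)}{\pi_t(a)} = \sum_a (q_*(a)-B_t)\,\frac{\partial\pi_t(a)}{\partial H_t(x)},$$
and the $-B_t\sum_a \partial\pi_t(a)/\partial H_t(x)$ part is zero because the action probabilities always sum to one.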
