All Questions
Tagged with deep-learning or neural-networks
9,985 questions
0
votes
0
answers
40
views
Independence and Correlation Structure of Weights Generated by a Hypernetwork
Suppose a hypernetwork $\mathcal{H}$ takes a latent variable $z \sim p_z(z)$,
where $p_z$ is Gaussian, and outputs the parameters of another neural network $f$.
In particular, each weight $w_i$ of $f$ ...
0
votes
0
answers
19
views
Why does batch normalization make lower layers 'useless' in purely linear networks?
I'm reading the Deep Learning book by Goodfellow, Bengio, and Courville (Chapter 8, Section 8.7.1 on Batch Normalization, page 315). The authors use a simple example of a deep linear network without ...
1
vote
0
answers
24
views
Is there any work on pretraining RNN models? [closed]
I really want to play around with RNNs. I'm trying to build an AI assistant with RNNs to run on my machine, as I'm always obsessed with RNN models...
To make the performance good, I think I need to do some ...
0
votes
0
answers
9
views
Are there any other powerful optimization tools available besides the ABC and PSO algorithms? [duplicate]
What other optimization tools are powerful enough to improve the accuracy of a neural network model? Please suggest recent, powerful tools.
0
votes
0
answers
17
views
How to compare WT vs mutant predictions with MC Dropout ensemble (M=5, T=100) in a binary classifier?
I’m using an ensemble of M = 5 deep neural networks, each evaluated with T = 100 Monte Carlo dropout samples at test time to estimate predictive uncertainty.
The model performs binary classification (...
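Not part of the question, but a minimal sketch of the M × T sampling loop such a setup implies, assuming a hypothetical list `models` of M trained networks, each ending in a single logit:

```python
# Hypothetical sketch of M x T MC-dropout sampling for a binary classifier.
# `models` (a list of M trained networks containing dropout layers) and the
# single-logit output are assumptions, not part of the original question.
import torch

def mc_dropout_predict(models, x, T=100):
    probs = []
    for model in models:
        model.train()  # keeps dropout active at test time (note: also flips BatchNorm to train mode)
        with torch.no_grad():
            for _ in range(T):
                probs.append(torch.sigmoid(model(x)))  # P(class = 1) for one stochastic pass
    probs = torch.stack(probs)          # shape: (M * T, ...)
    return probs.mean(0), probs.std(0)  # predictive mean and spread across all M * T samples
```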
0
votes
0
answers
26
views
Is it a bad idea to use Transformer models on long-tailed datasets?
I’m working on a video classification task with a long-tailed dataset where a few classes have many samples while most classes have very few.
More specifically, my dataset has around 9k samples and 3....
1
vote
0
answers
26
views
Does compositional structure (actually) mitigate the curse of dimensionality?
The paper "Deep Quantile Regression: Mitigating the Curse of Dimensionality Through Composition" makes the following claim (top of page 4):
It is clear that smoothness is not the right ...
2
votes
0
answers
25
views
What causes the degradation problem - the higher training error in much deeper networks?
In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that
"When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0
votes
0
answers
24
views
What are the expected ideal values for the discriminator losses when using a generative adversarial imputation network to impute missing values?
I am new to GAIN (generative adversarial imputation network). I am trying to use GAIN to impute missing values. I have a question about the values of the losses for the discriminator. Are the values ...
0
votes
0
answers
40
views
Multiplying probabilities of weights in Bayesian neural networks to formulate a prior
A key element in Bayesian neural networks is finding the probability of a set of weights, so that it can be applied to Bayes' rule.
I cannot think of many ways of doing this, for P(w) (also sometimes ...
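For what it's worth, the most common choice in the literature is a fully factorized Gaussian prior, under which multiplying the per-weight probabilities is exactly an independence assumption:
$$P(w) = \prod_i \mathcal{N}(w_i \mid 0, \sigma^2), \qquad \log P(w) = \sum_i \log \mathcal{N}(w_i \mid 0, \sigma^2).$$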
1
vote
1
answer
58
views
Bayes-by-backprop - meaning of partial derivative
The Google Deepmind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
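For context, a sketch of where that term enters, following the paper's parameterisation $w = \mu + \log(1+\exp(\rho)) \circ \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$: the chain rule puts $\frac{\partial f(w,\theta)}{\partial w}$ in both updates,
$$\Delta_\mu = \frac{\partial f(w,\theta)}{\partial w} + \frac{\partial f(w,\theta)}{\partial \mu}, \qquad \Delta_\rho = \frac{\partial f(w,\theta)}{\partial w}\,\frac{\varepsilon}{1+\exp(-\rho)} + \frac{\partial f(w,\theta)}{\partial \rho},$$
since $\frac{\partial}{\partial \rho}\log(1+\exp(\rho)) = \frac{1}{1+\exp(-\rho)}$.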
1
vote
1
answer
113
views
Function omitted during formula derivation (KL-divergence)
From the above, I am trying to derive the below:
However, I do not see why the $q_\theta(w)$ has been omitted from $\log p(D)$ in equations 17 and 18.
Here is my attempt to derive the above:
$$\begin{...
3
votes
0
answers
60
views
Normalizing observations in a nonlinear state space model
I am modelling the sequence $\{(a_t,y_t)\}_t$ as follows:
$$
\begin{cases}
Y_{t+1} &= g_\nu(X_{t+1}) + \alpha V_{t+1}\\
X_{t+1} &= X_t + \mu_\xi(a_t) + \sigma_\psi(a_t)Z_{t+1}\\
X_0 &= ...
0
votes
0
answers
65
views
Why is one-hot encoding used in RL instead of binary encoding?
Basically, the question above: in RL, people typically encode the state as a tensor consisting of a plane with "channels", e.g. the original AlphaZero paper. These channels are typically one-...
0
votes
0
answers
38
views
Why do flow neural networks that are trained to only simulate vector fields for specific timesteps perform poorly compared to regular models?
I am currently learning about flow matching models and wanted to test whether or not we could train a flow matching model on just two time steps, 0 and 0.5, and sample at only those two time steps to ...
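As a point of reference, a sketch of the linear-path conditional flow matching objective the question appears to describe, with $t$ restricted to the two time steps:
$$x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t \in \{0,\,0.5\},\; x_0,\, x_1}\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2.$$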
1
vote
0
answers
77
views
Do you need paired data to train multimodal?
I have video, audio, and text data. The intent is to use a multimodal model for binary classification.
However, the data is not paired (i.e. the audio and text are not from the same video recording).
I've ...
0
votes
0
answers
26
views
Does SWA imply high Fisher info is bad?
Stochastic Weight Averaging (SWA) claims that deep learning MLE points in "flatter loss regions" improve generalization to holdout data. This is a famous paper in deep learning with 2000+ ...
0
votes
0
answers
47
views
How do you pick the best model out of a set of models if you knew both the training error and validation error of each model?
An interesting question I stumbled upon today is this:
Suppose I train models $m_1, m_2, m_3, \ldots, m_N$, where each $m_i$, $i = 1, \ldots, N$ is associated with a hyperparameter $i$. All models are ...
1
vote
1
answer
136
views
What is the current consensus on "using test set as training set, post testing"? [duplicate]
This question is inspired by the blog post https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning.
The gist of it is that ...
0
votes
0
answers
34
views
Training process for LSTM based sequence labelling
I'm training an LSTM to predict a binary anomaly sequence from multi-dimensional, irregularly sampled input sequences. While CNNs perform adequately, I'm struggling to get good performance from my ...
2
votes
0
answers
67
views
Preventing data leakage when using street-level aggregated features in classification
I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
1
vote
0
answers
32
views
PPO-like approach but with a search algorithm
I'm developing an AI for a 1v1 game. I have already programmed a system for generating these rewards.
Currently, I have some heuristics and am using linear weights tuned with a genetic algorithm to ...
4
votes
1
answer
92
views
Proving universal approximation through ReLU
We were discussing universal approximation theorems for neural networks and showed that the triangular function
$$
h(x) =
\begin{cases}
x+1, & x \in [-1,0] \\
1-x, & x \in [0,1] \\
0, & \text{otherwise}
\end{cases}
$$
...
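One standard construction (not necessarily the one from the question's discussion) expresses this triangular function exactly with three ReLUs:
$$h(x) = \mathrm{ReLU}(x+1) - 2\,\mathrm{ReLU}(x) + \mathrm{ReLU}(x-1),$$
as can be checked piecewise: e.g. at $x = 0$ this gives $1 - 0 + 0 = 1$, and at $x = 2$ it gives $3 - 4 + 1 = 0$.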
1
vote
0
answers
29
views
Addressing underfitting in pytorch video classification model
I am building a model to classify short videos of a person doing sign language (1-2 seconds, 30 fps, 512x512) into $n$ labels.
I find that no matter which model I use (transformers, 3D CNN, ...) or ...
0
votes
1
answer
97
views
Why do most RAG pipelines fail at multi-step reasoning, even with correct chunk retrieval?
I’ve seen many retrieval-augmented generation (RAG) pipelines return highly relevant context chunks — and yet fail catastrophically on multi-hop reasoning.
For example, even when the source document ...
1
vote
0
answers
93
views
KL divergence and deep learning paradigm
My question is regarding the paradigm of deep learning: I do not get where the cost functions come from. For example, for a classification task, are we treating the encoder as the expected value of ...
0
votes
0
answers
69
views
Proving Convergence of Mean and Variance in a Recursive Gaussian Update Process
I'm researching the statistical convergence properties of a recursive system that arises during the training of a custom neural network structure.
My specific question is: How can I prove convergence of ...
2
votes
0
answers
35
views
GNN based unsupervised Anomaly Detection for Heterogenous Graphs
I am working on a project where I am doing unsupervised anomaly detection on employee expenses for HCP Transfer of Value. I am trying to use a Graph Neural Network to detect anomalies with proper ...
1
vote
0
answers
30
views
Abysmal Results on a Simple 1D Case. NN Architecture: Any Heuristics/Rules of Thumb? [duplicate]
First and foremost, I am looking for a practical answer to the simplest test case, sketched below.
In general I would also be interested in any motivated, rational heuristics on the optimal layout for ...
0
votes
0
answers
83
views
Is it correct to count the number of layers like this in neural networks?
I’ve seen some tutorials and papers that count the number of layers in a way that I find a bit confusing.
For example, consider a model like this:
...
0
votes
0
answers
59
views
Does restricting weight space in Deep Learning models make training faster?
I want to know if the following problem has a name, and I'd also like to get some papers to read on the subject.
Suppose I have a model to learn, say $A$, and it has a huge number of parameters to ...
3
votes
1
answer
129
views
How to test if a trained neural network is a linear regression?
I am working on a regression task and I consider a vanilla multilayer perceptron $f_{\theta} : \mathcal{X} \rightarrow \mathbb{R}$ with non-polynomial activation functions, where the last layer is just a linear ...
0
votes
0
answers
44
views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
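For a concrete statement of the problem: if a neuron's pre-activation is $z = \sum_i w_i a_i + b$ and every input $a_i = \sigma(\cdot) \in (0,1)$ is positive, then
$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z}\, a_i,$$
so all the weight gradients of that neuron share the sign of $\partial L / \partial z$.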
1
vote
1
answer
84
views
Are VAEs considered explainable AI?
Are VAEs considered explainable AI? To me, they are, because the latent variables are interpretable, e.g., you change one and you might see its effects on the head rotation (for a dataset of faces, for ...
1
vote
0
answers
37
views
Is it possible to train a neural network to reconstruct the total image of an object based on a partial image? [closed]
Let's say that I want to train a network where the input is an image of a small part of an object. For example: an image of a building with corners, some part of the exterior walls, and some part of the roof. I want ...
1
vote
0
answers
59
views
How many output features per hidden layer in a neural network?
I am training a neural network using R-Torch for a regression problem. My dataset has 22 features, and I currently have a neural network composed of one hidden layer and one output layer.
My question ...
0
votes
0
answers
19
views
GAN training and HPO for custom dataset
I have a custom dataset made of b/w superficial-defect images. The dataset is quite far from classical CV datasets like CIFAR or ImageNet. I know from supervised deep learning that the correct ...
3
votes
1
answer
132
views
Simple question about VAEs
I have trouble understanding the minimization of the KL divergence.
In this link https://www.ibm.com/think/topics/variational-autoencoder
They say, "One obstacle to using KL divergence for ...
0
votes
0
answers
38
views
How to include contextual information along with input to a multi layer perceptron
Let $\mathbf{x}_k \in \mathbb{R}^{n\times 1}$ be an $n$-dimensional input to a multi-layer perceptron (MLP) at time $t = k$. The output is $\mathbf{x}_{k+1}\in \mathbb{R}^{n\times 1}$ at time $t = k+1$. ...
4
votes
1
answer
89
views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
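Assuming the hidden state satisfies $h_t \in \mathbb{R}^m$, and writing the pre-activation as $s_t = W_o x_t + U_o h_{t-1} + b_o$ with $\delta_t = \partial L / \partial s_t \in \mathbb{R}^m$, the weight gradients are sums of outer products with the expected shapes:
$$\frac{\partial L}{\partial W_o} = \sum_t \delta_t x_t^\top \in \mathbb{R}^{m \times d}, \qquad \frac{\partial L}{\partial U_o} = \sum_t \delta_t h_{t-1}^\top \in \mathbb{R}^{m \times m}, \qquad \frac{\partial L}{\partial b_o} = \sum_t \delta_t.$$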
3
votes
2
answers
125
views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
0
votes
0
answers
41
views
Does Batch Normalization act as a regularizer when we don't shuffle the dataset at each epoch?
Batch Normalization (BN) is a technique to accelerate convergence when training neural networks. It is also assumed to act as a regularizer, since the mean and standard deviation are ...
1
vote
1
answer
59
views
Is it possible to ignore the past values of the response variable in an LSTM model with multiple predictor variables?
I have an LSTM model to predict a variable by considering multiple variables. (Say the target variable is river discharge and the predictors are rainfall, temperature, evapotranspiration etc.) There ...
4
votes
1
answer
89
views
Stochastic Gradient Descent for Multilayer Networks
I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formulae for the weight update rule. However, I don't ...
1
vote
1
answer
46
views
FISTA Optimizer Implementation for Neural Networks with Sparse Regularization
I'm implementing a FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) optimizer in PyTorch for training neural networks with sparse regularization. My implementation doesn't seem to be working as ...
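For reference, the textbook FISTA iteration with an $\ell_1$ penalty $\lambda\|x\|_1$ and step size $\eta$, which such an optimizer should reproduce, is
$$x_k = \operatorname{prox}_{\eta\lambda\|\cdot\|_1}\!\big(y_k - \eta \nabla f(y_k)\big), \quad t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}, \quad y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}),$$
where the prox is elementwise soft-thresholding, $\operatorname{prox}(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \eta\lambda,\, 0)$.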
10
votes
3
answers
2k
views
Is Backpropagation faulty?
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z/\partial w$ is no longer correct, as $z$ has changed to $z^*$ and $z^*$ $\...
-1
votes
1
answer
74
views
Deep Learning book equation 2.54- Ian GoodFellow [closed]
What does "arg min" stand for in the following?
$$c^*= \arg \min_c \|x-g(c)\|_2 \tag{2.54}$$
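In words: $\arg\min$ returns the minimizing argument rather than the minimum value, so equation 2.54 says $c^*$ is a code satisfying
$$\|x - g(c^*)\|_2 \le \|x - g(c)\|_2 \quad \text{for all } c.$$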
1
vote
1
answer
98
views
Improving loss but unchanging metrics in Transformer model
Setting:
I'm training a neural network for classification purposes. This neural network leverages a transformer-based architecture and PU-learning. PU-learning is a setting where you solely ...
4
votes
1
answer
218
views
Do i.i.d. assumptions extend to datasets of independently generated sequences in modern sequence models (e.g., RNNs)?
In standard machine learning settings with cross-sectional data, it's common to assume that data points are independently and identically distributed (i.i.d.) from some fixed data-generating process (...
0
votes
0
answers
55
views
How can I reduce the number of runs needed to measure variability from two stochastic factors in my experiment?
I’m testing a new random data augmentation technique in a neural network. There are two main sources of randomness:
Network initialization and training (e.g., random parameter
initialization, ...