All Questions
Tagged with deep-learning or neural-networks
9,985 questions
0
votes
0
answers
40
views
Independence and Correlation Structure of Weights Generated by a Hypernetwork
Suppose a hypernetwork $\mathcal{H}$ takes a latent variable $z \sim p_z(z)$,
where $p_z$ is Gaussian, and outputs the parameters of another neural network $f$.
In particular, each weight $w_i$ of $f$ ...
0
votes
0
answers
19
views
Why does batch normalization make lower layers 'useless' in purely linear networks?
I'm reading the Deep Learning book by Goodfellow, Bengio, and Courville (Chapter 8, Section 8.7.1 on Batch Normalization, page 315). The authors use a simple example of a deep linear network without ...
1
vote
0
answers
24
views
Is there any work on pretraining RNN models? [closed]
I really want to play around with RNNs. I'm trying to build an AI assistant with RNNs to run on my machine, as I'm always obsessed with RNN models...
To make the performance good, I think I need to do some ...
0
votes
0
answers
9
views
Are there any other powerful optimization tools available besides the ABC and PSO algorithms? [duplicate]
What other optimization tools are powerful enough to improve the accuracy of a neural network model? Please suggest recent, powerful tools.
0
votes
0
answers
17
views
How to compare WT vs mutant predictions with MC Dropout ensemble (M=5, T=100) in a binary classifier?
I’m using an ensemble of M = 5 deep neural networks, each evaluated with T = 100 Monte Carlo dropout samples at test time to estimate predictive uncertainty.
The model performs binary classification (...
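Not part of the question, but a minimal sketch of the M × T sampling loop such a setup implies, assuming a hypothetical list `models` of M trained networks, each ending in a single logit:

```python
# Hypothetical sketch of M x T MC-dropout sampling for a binary classifier.
# `models` (a list of M trained networks containing dropout layers) and the
# single-logit output are assumptions, not part of the original question.
import torch

def mc_dropout_predict(models, x, T=100):
    probs = []
    for model in models:
        model.train()  # keeps dropout active at test time (note: also flips BatchNorm to train mode)
        with torch.no_grad():
            for _ in range(T):
                probs.append(torch.sigmoid(model(x)))  # P(class = 1) for one stochastic pass
    probs = torch.stack(probs)          # shape: (M * T, ...)
    return probs.mean(0), probs.std(0)  # predictive mean and spread across all M * T samples
```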
0
votes
0
answers
26
views
Is it a bad idea to use Transformer models on long-tailed datasets?
I’m working on a video classification task with a long-tailed dataset where a few classes have many samples while most classes have very few.
More specifically, my dataset has around 9k samples and 3....
1
vote
0
answers
26
views
Does compositional structure (actually) mitigate the curse of dimensionality?
The paper "Deep Quantile Regression: Mitigating the Curse of Dimensionality Through Composition" makes the following claim (top of page 4):
It is clear that smoothness is not the right ...
2
votes
0
answers
25
views
What causes the degradation problem - the higher training error in much deeper networks?
In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that
"When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0
votes
0
answers
24
views
What are the expected ideal values for the discriminator losses when using a generative adversarial imputation network to impute missing values?
I am new to GAIN (generative adversarial imputation network). I am trying to use GAIN to impute missing values. I have a question about the values of the losses for the discriminator. Are the values ...
0
votes
0
answers
40
views
Multiplying probabilities of weights in Bayesian neural networks to formulate a prior
A key element in Bayesian neural networks is finding the probability of a set of weights, so that it can be applied to Bayes' rule.
I cannot think of many ways of doing this, for P(w) (also sometimes ...
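For what it's worth, the most common choice in the literature is a fully factorized Gaussian prior, under which multiplying the per-weight probabilities is exactly an independence assumption:
$$P(w) = \prod_i \mathcal{N}(w_i \mid 0, \sigma^2), \qquad \log P(w) = \sum_i \log \mathcal{N}(w_i \mid 0, \sigma^2).$$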
1
vote
1
answer
58
views
Bayes-by-backprop - meaning of partial derivative
The Google Deepmind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
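For context, a sketch of where that term enters, following the paper's parameterisation $w = \mu + \log(1+\exp(\rho)) \circ \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$: the chain rule puts $\frac{\partial f(w,\theta)}{\partial w}$ in both updates,
$$\Delta_\mu = \frac{\partial f(w,\theta)}{\partial w} + \frac{\partial f(w,\theta)}{\partial \mu}, \qquad \Delta_\rho = \frac{\partial f(w,\theta)}{\partial w}\,\frac{\varepsilon}{1+\exp(-\rho)} + \frac{\partial f(w,\theta)}{\partial \rho},$$
since $\frac{\partial}{\partial \rho}\log(1+\exp(\rho)) = \frac{1}{1+\exp(-\rho)}$.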
1
vote
1
answer
113
views
Function omitted during formula derivation (KL-divergence)
From the above, I am trying to derive the below:
However, I do not see why the $q_\theta(w)$ has been omitted from $\log p(D)$ in equations 17 and 18.
Here is my attempt to derive the above:
$$\begin{...
3
votes
0
answers
60
views
Normalizing observations in a nonlinear state space model
I am modelling the sequence $\{(a_t,y_t)\}_t$ as follows:
$$
\begin{cases}
Y_{t+1} &= g_\nu(X_{t+1}) + \alpha V_{t+1}\\
X_{t+1} &= X_t + \mu_\xi(a_t) + \sigma_\psi(a_t)Z_{t+1}\\
X_0 &= ...
0
votes
0
answers
65
views
Why is one-hot encoding used in RL instead of binary encoding?
Basically, the question above: in RL, people typically encode the state as a tensor consisting of a plane with "channels", e.g. the original AlphaZero paper. These channels are typically one-...
0
votes
0
answers
38
views
Why do flow neural networks that are trained to only simulate vector fields for specific timesteps perform poorly compared to regular models?
I am currently learning about flow matching models and wanted to test whether or not we could train a flow matching model on just two time steps, 0 and 0.5, and sample at only those two time steps to ...
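As a point of reference, a sketch of the linear-path conditional flow matching objective the question appears to describe, with $t$ restricted to the two time steps:
$$x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t \in \{0,\,0.5\},\; x_0,\, x_1}\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2.$$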
1
vote
0
answers
77
views
Do you need paired data to train multimodal?
I have video, audio, and text data. The intent is to use a multimodal model for binary classification.
However, the data is not paired (i.e. the audio and text are not from the same video recording).
I've ...
0
votes
0
answers
26
views
Does SWA imply high Fisher info is bad?
Stochastic Weight Averaging (SWA) claims that deep learning MLE points in "flatter loss regions" improve generalization to holdout data. This is a famous paper in deep learning with 2000+ ...
0
votes
0
answers
47
views
How do you pick the best model out of a set of models if you knew both the training error and validation error of each model?
An interesting question I stumbled upon today is this:
Suppose I train models $m_1, m_2, m_3, \ldots, m_N$, where each $m_i$, $i = 1, \ldots, N$ is associated with a hyperparameter $i$. All models are ...
1
vote
1
answer
136
views
What is the current consensus on "using test set as training set, post testing"? [duplicate]
This question is inspired by the blog post https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning.
The gist of it is that ...
0
votes
0
answers
34
views
Training process for LSTM based sequence labelling
I'm training an LSTM to predict a binary anomaly sequence from multi-dimensional, irregularly sampled input sequences. While CNNs perform adequately, I'm struggling to get good performance from my ...
2
votes
0
answers
67
views
Preventing data leakage when using street-level aggregated features in classification
I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
1
vote
0
answers
32
views
PPO-like approach but with a search algorithm
I'm developing an AI for a 1v1 game. I have already programmed a system for generating these rewards.
Currently, I have some heuristics and am using linear weights tuned with a genetic algorithm to ...
4
votes
1
answer
92
views
Proving universal approximation through ReLU
We were discussing universal approximation theorems for neural networks and showed that the triangular function
$$
h(x) =
\begin{cases}
x+1, & x \in [-1,0] \\
1-x, & x \in [0,1] \\
0, & \text{otherwise}
\end{cases}
$$
...
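One standard construction (not necessarily the one from the question's discussion) expresses this triangular function exactly with three ReLUs:
$$h(x) = \mathrm{ReLU}(x+1) - 2\,\mathrm{ReLU}(x) + \mathrm{ReLU}(x-1),$$
as can be checked piecewise: e.g. at $x = 0$ this gives $1 - 0 + 0 = 1$, and at $x = 2$ it gives $3 - 4 + 1 = 0$.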
1
vote
0
answers
29
views
Addressing underfitting in pytorch video classification model
I am building a model to classify short videos of a person doing sign language (1-2 seconds, 30 fps, 512x512) into $n$ labels.
I find that no matter which model I use (transformers, 3D CNN, ...) or ...
0
votes
1
answer
97
views
Why do most RAG pipelines fail at multi-step reasoning, even with correct chunk retrieval?
I’ve seen many retrieval-augmented generation (RAG) pipelines return highly relevant context chunks — and yet fail catastrophically on multi-hop reasoning.
For example, even when the source document ...
1
vote
0
answers
93
views
KL divergence and deep learning paradigm
My question is regarding the paradigm of deep learning: I do not get where the cost functions come from. For example, for a classification task, are we treating the encoder as the expected value of ...
0
votes
0
answers
69
views
Proving Convergence of Mean and Variance in a Recursive Gaussian Update Process
I'm researching the statistical convergence properties of a recursive system that arises during the training of a custom neural network structure.
My specific question is: How can I prove convergence of ...
2
votes
0
answers
35
views
GNN based unsupervised Anomaly Detection for Heterogenous Graphs
I am working on a project where I am doing unsupervised anomaly detection on employee expenses for HCP Transfer of Value. I am trying to use a Graph Neural Network to detect anomalies with proper ...
1
vote
0
answers
30
views
Abysmal Results on a Simple 1D Case. NN Architecture: Any Heuristics/Rules of Thumb? [duplicate]
First and foremost, I am looking for a practical answer to the simplest test case, sketched below.
In general I would also be interested in any motivated, rational heuristics on the optimal layout for ...
0
votes
0
answers
83
views
Is it correct to count the number of layers like this in neural networks?
I’ve seen some tutorials and papers that count the number of layers in a way that I find a bit confusing.
For example, consider a model like this:
...
0
votes
0
answers
59
views
Does restricting weight space in Deep Learning models make training faster?
I want to know if the following problem has a name, and I'd also like to get some papers to read on the subject.
Suppose I have a model to learn, say $A$, and it has a huge number of parameters to ...
3
votes
1
answer
129
views
How to test if a trained neural network is a linear regression?
I am working on a regression task and I consider a vanilla multilayer perceptron $f_{\theta} : \mathcal{X} \rightarrow \mathbb{R}$ with non-polynomial activation functions, where the last layer is just a linear ...
0
votes
0
answers
44
views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
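For a concrete statement of the problem: if a neuron's pre-activation is $z = \sum_i w_i a_i + b$ and every input $a_i = \sigma(\cdot) \in (0,1)$ is positive, then
$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z}\, a_i,$$
so all the weight gradients of that neuron share the sign of $\partial L / \partial z$.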
1
vote
1
answer
84
views
Are VAEs considered explainable AI?
Are VAEs considered explainable AI? To me, they are, because the latent variables are interpretable, e.g., you change one and you might see its effects on the head rotation (for a dataset of faces, for ...
1
vote
0
answers
37
views
Is it possible to train a neural network to reconstruct the total image of an object based on a partial image? [closed]
Let's say that I want to train a network where the input is an image of a small part of an object. For example: an image of a building with corners, some part of the exterior walls, and some part of the roof. I want ...
1
vote
0
answers
59
views
How many output features per hidden layer in a neural network?
I am training a neural network using R-Torch for a regression problem. My dataset has 22 features, and I currently have a neural network composed of one hidden layer and one output layer.
My question ...
0
votes
0
answers
19
views
GAN training and HPO for custom dataset
I have a custom dataset made of b/w superficial-defect images. The dataset is quite far from classical CV datasets like CIFAR or ImageNet. I know from supervised deep learning that the correct ...
3
votes
1
answer
132
views
Simple question about VAEs
I have trouble understanding the minimization of the KL divergence.
In this link https://www.ibm.com/think/topics/variational-autoencoder
They say, "One obstacle to using KL divergence for ...
0
votes
0
answers
38
views
How to include contextual information along with input to a multi layer perceptron
Let $\mathbf{x}_k \in \mathbb{R}^{n\times 1}$ be an $n$-dimensional input to a multi-layer perceptron (MLP) at time $t = k$. The output is $\mathbf{x}_{k+1}\in \mathbb{R}^{n\times 1}$ at time $t = k+1$. ...
4
votes
1
answer
89
views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
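Assuming the hidden state satisfies $h_t \in \mathbb{R}^m$, and writing the pre-activation as $s_t = W_o x_t + U_o h_{t-1} + b_o$ with $\delta_t = \partial L / \partial s_t \in \mathbb{R}^m$, the weight gradients are sums of outer products with the expected shapes:
$$\frac{\partial L}{\partial W_o} = \sum_t \delta_t x_t^\top \in \mathbb{R}^{m \times d}, \qquad \frac{\partial L}{\partial U_o} = \sum_t \delta_t h_{t-1}^\top \in \mathbb{R}^{m \times m}, \qquad \frac{\partial L}{\partial b_o} = \sum_t \delta_t.$$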
3
votes
2
answers
125
views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
0
votes
0
answers
41
views
Does Batch Normalization act as a regularizer when we don't shuffle the dataset at each epoch?
Batch Normalization (BN) is a technique to accelerate convergence when training neural networks. It is also assumed to act as a regularizer, since the mean and standard deviation are ...
1
vote
1
answer
59
views
Is it possible to ignore the past values of the response variable in an LSTM model with multiple predictor variables?
I have an LSTM model to predict a variable by considering multiple variables. (Say the target variable is river discharge and the predictors are rainfall, temperature, evapotranspiration etc.) There ...
4
votes
1
answer
89
views
Stochastic Gradient Descent for Multilayer Networks
I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formulae for the weight update rule. However, I don't ...
1
vote
1
answer
46
views
FISTA Optimizer Implementation for Neural Networks with Sparse Regularization
I'm implementing a FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) optimizer in PyTorch for training neural networks with sparse regularization. My implementation doesn't seem to be working as ...
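For reference, the textbook FISTA iteration with an $\ell_1$ penalty $\lambda\|x\|_1$ and step size $\eta$, which such an optimizer should reproduce, is
$$x_k = \operatorname{prox}_{\eta\lambda\|\cdot\|_1}\!\big(y_k - \eta \nabla f(y_k)\big), \quad t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}, \quad y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}),$$
where the prox is elementwise soft-thresholding, $\operatorname{prox}(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \eta\lambda,\, 0)$.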
10
votes
3
answers
2k
views
Is Backpropagation faulty?
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z/\partial w$ is no longer correct, as $z$ has changed to $z^*$ and $z^*$ $\...
-1
votes
1
answer
74
views
Deep Learning book equation 2.54- Ian GoodFellow [closed]
What does "arg min" stand for in the following?
$$c^*= \arg \min_c \|x-g(c)\|_2 \tag{2.54}$$
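In words: $\arg\min$ returns the minimizing argument rather than the minimum value, so equation 2.54 says $c^*$ is a code satisfying
$$\|x - g(c^*)\|_2 \le \|x - g(c)\|_2 \quad \text{for all } c.$$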
1
vote
1
answer
98
views
Improving loss but unchanging metrics in Transformer model
Setting:
I'm training a neural network for classification purposes. This neural network leverages a transformer-based architecture and PU-learning. PU-learning is a setting where you solely ...
4
votes
1
answer
218
views
Do i.i.d. assumptions extend to datasets of independently generated sequences in modern sequence models (e.g., RNNs)?
In standard machine learning settings with cross-sectional data, it's common to assume that data points are independently and identically distributed (i.i.d.) from some fixed data-generating process (...
0
votes
0
answers
55
views
How can I reduce the number of runs needed to measure variability from two stochastic factors in my experiment?
I’m testing a new random data augmentation technique in a neural network. There are two main sources of randomness:
Network initialization and training (e.g., random parameter
initialization, ...