I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it becomes a feedforward network made of $N$ duplicates of the original network".
What is the meaning and origin of this number $N$? Is it some value you set when building the network, and if so, can I see an example in torch?
Or is it a feature of the training (optimization) algorithm?
In my mind, I think of RNNs as analogous to an exponential moving average, where past values gradually decay but there's no sharp (discrete) window. But it sounds like there is a fixed number $N$ that dictates the lookback window; is that the case? Or is it different for different architectures? How is this $N$ set for an LSTM vs. for a GRU, for example?
2 Answers
$N$ is how much temporal context the net is allowed to ingest before you use its output for a prediction or update step. Loosely speaking, it is how many timepoints the net sees before it provides a 'useful' output. It is also referred to as the sequence length or frame length.
$N$ ought to be large enough that the model can look over enough history to make a useful prediction, and small enough that it respects model/memory limitations and doesn't include too much irrelevant historical context. It is therefore set by a combination of domain knowledge and model/resource limitations.
You would need to chunk your sequential data into smaller frames of length $N$. Each frame constitutes a unit of sequential data that the net will provide a prediction for. By "chunk" I mean produce $N$-length frames from your data. They could be non-overlapping, partially-overlapping, random crops, etc.
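As a sketch of that chunking (plain Python, with a hypothetical helper name `make_frames`), producing partially-overlapping length-$N$ frames might look like:

```python
# Hypothetical sketch: cutting one long series into length-N frames.
def make_frames(series, N, stride=1):
    """Return every length-N window of `series`; stride < N gives overlap."""
    return [series[i:i + N] for i in range(0, len(series) - N + 1, stride)]

data = list(range(10))                      # a toy "time series": 0..9
frames = make_frames(data, N=4, stride=2)   # partially-overlapping frames
# frames == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Non-overlapping frames would use `stride=N`; random crops would pick start indices at random instead.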
If your data is higher resolution than you need for the task (e.g. per-second temperature data for predicting monthly trends), then it's useful to start by aggregating or down-sampling it to something more suitable and manageable, like daily averages. Then you can chunk the down-sampled data into useful frames of length $N$. For example, $N=7$ means 'use the last 7 days when making a prediction for the next month', whereas it would have been roughly $N=605{,}000$ if you chunked the raw per-second data directly, which is infeasible.
Sequence-to-vector net: the net would step through $N$ timepoints before rendering a prediction $\hat{y}$. $\hat{y}$ is then used to update the model during training, or used as-is during inference. It summarises (encodes) a sequence into a single output, and is also referred to as an encoder.
Sequence-to-sequence net: the net would step through $N$ timepoints, rendering a prediction per timepoint. When it's done stepping through the $N$ timepoints, you use some or all of the $N$ outputs to compute the loss for an update step.
Vector-to-sequence net: The net just sees a single sample, but is fed its output $N$ times over to build up a sequence of length $N$. It decodes or 'unpacks' a single input into a sequence, and is also known as a decoder.
1D CNN (not an RNN, but relevant): you present it with a list of $N$ points, and it will slide a smaller kernel over that frame, only looking at `kernel_length` points at a time.
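To illustrate the sequence-to-sequence vs. sequence-to-vector distinction concretely with PyTorch's built-in `nn.GRU` (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: frames of N=7 timepoints, 5 features, batch of 3.
N, batch_size, num_features, hidden = 7, 3, 5, 8
x = torch.randn(N, batch_size, num_features)

gru = nn.GRU(input_size=num_features, hidden_size=hidden)
outputs, h_n = gru(x)

# Sequence-to-sequence: one output per timepoint.
print(outputs.shape)  # torch.Size([7, 3, 8])
# Sequence-to-vector: keep only the final hidden state as the summary.
print(h_n.shape)      # torch.Size([1, 3, 8])
```

The same module supports both uses; the difference is whether you consume all $N$ outputs or only the final state.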
[...] can I see an example in torch?
In PyTorch, each sample going into the net (for which we want a prediction) is shaped $sequence~length \times num.~features$. You would have cut your original data into many $N$-sized frames, which would then usually get fed in groups of $batch~size$ at a time rather than individually.
Batching multiple samples typically results in an input shaped $sequence~length\times batch~size\times num.~features$.
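A minimal sketch of stacking length-$N$ frames into that layout (sizes are arbitrary):

```python
import torch

# Hypothetical: 100 frames, each N=7 timepoints with 5 features.
frames = [torch.randn(7, 5) for _ in range(100)]

# Stack the first 32 frames along dim=1: seq_len x batch_size x num_features.
batch = torch.stack(frames[:32], dim=1)
print(batch.shape)  # torch.Size([7, 32, 5])
```

(Passing `batch_first=True` to the recurrent module instead expects $batch~size \times sequence~length \times num.~features$.)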
But it sounds like there is a fixed number of N that dictates the lookback window, is that the case?
$N$ is usually a fixed-length segment (a frame), and the model makes a prediction for that frame. Note that an RNN doesn't look over all $N$ timepoints at the same time - they are fed in sequentially, so the RNN sees just one new timepoint at a time. It produces an output (hidden state) for that timepoint, which gets fed back into the net along with the next point in the frame.
There are cases where it doesn't make sense to force a constant length. Variable-length sequences can be handled using various padding and batching techniques.
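One common approach in PyTorch is `pad_sequence` plus `pack_padded_sequence` (toy sizes, assuming three variable-length inputs):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Hypothetical variable-length sequences, each with 2 features per timepoint.
seqs = [torch.randn(5, 2), torch.randn(3, 2), torch.randn(7, 2)]
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs)  # zero-pads to max_len x batch x features
print(padded.shape)          # torch.Size([7, 3, 2])

# Packing tells a recurrent module to skip the padded positions.
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
```

The packed batch can be fed straight into `nn.LSTM`/`nn.GRU`, so no single fixed $N$ is forced on the data.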
How is this $N$ set for an LSTM vs for GRU, for example?
They are usually limited to a sequence length $N$ of a few hundred steps at most. This is because each step takes the previous output and feeds it back in, and the repeated feedback compounds into distortion and gradient instability (vanishing/exploding gradients).
1D CNNs work by simply sliding a smaller kernel over the timepoints, so $N$ can be much larger.
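A quick `nn.Conv1d` sketch (arbitrary sizes) showing that the kernel, not $N$, sets how many points are seen at once:

```python
import torch
import torch.nn as nn

# Hypothetical: one sample, 1 input channel, N=1000 timepoints, kernel_length=5.
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=5)
x = torch.randn(1, 1, 1000)  # batch x channels x seq_len

y = conv(x)
print(y.shape)  # torch.Size([1, 4, 996]) -- one output per kernel position
```

There is no feedback loop, so gradients don't compound over the $N$ steps the way they do in an RNN.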
"which means that you choose a number of time steps $N$ and unroll your network so that it becomes a feedforward network made of $N$ duplicates of the original network".
Just so you know, an RNN has a feedback loop, so it is technically not a feedforward network [Ekman (Nvidia) 2024, p. 14 and 87 ff., in Ekman, M. 2022: Learning Deep Learning, Addison-Wesley (official Nvidia release)]. Here is a picture from one of my lectures.
Regarding your questions:
What is the meaning and origin of this number N ?
To give a practical example: if you want to forecast tomorrow's sales, and you have 2 input values - sales today and sales yesterday - you have two steps in time, which means you unroll your RNN 2 times.
Each time we unroll an RNN, we keep sharing the same weights and biases. This means we can unroll as many times as we like, and the number of weights and biases we train with backpropagation stays the same.
So $N$ comes from how far back in time you want to go for training. We just keep unrolling until we have an input for each day of data in this example.
So the origin of $N$ is, technically, the $N$ data points you want to use as a sequence for forecasting.
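A minimal PyTorch sketch of that weight sharing, using `nn.RNNCell` and a hand-written unroll loop (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: unrolling reuses the SAME cell (same weights) every step.
cell = nn.RNNCell(input_size=1, hidden_size=4)
x = torch.randn(2, 3, 1)    # 2 time steps (yesterday, today), batch of 3, 1 feature
h = torch.zeros(3, 4)       # initial hidden state

for t in range(x.size(0)):  # "unroll" N=2 times
    h = cell(x[t], h)       # identical weights and biases at each step

# The parameter count is independent of how far we unroll.
n_params = sum(p.numel() for p in cell.parameters())
print(h.shape, n_params)
```

Making the loop run over 2 or 200 timepoints changes the depth of the unrolled graph, but not the number of trained parameters.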
Is it some value you set when building the network, and if so, can I see an example in torch?
It will be done automatically by your choice of input data, if you do not interfere yourself. An LSTM is practically an RNN: it carries state from the previous input (a short-term memory and a long-term memory), just as a plain RNN carries over the output of the previous activation function. This may look like this:
The short-term memory and long-term memory (hidden state and cell state) are passed on to the next 'day'; see the function def forward. Further, you can see that on day 4 we use the memories (short- and long-term) that were modified through the other training days as input. Now imagine an RNN doing the same over that number of days, with the difference that you need to update the first weight of day 1 by going back over days 3, 2 and 1, and weight 2 by going over days 3 and 2, which then turns into the vanishing or exploding gradient problem. Here the LSTM updates itself along the way, per day, but an RNN does not do this along the way, which leads to a long chain of dot products - something like this from p. 153 of Starmer 2024: The StatQuest Illustrated Guide to Neural Networks and AI.

(Disclaimer: the code is by Josh Starmer from StatQuest, written with PyTorch and Lightning.) So the RNN needs to go back in time to update the weights, while the LSTM updates the weights along the way.
Or is it a feature of the training (optimization) algorithm?
No, it depends on the size/length of your input, as can be seen here with the days.
But it sounds like there is a fixed number of N that dictates the lookback window, is that the case?
Correct, $N$ is fixed, but by your input: a shorter input sequence means a smaller $N$ and less to unroll.
...past values gradually decay, but there's no sharp (discrete) window.
The weights in the RNN can be between -1 and 1, or beyond that range (like 2 or -4). Since you multiply by these weights once per step of your input length, the influence of the oldest data point will vanish, because the weights get multiplied in several times over the input sequence.
Let's imagine you have 4 days of training, like in the LSTM example, and want to forecast day 5. To update the first weight from day 1 you need to multiply 4 times back in time, and with many weights between -1 and 1 (e.g. $0.5 \times 0.5 = 0.25$), you end up with a very small value. Meaning day 1 has almost no influence on the prediction.
So past values do not gradually decay; rather, their influence in the forecasting process does, due to the small weights.
If the weights are bigger than 1, you instead get a very big contribution from day 1, meaning new data points have little influence and their power vanishes.
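The repeated-multiplication arithmetic above, as a toy calculation:

```python
# Toy arithmetic behind the vanishing/exploding intuition.
w = 0.5
print(w ** 4)  # 0.0625 -- after 4 steps back, day 1's signal is tiny

w = 2.0
print(w ** 4)  # 16.0 -- with |w| > 1 the old signal dominates instead
```

Real RNNs multiply by weight matrices rather than scalars, but the same compounding applies to the gradient magnitudes.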
