**TL;DR**
Is it a good idea to truncate test sequences to max_length=100? (I'm concerned that important context may be lost.)
No. RNNs can handle variable-length sequences, so there does not seem to be any reason why you would need or want to truncate your input. Your concern about losing important context is valid; ideally, you should never need to truncate.
Are there any better methods to deal with this situation?
Yes. You can skip padding entirely, but then you'd need to process inputs one by one (or batch only inputs of the same length). Alternatively, if you do pad, make sure to use a sufficient maximum sequence length. You can even do this per batch, padding to the length of the longest sequence in each batch.
**More context**
RNNs do not have a hard limit on sequence length. They process inputs token by token, so in principle they can handle sequences of arbitrary length; no padding is strictly required.
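To illustrate the token-by-token processing, here is a minimal sketch with a toy scalar RNN cell (the weights and the `tanh` recurrence are illustrative, not from any particular framework):

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # One recurrence step: the new hidden state depends only on the
    # previous hidden state and the current token.
    return math.tanh(w_h * h + w_x * x)

def run_rnn(sequence):
    # Process tokens one by one; nothing in the loop constrains the
    # sequence length, so no padding or truncation is needed.
    h = 0.0
    for x in sequence:
        h = rnn_step(h, x)
    return h

# Sequences of different lengths work without any padding.
h_short = run_rnn([0.1, 0.2])
h_long = run_rnn([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
```

The same recurrence runs for 2 steps or 7 steps; the length limit only appears once you stack sequences into a fixed-shape tensor for batching.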
So why would we want to use padding at all?
- Batching efficiency: Frameworks like TensorFlow and PyTorch require inputs in a batch to have the same shape.
- GPU acceleration: Uniform tensor sizes allow for vectorized operations.
So you can either skip the padding step, since it's not technically required, or keep padding but make sure the maximum sequence length is long enough that no input gets truncated. The only argument against a larger maximum sequence length is memory: bigger tensors cost more. In most cases this isn't a concern, so feel free to make the maximum sequence length as large as needed.
You can also pad each batch during training (or testing) to the length of the longest sequence in that batch, rather than to one global maximum.
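A minimal sketch of this per-batch ("dynamic") padding, assuming sequences are lists of integer token IDs and 0 is the padding ID (both assumptions for illustration; frameworks provide equivalents such as `torch.nn.utils.rnn.pad_sequence` in PyTorch):

```python
def pad_batch(batch, pad_value=0):
    # Pad every sequence in the batch to the length of the
    # longest sequence in that batch, not a global max_length.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]

batch = [[5, 3, 8], [1], [4, 2]]
padded = pad_batch(batch)
# padded == [[5, 3, 8], [1, 0, 0], [4, 2, 0]]
```

A batch of short sequences then stays small, while a batch containing one long sequence pads only up to that sequence's length; nothing is ever truncated.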
**Sidenote: Investigate your data**
This might be me being overzealous, but if your training data has no sequences shorter than 100 while your test data does, that could indicate a difference between the two sets.
Such differences might not be a problem, or even something you can fix, but it's worth running some additional checks to ensure that you are indeed training on data that is representative of the data you will encounter "in the wild".
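One simple check along these lines is to compare the sequence-length distributions of the two sets (the example data below is made up purely for illustration):

```python
def length_stats(sequences):
    # Summarize sequence lengths so train and test sets can be compared.
    lengths = sorted(len(s) for s in sequences)
    n = len(lengths)
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "median": lengths[n // 2],
    }

# Toy stand-ins for real datasets: sequences of the given lengths.
train = [[0] * n for n in (120, 150, 300)]
test = [[0] * n for n in (40, 90, 250)]
# Comparing length_stats(train) with length_stats(test) can reveal a
# mismatch, e.g. the test set containing sequences shorter than
# anything seen during training.
```

If the summaries differ substantially, it's worth digging into why before trusting your evaluation numbers.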