
As part of my college project on RNNs, I'm working on a text classification task using the TensorFlow module. During training, I used pad_sequences with a max_length of 100, so all training examples were padded to length 100, as below:

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, maxlen=100, padding='post')

Now, while testing the model, I’ve encountered sentences that are longer than 100 words.

I need help with the following questions:

Is it a good idea to truncate test sequences to max_length=100? (I'm concerned that important context may be lost.)

Are there any better methods to deal with this situation?

  • When you asked this question on stats.SE, I pointed out that you can either do nothing and just process sequences without padding, or chunk the data by max_length, use pad_sequences, and loop over the chunks in the usual way. What part of your question is not answered? Commented Aug 1 at 8:40
  • The stats.SE post: stats.stackexchange.com/q/669170/232706 Commented Aug 1 at 13:58

1 Answer


tldr

Is it a good idea to truncate test sequences to max_length=100? (I'm concerned that important context may be lost.)

No. RNNs can handle variable-length sequences, so there does not seem to be any reason why you would need or want to truncate your input. Your concern about losing important context is valid, and ideally you should never need to truncate.

Are there any better methods to deal with this situation?

Yes. You can skip padding entirely, but then you'd need to process inputs one by one (or batch only inputs of the same length). Or, if you do pad, make sure to use a sufficient maximum sequence length. You can also do this per batch and pad to the length of the longest sequence in each batch.

more context

RNNs do not have a hard limit on sequence length. They process inputs token by token, so in theory they can handle sequences of arbitrary length, and no padding is strictly required.
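
For illustration, here is a minimal sketch of that idea in Keras. The vocabulary size, layer widths, and the length-137 example sequence are placeholder assumptions, not values from the question; the point is only that a model built without a fixed input length accepts sequences of any length:

    import numpy as np
    import tensorflow as tf

    VOCAB_SIZE = 10_000  # hypothetical vocabulary size, for illustration only

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 64),      # no fixed input length specified
        tf.keras.layers.LSTM(32),                        # the LSTM consumes however many timesteps it is given
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

    # A single sequence of length 137 (longer than the 100 used for training padding)
    # can be passed as a batch of one, with no padding or truncation.
    long_sequence = np.random.randint(1, VOCAB_SIZE, size=(1, 137))
    prediction = model(long_sequence)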

So why would we want to use padding at all?

  • Batching efficiency: Frameworks like TensorFlow and PyTorch require inputs in a batch to have the same shape.
  • GPU acceleration: Uniform tensor sizes allow for vectorized operations.

So you can either skip the padding step, since it's not technically required, or keep padding but make sure the maximum sequence length is long enough that no input gets truncated. The only argument against choosing a larger sequence length is that your tensors get bigger and cost more memory, but in most cases this isn't a concern, so feel free to make the maximum sequence length as large as needed.
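
For example, a simple way to guarantee nothing is truncated (assuming sequences is your list of token-ID lists, as in the question) is to derive the maximum length from the data itself:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Pad to the longest sequence actually present, so nothing is truncated.
    max_len = max(len(s) for s in sequences)
    padded = pad_sequences(sequences, maxlen=max_len, padding='post')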

You can also technically pad each batch during training (or testing) to the length of the longest sequence in that batch, as sketched below.
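
Here is a rough sketch of that per-batch approach. The names test_sequences and model, and the batch size, are assumptions for illustration; note that pad_sequences pads only up to the longest sequence in the list it is given when maxlen is left as None (the default):

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    BATCH_SIZE = 32  # hypothetical batch size

    for start in range(0, len(test_sequences), BATCH_SIZE):
        batch = test_sequences[start:start + BATCH_SIZE]
        # With maxlen=None, each batch is padded only to its own longest sequence.
        padded_batch = pad_sequences(batch, padding='post')
        predictions = model.predict(padded_batch, verbose=0)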

Sidenote: Investigate your data

This might be me being overzealous, but if your training data doesn't have any sequences longer than 100 while your testing data does, it could imply that there are some differences between the two sets.

These differences might not be an issue, or might not be something you can technically fix, but it's good to do some additional checks on your data to ensure that you are indeed training on data that is representative of the data you will encounter "in the wild".
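
One quick check along those lines (assuming train_sequences and test_sequences are your tokenized lists; the names are just placeholders) is to compare the length distributions of the two sets:

    import numpy as np

    for name, seqs in [("train", train_sequences), ("test", test_sequences)]:
        lengths = [len(s) for s in seqs]
        print(
            f"{name}: n={len(lengths)}, "
            f"median={np.median(lengths):.0f}, "
            f"95th pct={np.percentile(lengths, 95):.0f}, "
            f"max={max(lengths)}"
        )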
