
I have a really long string. How can I efficiently identify the boundaries of a fixed token length in the text? For example:

text = "Quick silver brown fox jumped over the hedge"
token_window = 4 tokens

Assuming 1 token = 2 characters, the text with token window boundaries would be:

"Quic|k si|lver| bro|wn f|ox j|umpe|d ov|er t|he h|edge|"

Since there is no fixed relation between token length and character length, how can we do this efficiently?

Tokenizing the whole string in one go is too slow. Is there an alternative?

1 Answer


Since you know the length of the string, the token window, and the token length, you can mathematically determine some simple boundaries (half, quarter, etc.). That makes the problem easily parallelizable: do a binary division of the string (into halves, quarters, eighths, however deep you need to go to get "short enough" pieces), tokenize each piece in parallel, then rejoin the results.
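A minimal sketch of that idea in Python, assuming a placeholder whitespace tokenizer (`tokenize` here is hypothetical; you would swap in your real tokenizer). Split points are snapped forward to the nearest space so no token straddles two chunks:

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(chunk: str) -> list[str]:
    # Placeholder: splits on whitespace. Replace with your actual
    # (presumably slow) tokenizer.
    return chunk.split()


def parallel_tokenize(text: str, depth: int = 3) -> list[str]:
    """Binary-divide `text` into 2**depth chunks, tokenize each chunk
    in parallel, then rejoin the token lists in order."""
    n_chunks = 2 ** depth
    approx = len(text) / n_chunks

    # Mathematical boundaries, snapped forward to the next space so a
    # word is never cut in half between two chunks.
    bounds = [0]
    for i in range(1, n_chunks):
        pos = int(i * approx)
        while pos < len(text) and text[pos] != " ":
            pos += 1
        bounds.append(pos)
    bounds.append(len(text))

    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]

    # Threads suffice for a toy tokenizer; a CPU-bound pure-Python
    # tokenizer would want ProcessPoolExecutor instead.
    with ThreadPoolExecutor() as pool:
        results = pool.map(tokenize, chunks)

    return [tok for sub in results for tok in sub]
```

Because the chunk order is preserved by `map`, the flattened result matches what tokenizing the whole string at once would produce, as long as the split points fall on token boundaries.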
