
I have a really long string. How can I efficiently identify the boundaries of a fixed token length in the text? For example:

text = "Quick silver brown fox jumped over the hedge"
token_window = 4 tokens

Assuming 1 token = 2 characters, the text with token window boundaries would be:

"Quic|k si|lver| bro|wn f|ox j|umpe|d ov|er t|he h|edge|"

Since there is no fixed relation between token length and character length, how can we do this efficiently?

Tokenizing the whole string in one go is too slow. Is there an alternative?

1 Answer


Since you know the length of the string, the token window, and the token length, you can mathematically determine some simple boundaries (half, quarter, etc.). That makes the problem easily parallelizable: do a binary division of the string (into halves, quarters, eighths, however deep you need to go to get "short enough" pieces), tokenize each piece in parallel, then rejoin the results.
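A minimal sketch of that idea in Python, assuming a placeholder whitespace tokenizer (`tokenize` here is hypothetical; you would swap in your real tokenizer). Split points are snapped forward to the nearest space so no token straddles two chunks:

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(chunk: str) -> list[str]:
    # Placeholder: splits on whitespace. Replace with your actual
    # (presumably slow) tokenizer.
    return chunk.split()


def parallel_tokenize(text: str, depth: int = 3) -> list[str]:
    """Binary-divide `text` into 2**depth chunks, tokenize each chunk
    in parallel, then rejoin the token lists in order."""
    n_chunks = 2 ** depth
    approx = len(text) / n_chunks

    # Mathematical boundaries, snapped forward to the next space so a
    # word is never cut in half between two chunks.
    bounds = [0]
    for i in range(1, n_chunks):
        pos = int(i * approx)
        while pos < len(text) and text[pos] != " ":
            pos += 1
        bounds.append(pos)
    bounds.append(len(text))

    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]

    # Threads suffice for a toy tokenizer; a CPU-bound pure-Python
    # tokenizer would want ProcessPoolExecutor instead.
    with ThreadPoolExecutor() as pool:
        results = pool.map(tokenize, chunks)

    return [tok for sub in results for tok in sub]
```

Because the chunk order is preserved by `map`, the flattened result matches what tokenizing the whole string at once would produce, as long as the split points fall on token boundaries.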
