I have a very long string. How can I efficiently identify the boundaries of fixed-size token windows in the text? For example:

text = "Quick silver brown fox jumped over the hedge"
token_window = 4 tokens

Assuming 1 token = 2 characters, each window spans 8 characters, so the text with window boundaries would be:

"Quick si|lver bro|wn fox j|umped ov|er the h|edge"
In reality there is no fixed relationship between token count and character length, so how can this be done efficiently?
Tokenizing the entire string in one go is too slow. Is there an alternative?
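For reference, here is a minimal sketch of the kind of thing I mean: tokenize lazily and record a character offset after every `token_window`-th token, so the full token list is never materialized. It uses a whitespace tokenizer as a stand-in (an assumption for illustration; a real subword tokenizer would give different boundaries), and `iter_token_ends` / `window_boundaries` are hypothetical helper names:

```python
import re
from typing import Iterator, List

def iter_token_ends(text: str) -> Iterator[int]:
    # Stand-in tokenizer (assumption): a token is a run of
    # non-whitespace characters. A real tokenizer's character
    # offsets would be substituted here.
    for match in re.finditer(r"\S+", text):
        yield match.end()

def window_boundaries(text: str, token_window: int) -> List[int]:
    # Scan the string once, lazily: record the character offset
    # at which every `token_window`-th token ends, without ever
    # building the full token list in memory.
    cuts = []
    for count, end in enumerate(iter_token_ends(text), start=1):
        if count % token_window == 0:
            cuts.append(end)
    return cuts

text = "Quick silver brown fox jumped over the hedge"
cuts = window_boundaries(text, token_window=4)

# Slice the text at the recorded boundaries to get the windows.
edges = [0] + cuts
if not cuts or cuts[-1] < len(text):
    edges.append(len(text))
chunks = [text[a:b].strip() for a, b in zip(edges, edges[1:])]
```

For a real subword tokenizer, the same idea should work if the tokenizer reports character offsets (e.g., Hugging Face fast tokenizers can return them via `return_offsets_mapping=True`), which could feed `iter_token_ends` in place of the regex.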