I've been normalizing a dataset, and after tokenizing I've noticed that some records contain runs of words with the spaces between them missing.
e.g., "The quickbrown fox jumped over thelazydog."
I was thinking of running a spell checker over the tokens: if a word is flagged as misspelled, I could try to split the string into correctly spelled words. The problem I see with this approach is words like "microsoft", which would get broken up into "micro" and "soft".
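A rough sketch of the splitting idea I have in mind, as dictionary-based segmentation with dynamic programming (here `WORDS` is just a hypothetical stand-in for whatever lexicon the spell checker exposes, and `segment` is my own name for the helper):

```python
# Hypothetical lexicon; in practice this would come from the spell checker.
WORDS = {"the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog",
         "micro", "soft", "microsoft"}

def segment(token, words=WORDS):
    """Split `token` into dictionary words, or return None if impossible.

    Dynamic programming over end positions; among valid segmentations it
    prefers the one with the fewest pieces, so whole words beat splits.
    """
    n = len(token)
    best = [None] * (n + 1)   # best[i] = best segmentation of token[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

print(segment("thelazydog"))   # ['the', 'lazy', 'dog']
print(segment("quickbrown"))   # ['quick', 'brown']
print(segment("microsoft"))    # ['microsoft'] -- kept whole, fewest pieces wins
```

If "microsoft" were missing from the lexicon, `segment("microsoft")` would return `['micro', 'soft']`, which is exactly the bad split I'm worried about, so this approach seems to live or die by lexicon coverage.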
How would you deal with this dirty dataset? Would you leave these concatenated words in the vocabulary, remove all misspelled words, try to split them up, or replace misspelled words with a placeholder (e.g., "The * fox jumped over *")?