$\begingroup$

I've been normalizing a dataset and after tokenizing my words I've noticed that some records contain combinations of words where the spaces between them are missing.

For example: The quickbrown fox jumped over thelazydog.

I was thinking of using a spell checker: whenever a token is flagged as misspelled, I would check whether the string can be split into correctly spelled words. The problem I see with this approach is words like microsoft, which would be broken up into micro and soft.
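To make the idea concrete, here is a minimal sketch of the splitting step, assuming a plain set of known words stands in for the spell checker. `VOCAB` and `split_token` are hypothetical names, and the toy vocabulary is made up for illustration. A token that is already in the vocabulary is returned as-is, so adding microsoft to the word list is enough to stop it from being split.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; in practice this would come from a
# dictionary file or the spell checker's own word list.
VOCAB = {"the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog",
         "micro", "soft", "microsoft"}

def split_token(token, vocab=VOCAB):
    """Split `token` into known words.

    Returns a list of words, or None if no full split exists.
    Tokens that are already known words are never split.
    """
    word = token.lower()  # simplification: case is ignored
    if word in vocab:
        return [word]

    @lru_cache(maxsize=None)
    def solve(start):
        # Best split of word[start:], preferring the fewest pieces.
        if start == len(word):
            return ()
        best = None
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in vocab:
                rest = solve(end)
                if rest is not None:
                    candidate = (piece,) + rest
                    if best is None or len(candidate) < len(best):
                        best = candidate
        return best

    result = solve(0)
    return list(result) if result is not None else None

print(split_token("quickbrown"))
print(split_token("thelazydog"))
print(split_token("microsoft"))
```

Tokens for which this returns None could then be left alone or replaced with a placeholder.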

How would you deal with this dirty dataset? Would you leave these combined words in the vocabulary, remove all misspelled words, try to split them up, or replace misspelled words with a placeholder (e.g. The * fox jumped over *.)?

$\endgroup$
2
  • 1
    $\begingroup$ No solution will be 100% correct, but a quick way to get started is to look at all the spell checker’s proposed replacements & not make the replacements when it’s clearly wrong. Another aspect is that Microsoft to “micro soft” is also a change of case (M to m), and the phrase “micro soft” is much less likely to occur than the word “Microsoft.” This gives more details (case-sensitive, probability-awareness) to differentiate among the words. The best solution is to fix whatever the cause of the missing spaces is, but perhaps that’s infeasible. $\endgroup$ Commented Jul 6, 2022 at 12:52
  • 2
    $\begingroup$ Calculate the Levenshtein distances using the ' ' as an allowable character and a standard English dictionary for reference. If "Microsoft" needs to be a word, you need to include it in your dictionary. Still can't guarantee that microasoft won't be changed to "Microsoft" versus "micro soft" unless you modify weights. $\endgroup$ Commented Jul 6, 2022 at 13:06
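The probability-awareness point from the first comment can be sketched roughly as follows: score every possible segmentation by the sum of log word probabilities and keep the highest-scoring one. The word counts below are invented for illustration, `COUNTS` and `segment` are hypothetical names, and real counts would have to come from a corpus. The sketch also lowercases everything, so it ignores the case signal mentioned in that comment.

```python
import math

# Hypothetical word counts, made up for illustration only.
COUNTS = {"the": 500, "lazy": 20, "dog": 40, "micro": 5, "soft": 30,
          "microsoft": 60, "quick": 25, "brown": 35}
TOTAL = sum(COUNTS.values())

def log_prob(word):
    # Unigram log probability of a known word.
    return math.log(COUNTS[word] / TOTAL)

def segment(text, max_len=20):
    """Return the most probable split of `text` into known words,
    or None if the text cannot be covered by known words."""
    text = text.lower()
    n = len(text)
    # best[i] holds (score, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in COUNTS and best[j][1] is not None:
                score = best[j][0] + log_prob(piece)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1]

print(segment("microsoft"))
print(segment("thelazydog"))
```

With these counts, "microsoft" as a single word outscores "micro" + "soft", because one reasonably frequent word beats the product of two smaller probabilities.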
