NLP - How to deal with a dataset where some spaces between words are missing

Asked 3 years, 4 months ago

Modified 3 years, 4 months ago

Viewed 422 times

I've been normalizing a dataset and, after tokenizing it, I noticed that some records contain runs of words where the spaces between them are missing.

e.g. The quickbrown fox jumped over thelazydog.

I was thinking about trying to work with a spell checker where if I find that the word is misspelled then I could try to figure out if there is a way to split up the string into correctly spelled words. The problem I see with this approach is with words like microsoft, which would be broken up into micro and soft.
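A minimal sketch of that splitting idea, assuming a tiny hypothetical vocabulary (`VOCAB`) in place of a real spell checker's word list. The name `split_into_words` is only illustrative and does not come from any library.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real one would come from a spell checker's word list.
VOCAB = {"the", "quick", "brown", "fox", "lazy", "dog", "micro", "soft"}


def split_into_words(token, vocab=VOCAB):
    """Return one way to split `token` into vocabulary words, or None if impossible."""
    token = token.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the token means everything before it was covered.
        if start == len(token):
            return ()
        # Try longer prefixes first, so the longest known word is preferred.
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if piece in vocab:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


print(split_into_words("quickbrown"))
print(split_into_words("thelazydog"))
print(split_into_words("microsoft"))
```

With this vocabulary the last call splits `microsoft` into `micro` and `soft`, which is exactly the failure I'm worried about: unless `microsoft` itself is in the word list, a pure dictionary split cannot tell it apart from a real run-together.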

How would you deal with this dirty dataset? Would you leave these combined words as part of the vocabulary, remove all misspelled words, try to split them up, or replace misspelled words with a placeholder (e.g. "The * fox jumped over *.")?

asked Jul 6, 2022 at 12:33

Tolure



No solution will be 100% correct, but a quick way to get started is to look at all of the spell checker’s proposed replacements and skip the ones that are clearly wrong. Another aspect is that changing Microsoft to “micro soft” is also a change of case (M to m), and the phrase “micro soft” is much less likely to occur than the word “Microsoft.” Case sensitivity and probability awareness therefore give you more information for telling such words apart. The best solution is to fix whatever causes the missing spaces, but perhaps that’s infeasible.

– Sycorax ♦
Commented Jul 6, 2022 at 12:52
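To illustrate the probability-awareness point in this comment, here is a rough sketch of a unigram-based segmenter. The counts in `COUNTS` are made up for illustration (real ones would come from a large corpus), the unknown-word penalty is one possible choice rather than a standard, and case is ignored for simplicity.

```python
import math

# Hypothetical unigram counts, invented for this example.
COUNTS = {
    "the": 5000, "quick": 120, "brown": 150, "fox": 60, "lazy": 40,
    "dog": 200, "micro": 5, "soft": 80, "microsoft": 300,
}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty for unknown words: gets harsher as the word gets longer.
    return math.log(10 / TOTAL) - len(word) * math.log(10)


def segment(token, max_word_len=20):
    """Split `token` into the word sequence with the highest total log-probability."""
    token = token.lower()
    n = len(token)
    # best[i] = (score of the best segmentation of token[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + log_prob(token[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]


print(segment("quickbrown"))
print(segment("microsoft"))
```

Because the toy counts make `microsoft` far more frequent than `micro` followed by `soft`, it stays whole, while `quickbrown` is still split. How well this works in practice depends entirely on the quality of the counts.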

Calculate Levenshtein distances against a standard English dictionary, treating the space ' ' as an allowable character. If "Microsoft" needs to be a word, you need to include it in your dictionary. You still can't guarantee that microasoft will be changed to "Microsoft" rather than "micro soft" unless you modify the weights.

– AdamO
Commented Jul 6, 2022 at 13:06
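A small sketch of this suggestion, assuming a hypothetical five-word `DICTIONARY` and only one- and two-word candidate phrases. The edit distance is the plain unweighted Levenshtein distance, so a space counts like any other character; `candidates` is an illustrative helper, not a library function.

```python
def levenshtein(a, b):
    """Unweighted edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical reference dictionary; note that "microsoft" has to be listed explicitly.
DICTIONARY = ["micro", "soft", "microsoft", "quick", "brown"]


def candidates(token, dictionary=DICTIONARY):
    """Return all one- or two-word phrases at minimum edit distance from `token`."""
    phrases = list(dictionary) + [f"{a} {b}" for a in dictionary for b in dictionary]
    scored = sorted((levenshtein(token, p), p) for p in phrases)
    best = scored[0][0]
    return [p for d, p in scored if d == best]


print(candidates("quickbrown"))
print(candidates("microasoft"))
```

For `microasoft` both "micro soft" (substitute the `a` with a space) and "microsoft" (delete the `a`) are at distance 1, so the unweighted distance cannot choose between them. Breaking that tie is where modified weights, or word frequencies, would come in.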
