Text Data: Preprocessing

Yugant Hadiyal
3 min read · Dec 21, 2018

Clean it before you use it ;)

Nowadays, we see chatbots on roughly every popular website. The trend has become so mainstream that several tech giants have released APIs for building chatbots. These services mainly include Speech to Text, Text to Speech, Natural Language Processing and other processes.

(Image by Marketeer.co)

A layman's definition of NLP:

“The branch of computer science where engineers and scientists develop algorithms and software to make computers able to understand human languages, the most widely used of which is English.”

Why is it so hard to make this a reality? The answer is that computers only understand binary. Then how do all these chatbots work? Because after years of research and development, we have made real progress in this field. Computers can now detect emotions, parse grammar and even guess what you are going to type next.

That’s not from a sci-fi movie that is today’s reality. We all use Google for anything we are thinking about. We just type in and hit enter and we are surfing. Look at the below image.

The example is exaggerated, but we do make spelling mistakes, and Google simply asks ‘Did you mean:’; we either click the suggestion or ignore it. This is the most common use of NLP in our day-to-day lives. Autocorrect on your smartphone's keyboard is another. One more example is Medium showing how long an article will take to read. I hope NLP is clear now. Let's dive deeper.

There are numerous things to do before we actually feed text data to our ML models. We have to make sure the data we give the model does not contain misleading information. Even for TF-IDF, we have to remove numbers, symbols, stop words and so on; that makes the TF-IDF representation far more meaningful.
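As a sketch of that cleaning step (the stop-word list here is a tiny illustrative sample; a real one, such as NLTK's, is much longer):

```python
import re

# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "the", "this", "is", "and", "of", "to", "in"}

def clean_for_tfidf(text):
    """Lowercase, strip numbers and symbols, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_for_tfidf("The 2 quick-brown foxes, and this 1 lazy dog!"))
# → ['quick', 'brown', 'foxes', 'lazy', 'dog']
```

Every number, symbol and stop word is gone before TF-IDF ever sees the text.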

Take a passage of, say, 1,000 words. It will contain many unnecessary words that demand some cleaning. For instance, if you count the repetitions of each word in the passage, the most repeated words will likely be ‘a’, ‘this’, ‘the’ and somebody's name.
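Counting those repetitions is a one-liner with the standard library, e.g.:

```python
from collections import Counter

passage = "the cat and the dog saw the bird a bird saw a cat"
counts = Counter(passage.split())  # word -> number of occurrences

print(counts.most_common(1))  # → [('the', 3)]
```

Filler words like ‘the’ dominate the counts, which is exactly why they need to go.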

Therefore we need to remove them. For that, you can use the textcleaner Python library. It has a main_cleaner module that performs all the basic operations needed on text data. As a result, you get a clean dataset ready for crunching.
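To see what such an all-in-one cleaner does, here is a rough standard-library sketch of the same steps (an illustration of the idea only, not textcleaner's actual implementation):

```python
import re

STOP_WORDS = {"a", "an", "the", "this", "that", "is", "are", "and", "or"}

def main_cleaner_sketch(lines):
    """Stand-in for an all-in-one cleaner: drops blank lines,
    lowercases, strips punctuation/numbers, and removes stop words."""
    cleaned = []
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        tokens = re.findall(r"[a-z]+", line.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if tokens:
            cleaned.append(tokens)
    return cleaned

doc = ["", "The cat sat.", "   ", "Dogs and cats 123"]
print(main_cleaner_sketch(doc))
# → [['cat', 'sat'], ['dogs', 'cats']]
```

The real library bundles these steps (and more) behind one call so you don't rewrite them for every project.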

I would like to mention some issues that people in the NLP field face often:

  • blank lines in the data
  • full stops and newline characters [‘\n’]
  • spelling mistakes
  • stop words
  • punctuation and numbers
  • rare words

Some routine processes:

  • lowercasing all the data
  • counting words
  • tokenizing
  • stemming and lemmatization
  • TF-IDF
  • BoW
  • Word Embedding
  • generating a dictionary

All these things were so repetitive that a Python library was needed to ease the work.
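Several of the routine steps above, for example lowercasing, tokenizing, generating a dictionary and producing a Bag-of-Words vector, can be sketched with the standard library alone:

```python
docs = ["NLP is fun", "NLP is hard but fun"]

# lowercase + tokenize (whitespace split is the simplest tokenizer)
tokenized = [d.lower().split() for d in docs]

# the "dictionary": every unique word, with a fixed index
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def bow_vector(tokens):
    """Count of each vocabulary word in one document."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[index[t]] += 1
    return vec

print(vocab)                     # → ['but', 'fun', 'hard', 'is', 'nlp']
print(bow_vector(tokenized[1]))  # → [1, 1, 1, 1, 1]
```

Stemming, lemmatization and TF-IDF follow the same spirit but need a bit more machinery (NLTK or scikit-learn handle them well).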

More advanced preprocessing methods include sentiment analysis, aspect analysis, Word2Vec, Sent2Vec, n-grams, skip-grams and many more. Though the mentioned library does not contain all of these methods at this stage, it can do the basic operations, which helps reduce the pain.
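As a taste of one of these, a minimal n-gram generator:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("clean your text before training".split(), 2))
# → [('clean', 'your'), ('your', 'text'), ('text', 'before'), ('before', 'training')]
```

Bigrams and trigrams like these let models capture short phrases instead of isolated words.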

Thanks for your time. Visit the library's repo.
