Textcleaner: a data pre-processing library

Yugant Hadiyal
3 min read · Dec 19, 2018

It is hassle-free and saves time.

What led you to this article? Is it NLP or anything related to text data? If the answer is yes, then dive in.

In the field of data science, we often hear that most data scientists spend their time only cleaning data, and people think this is the core of the whole learning curve. My first reaction was, “That’s not true; people are working hard to achieve something rather than just cleaning data.” But on reflection, I found that it is not totally wrong. People do lose a lot of time on these things.

In my free time, I started thinking about some projects I had done for academic purposes, where I spent a considerable amount of time on these things. Similarly, in my pursuit of becoming a successful data scientist, I often face the same kind of issues.

So this time at my console I was not only cleaning data but building a library that can serve developers all over the world. I decided to solve these problems and make the solution available to everyone.

Thanks to Medium, I found an article on how to do this. I was totally unaware of the process of listing a Python module on the Python Package Index (PyPI). Have you ever thought about how it feels when you type “pip install <your_module_name>” into your command prompt and it works? Sounds thrilling, right?

Here is the article, which will help you do the same if you are interested.

Let’s talk about this utility. I have written several functions that you can use to clean your text file. I recommend using main_cleaner() because it combines all the typical cleaning methods. There are two types of methods available: primary and advanced. Primary functions handle tasks like removing blank lines, numbers, and symbols from the text. Advanced functions cover stemming, lemmatization, stop word removal, and so on.

textcleaner contains a total of 10 methods, which are described below:

Follow this link for usage and installation: link

0. main_cleaner(<FILE_NAME>,op=<options>)

Pass a text file to this function and it will return a list of lists of words, with the words grouped per sentence. If you just want all the words, set ‘op’ to ‘words’ and it will return a flat list. The default value of ‘op’ is ‘sents’.
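To make the two return shapes concrete, here is a minimal sketch in plain Python. It does not call textcleaner: clean_text is a hypothetical stand-in that takes a string instead of a file name, and its rules (one sentence per non-blank line, words made of letters only) are assumptions chosen for illustration.

```python
import re

def clean_text(text, op="sents"):
    """Hypothetical stand-in for main_cleaner (not the library's code)."""
    sentences = []
    # Treat each non-blank line as one sentence (an assumption of this sketch).
    for line in text.splitlines():
        # Lowercase the line, then keep only runs of letters as words.
        words = re.findall(r"[a-z]+", line.lower())
        if words:
            sentences.append(words)
    if op == "words":
        # Flatten the per-sentence groups into a single list of words.
        return [word for sentence in sentences for word in sentence]
    return sentences

sample = "Your dog is a husky.\n\nIt has 2 blue eyes!\n"
print(clean_text(sample))              # words grouped per sentence
print(clean_text(sample, op="words"))  # one flat list of words
```

With ‘sents’ you keep sentence boundaries, which matters for anything that works sentence by sentence; with ‘words’ you get a single list that is easier to count or index.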

Primary

1. clear_blank_lines(<FILE_NAME>)

This function accepts a file name as an argument and returns a list of lists, removing unnecessary blank lines from the text along the way.
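The idea of dropping blank lines can be sketched in a few lines of plain Python. This is not the library's code: drop_blank_lines is a hypothetical helper, and for simplicity it works on a string and returns a flat list of lines rather than reading a file.

```python
def drop_blank_lines(text):
    """Return the non-blank lines of `text` (hypothetical helper)."""
    # A line counts as blank if nothing is left after stripping whitespace.
    return [line for line in text.splitlines() if line.strip()]

raw = "first line\n\n   \nsecond line\n"
print(drop_blank_lines(raw))
```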

2. strip_all(<list>)

Strips characters from every line of the text. The default value is ‘.\n’, which means it removes full stops and newlines from the end of every line of the passage.
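As a rough illustration of what stripping with ‘.\n’ does, here is a sketch built on Python's str.strip. The helper name strip_each and its chars parameter are assumptions of mine, and str.strip trims both ends of a line, which may differ from what the library does.

```python
def strip_each(lines, chars=".\n"):
    """Strip the given characters from the ends of every line (hypothetical helper)."""
    # str.strip removes any of the listed characters from both ends only,
    # so a full stop in the middle of a line is left alone.
    return [line.strip(chars) for line in lines]

lines = ["Your dog is a husky.\n", "It is 3.5 years old.\n"]
print(strip_each(lines))
```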

3. lower_all(<list>)

It is important to convert the text to lowercase because, when you later compare words from the data, the comparison will not work as you expect due to case sensitivity.
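The case-sensitivity problem is easy to see in plain Python. The helper lower_each below is hypothetical and only illustrates the idea.

```python
def lower_each(lines):
    """Lowercase every line (hypothetical helper)."""
    return [line.lower() for line in lines]

# Without lowercasing, the same word in a different case does not match.
print("Husky" == "husky")
# After lowercasing, both spellings compare equal.
a, b = lower_each(["Husky", "husky"])
print(a == b)
```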

4. remove_numbers(<list>), remove_symbols(<list>)

Removes numbers and punctuation from anywhere in the text it is fed. It uses regular expressions to achieve this.
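Here is one way such regular expressions could look, using Python's re module. The patterns and the helper names drop_numbers and drop_symbols are my assumptions; the library's actual patterns may differ.

```python
import re

def drop_numbers(lines):
    """Remove every run of digits from each line (hypothetical helper)."""
    # \d+ matches one or more consecutive digits.
    return [re.sub(r"\d+", "", line) for line in lines]

def drop_symbols(lines):
    """Remove punctuation and other symbols from each line (hypothetical helper)."""
    # [^\w\s] matches anything that is not a word character or whitespace.
    # The underscore counts as a word character, so it is matched separately.
    return [re.sub(r"[^\w\s]|_", "", line) for line in lines]

lines = ["Call me at 555-0199!", "Price: $20 (approx.)"]
print(drop_symbols(drop_numbers(lines)))
```

Note that removing characters can leave extra spaces behind, so a whitespace clean-up step is usually worth adding afterwards.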

Advanced

5. remove_stopwords(<list>)

Suppose you are given text data that is fairly large, say 5 MB. You don’t want to pass all of it from one function to another, so it helps to remove the words that carry little meaning.

For example, in “Your dog is a husky”, the useful entities are ‘dog’ and ‘husky’.

To extract the meaningful words from a sentence, it is better to remove stop words like ‘is’, ‘a’, ‘the’, etc. and use the remaining words.
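The husky example can be reproduced with a small filter. The stop word set below is a tiny hand-picked list for illustration only, and drop_stop_words is a hypothetical helper, not the library's function.

```python
# A tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"your", "is", "a", "an", "the", "of", "and", "to", "in"}

def drop_stop_words(words):
    """Keep only the words that are not in STOP_WORDS (hypothetical helper)."""
    # Compare in lowercase so that "Your" and "your" are treated the same.
    return [word for word in words if word.lower() not in STOP_WORDS]

print(drop_stop_words(["Your", "dog", "is", "a", "husky"]))
```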

6. token_it(<list>)

It is a basic function but still an important one. token_it is a tokenizer that separates the words in the given data and returns them as a list, like dividing the whole text into small pieces of information.
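A very simple tokenizer can be sketched with a regular expression. This is an assumption about one reasonable approach, not the library's implementation, and tokenize is a hypothetical name.

```python
import re

def tokenize(lines):
    """Split each line into a list of word tokens (hypothetical helper)."""
    # \w+ grabs runs of letters, digits, and underscores; punctuation is dropped.
    return [re.findall(r"\w+", line) for line in lines]

print(tokenize(["Your dog is a husky.", "Dogs love snow!"]))
```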

7. stemming(<list>)

This function converts each word to its stem, for example ‘dogs’ to ‘dog’.
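To show the idea, here is a deliberately naive suffix-stripping stemmer. It is a toy: real stemmers such as the Porter stemmer apply many more rules, and nothing here reflects how the library itself stems words. The suffix list and the minimum stem length are assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip one common English suffix from `word` (toy example)."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain,
        # so short words like "is" are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["dogs", "jumped", "barking", "is"]])
```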

8. lemming(<list>)

Well, it’s the name of an animal. I named it lemming instead of lemmatization because it is easier to remember. In lemmatization, each word is converted to its root word, for example ‘done’ to ‘do’ and ‘these’ to ‘this’.
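Unlike stemming, lemmatization cannot be done by chopping off suffixes, because irregular forms like ‘done’ need a dictionary. The sketch below uses a tiny hand-written lookup table to show the principle; the table and the helper lookup_lemma are assumptions for illustration, not what the library uses.

```python
# A tiny hand-written lemma table, for illustration only.
LEMMAS = {"done": "do", "did": "do", "these": "this", "went": "go", "dogs": "dog"}

def lookup_lemma(word):
    """Return the root form of `word` if it is in LEMMAS, else the word itself."""
    # Look up the lowercase form; unknown words are returned unchanged.
    return LEMMAS.get(word.lower(), word)

print([lookup_lemma(w) for w in ["done", "these", "husky"]])
```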

I hope this helps you. This library is licensed under MIT, so you can use it freely.

I invite people to contribute to this mini project or to send suggestions.

Thanks for your time.
