10 Text Cleaning Techniques You Need in Python

Python is a popular language for text analytics. Raw text usually contains unwanted or hidden noise, and you need to remove it before you analyze the data.


The list of techniques for cleaning the data:

  1. Removing HTML tags
  2. Tokenization
  3. Removing unnecessary tokens and stopwords
  4. Handling contractions
  5. Correcting spelling errors
  6. Stemming
  7. Lemmatization
  8. Tagging
  9. Chunking
  10. Parsing

The First Steps in Text Analytics with Python


Data is the prime source for analytics projects. Once cleaned, the data is fed to intelligent systems such as machine learning and deep learning models.

1. Removing HTML tags

The unstructured text contains a lot of noise, especially if you use techniques like web scraping or screen scraping to retrieve data from web pages, blogs, and online repositories. HTML tags, JavaScript, and Iframe tags typically don't add much value to understanding and analyzing text.

The goal is to remove HTML tags and other noise before you analyze the text.
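As a minimal sketch of this step, the snippet below strips tags using only the standard library's `html.parser`. The names `TagStripper` and `strip_html` are hypothetical helpers made up for this example, and dropping the contents of `<script>` and `<style>` is an assumption about what counts as noise.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content and skips everything inside <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(raw_html):
    stripper = TagStripper()
    stripper.feed(raw_html)
    stripper.close()
    # Collapse the whitespace left behind by the removed tags.
    return " ".join("".join(stripper.parts).split())


sample = "<p>Hello <b>world</b>!</p><script>var x = 1;</script>"
print(strip_html(sample))  # Hello world!
```

For messy real-world pages, a dedicated HTML library is usually a safer choice than a hand-written parser like this one.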

2. Tokenization

Tokens are independent and minimal textual components that have some definite syntax and semantics. A paragraph of text or a text document has several components, including sentences, which can be further broken down into clauses, phrases, and words. 

The most popular tokenization techniques include sentence and word tokenization, which are used to break down a text document (or corpus) into sentences and each sentence into words. 

Thus, tokenization can be defined as the process of breaking down or splitting textual data into smaller and more meaningful components called tokens.
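Here is a rough sketch of sentence and word tokenization with plain regular expressions. `split_sentences` and `split_words` are hypothetical helper names, and the splitting rules are simplifying assumptions: they will break on abbreviations such as "Dr." and are not a replacement for a real tokenizer.

```python
import re


def split_sentences(text):
    # Split after ., ! or ? when the mark is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def split_words(sentence):
    # A token is a word (optionally with an internal apostrophe part)
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)


text = "Python is popular. It's used for text analytics!"
for sentence in split_sentences(text):
    print(split_words(sentence))
# ['Python', 'is', 'popular', '.']
# ["It's", 'used', 'for', 'text', 'analytics', '!']
```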



3. Removing unnecessary tokens and stopwords

Stopwords are words that have little or no significance and are usually removed from a text when processing it so as to retain words having maximum significance and context. 

If you count how often each token appears in a corpus, stopwords usually turn out to be the most frequent ones. Words like "a," "the," and "and" are stopwords.
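A small sketch of the idea: count the tokens to see which ones dominate, then filter them out. The `STOPWORDS` set here is a tiny hand-picked list assumed for illustration. Real projects normally use a much longer list from an NLP library.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "of", "to", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so "The" and "the" are treated the same.
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["The", "cat", "and", "the", "dog", "sat", "in", "a", "box"]

counts = Counter(t.lower() for t in tokens)
print(counts.most_common(1))     # [('the', 2)]
print(remove_stopwords(tokens))  # ['cat', 'dog', 'sat', 'box']
```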


4. Handling contractions

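Contractions are shortened forms of words, such as "you'll" (you will) and "it's" (it is). Handling them means expanding each contraction back to its full form.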
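One simple way to handle contractions is a lookup table, sketched below. The `CONTRACTIONS` mapping is a small assumed sample, not a complete list, and some entries are ambiguous in real text ("it's" can also mean "it has").

```python
import re

# Hypothetical sample mapping; extend it for your own data.
CONTRACTIONS = {
    "you'll": "you will",
    "it's": "it is",  # could also be "it has" depending on context
    "don't": "do not",
    "can't": "cannot",
}

PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital letter ("It's" -> "It is").
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return PATTERN.sub(replace, text)


print(expand_contractions("It's late and you'll miss the bus."))
# It is late and you will miss the bus.
```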


5. Correcting spelling errors

This step corrects misspelled words automatically. You see the same idea in a Google search, which corrects your spelling for you.
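As a rough illustration, you can pick the closest word from a known vocabulary with the standard library's `difflib.get_close_matches`. The `VOCABULARY` list, the `correct_word` helper, and the `0.75` cutoff are all assumptions made for this sketch. It is nowhere near what a search engine does.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
VOCABULARY = ["python", "text", "analytics", "cleaning", "token", "spelling"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    # Return the most similar vocabulary word, or the word itself
    # when nothing is similar enough.
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("analitics"))  # analytics
print(correct_word("xyz"))        # xyz
```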


6. Stemming

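Stemming reduces a word to its root form (the stem) by stripping suffixes. For example, "jumping," "jumped," and "jumps" all reduce to the stem "jump." Note that Snowball is the name of a well-known family of stemming algorithms, not a word that gets split into "Snow" and "Ball."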
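To make the idea concrete, here is a toy suffix stripper. It is not the Porter or Snowball algorithm. The `SUFFIXES` tuple and the minimum stem length of three characters are assumptions chosen only to keep the sketch short.

```python
# Suffixes are checked in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")


def simple_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "running"]:
    print(w, "->", simple_stem(w))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# running -> runn
```

The last line shows a typical property of stemming: the stem ("runn") does not have to be a real word.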


7. Lemmatization

Lemmatization also reduces words to a root form, but it uses the context of the word (such as its part of speech) so that the result, called the lemma, is always a meaningful word.
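The sketch below shows why context matters, using a lookup keyed by word and part of speech. The `LEMMAS` table and the `lemmatize` helper are hypothetical and hold only a few hand-written entries. Real lemmatizers rely on a full dictionary.

```python
# Hypothetical lookup: (word, part of speech) -> lemma.
LEMMAS = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("running", "verb"): "run",
    ("mice", "noun"): "mouse",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}


def lemmatize(word, pos):
    # Fall back to the lowercased word when there is no entry.
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("meeting", "verb"))  # meet
print(lemmatize("meeting", "noun"))  # meeting
print(lemmatize("Mice", "noun"))     # mouse
```

The same surface word ("meeting") gets a different lemma depending on whether it is used as a verb or a noun.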

8. Tagging

Tagging assigns each word a tag, such as its part of speech, so that words of the same kind are grouped under one tag.
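A minimal sketch of tagging with a lookup table follows. The `TAG_LEXICON` dictionary, the tag names, and the choice of `NOUN` as the default for unknown words are assumptions for this example. Practical taggers are trained on annotated text rather than written by hand.

```python
# Hypothetical word -> tag lexicon.
TAG_LEXICON = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "brown": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB",
    "over": "ADP",
}


def tag_tokens(tokens, default="NOUN"):
    # Pair every token with its tag; unknown words get the default tag.
    return [(t, TAG_LEXICON.get(t.lower(), default)) for t in tokens]


print(tag_tokens(["The", "quick", "fox", "jumps"]))
# [('The', 'DET'), ('quick', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB')]
```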


9. Chunking

Chunking groups tagged words, such as nouns, verbs, and adjectives, into larger units like phrases.
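As an illustration, the sketch below groups tagged words into noun phrases. The pattern it looks for (an optional determiner, any number of adjectives, then one or more nouns) and the tag names are assumptions picked for this example. `chunk_noun_phrases` is a hypothetical helper.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases: DET? ADJ* NOUN+."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DET":
            j += 1
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:
            # At least one noun was found, so tagged[i:k] is a chunk.
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
print(chunk_noun_phrases(tagged))
# ['the quick brown fox', 'the lazy dog']
```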


10. Parsing

Parsing passes the data through a set of syntax rules. The output is then fed as input to other machine learning systems. The syntax rules vary from project to project.
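To show what "syntax rules" can look like, here is a tiny hand-written parser over tagged words. The three rules (S -> NP VP, NP -> DET? ADJ* NOUN, VP -> VERB NP?) and all function names are assumptions invented for this sketch. Your project's rules will differ.

```python
def parse_np(tagged, i):
    """NP -> DET? ADJ* NOUN. Returns (tree, next_index) or None."""
    children = []
    if i < len(tagged) and tagged[i][1] == "DET":
        children.append(tagged[i])
        i += 1
    while i < len(tagged) and tagged[i][1] == "ADJ":
        children.append(tagged[i])
        i += 1
    if i < len(tagged) and tagged[i][1] == "NOUN":
        children.append(tagged[i])
        return ("NP", children), i + 1
    return None


def parse_vp(tagged, i):
    """VP -> VERB NP?. Returns (tree, next_index) or None."""
    if i < len(tagged) and tagged[i][1] == "VERB":
        children = [tagged[i]]
        np = parse_np(tagged, i + 1)
        if np is not None:
            children.append(np[0])
            return ("VP", children), np[1]
        return ("VP", children), i + 1
    return None


def parse_sentence(tagged):
    """S -> NP VP. Returns a tree, or None if the rules do not fit."""
    np = parse_np(tagged, 0)
    if np is None:
        return None
    vp = parse_vp(tagged, np[1])
    if vp is None or vp[1] != len(tagged):
        return None
    return ("S", [np[0], vp[0]])


sentence = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
            ("a", "DET"), ("cat", "NOUN")]
print(parse_sentence(sentence))
```

The result is a nested tuple (a small parse tree) that a later step could turn into features for a model.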


Keep reading. In my next post, I will add a small project that uses all of these techniques.
