10 Text Cleaning Techniques You Need in Python
Python is a popular choice for text analytics. Raw text usually contains noise, such as markup, filler words, and misspellings, that you need to remove or normalize before analysis.
The ten cleaning techniques covered here are:
- Removing HTML tags
- Tokenization
- Removing unnecessary tokens and stopwords
- Handling contractions
- Correcting spelling errors
- Stemming
- Lemmatization
- Tagging
- Chunking
- Parsing
First Steps in Text Analytics with Python
Data is the primary input for analytics projects. Once cleaned, it is fed to intelligent systems such as machine learning and deep learning models.
1. Removing HTML tags
The goal is to strip HTML tags and other markup noise so that only the readable text remains.
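Here is a minimal sketch of tag removal using only the standard library's `html.parser`. The class name `_TextExtractor` and the helper `strip_html` are names I made up for illustration, and skipping `<script>` and `<style>` contents is my own assumption about what counts as "other noise".

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def strip_html(raw_html):
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


sample = ("<html><body><h1>Hello</h1><p>Text &amp; <b>data</b></p>"
          "<script>var x = 1;</script></body></html>")
print(strip_html(sample))
```

Joining the pieces with a space keeps words from running together across tags, but it can also split a word that is only partly wrapped in a tag, so treat this as a starting point.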
2. Tokenization
The most popular tokenization techniques include sentence and word tokenization, which are used to break down a text document (or corpus) into sentences and each sentence into words.
Thus, tokenization can be defined as the process of breaking down or splitting textual data into smaller and more meaningful components called tokens.
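To make this concrete, here is a rough sketch of sentence and word tokenization with regular expressions. The function names `sent_tokenize` and `word_tokenize` are my own choices for this example, and the rules are deliberately naive: abbreviations such as "Dr." would be split incorrectly.

```python
import re


def sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (naive on abbreviations)
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence):
    # A token is a word (optionally with an internal apostrophe)
    # or a single punctuation character
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*|[^\w\s]", sentence)


text = "Python is popular. It's used for text analytics!"
sentences = sent_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```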
3. Removing Unnecessary Stop Words
Stopwords are usually the words that occur most frequently when you split a corpus into individual tokens and count their frequencies. Words like "a," "the," and "and" are stopwords.
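A small sketch of both ideas, counting token frequencies and then filtering. The `STOPWORDS` set below is a tiny hand-picked list for illustration only; a real project would use a much fuller list.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so "The" and "the" are treated alike
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "The cat and the dog sat in the garden".split()
counts = Counter(t.lower() for t in tokens)
filtered = remove_stopwords(tokens)
print(counts.most_common(1))
print(filtered)
```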
4. Handling contractions
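Contractions are shortened forms such as "you'll" and "it's". Handling them usually means expanding each one to its full form ("you will", "it is") so that both spellings are treated the same way.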
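One simple way to do this is a lookup table plus a regular expression. The `CONTRACTIONS` mapping below is a short sample I put together for this sketch, and it makes a simplifying assumption: "it's" always expands to "it is", even though it can also mean "it has".

```python
import re

# Short sample mapping (an assumption, far from complete)
CONTRACTIONS = {
    "you'll": "you will",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    def _replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital letter, e.g. "It's" -> "It is"
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("It's clear you'll like it, don't worry."))
```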
5. Correcting spelling errors
This step automatically corrects misspelled words. A Google search does something similar when it fixes the spelling of your query.
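As a rough illustration, the standard library's `difflib.get_close_matches` can pick the nearest word from a known vocabulary. The `VOCABULARY` list, the `correct_word` helper, and the `cutoff` value are all assumptions made for this sketch; a real corrector would need a large dictionary and word frequencies.

```python
import difflib

# Small sample vocabulary (an assumption for this example)
VOCABULARY = ["python", "analytics", "language", "spelling", "token"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    lower = word.lower()
    if lower in vocabulary:
        return word
    # Return the closest vocabulary entry, or the word unchanged if none is close
    matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print([correct_word(w) for w in ["pyhton", "analitics", "token"]])
```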
6. Stemming
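Stemming reduces a word to its root (stem) by stripping affixes. For example, "jumping", "jumped", and "jumps" all reduce to "jump". Snowball is also the name of a widely used family of stemming algorithms. Note that stemming does not split compound words: "snowball" is not broken into "snow" and "ball".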
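To show the idea of suffix stripping, here is a toy stemmer. It is not the Porter or Snowball algorithm; the `SUFFIXES` tuple and the `toy_stem` function are simplifications I made up for this example.

```python
# Suffixes are checked longest first; this list is a toy assumption
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["jumping", "jumped", "jumps", "snowball"]])
```

The stem is not always a dictionary word (this toy version turns "running" into "runn"), which is the main difference from lemmatization. For real work, NLTK ships a `SnowballStemmer` class.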
7. Lemmatization
Lemmatization also reduces words to a root form, but it uses context, such as the part of speech, so that the result (the lemma) is a meaningful dictionary word.
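The role of context can be sketched with a lookup keyed by both the word and its part of speech. The `LEMMA_TABLE` entries and the `toy_lemmatize` function are hypothetical and only cover a handful of words; real lemmatizers rely on a full dictionary.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    # Fall back to the lowercased word when no entry exists
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("Mice", "NOUN"))
```

The same word, "meeting", gets a different lemma depending on whether it is used as a verb or as a noun.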
8. Tagging
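Tagging assigns each word a tag that groups it with similar words. A common case is part-of-speech tagging, which labels words as nouns, verbs, adjectives, and so on.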
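Here is a toy part-of-speech tagger that combines a tiny lexicon with suffix rules. The `LEXICON` dictionary, the tag names, and the fallback rules are all assumptions for illustration; trained taggers are far more accurate.

```python
# Tiny hand-made lexicon (an assumption for this example)
LEXICON = {"the": "DET", "a": "DET", "quick": "ADJ", "dog": "NOUN"}


def toy_pos_tag(tokens):
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            tag = LEXICON[lower]
        elif lower.endswith("ly"):
            tag = "ADV"
        elif lower.endswith("ing") or lower.endswith("ed"):
            tag = "VERB"
        else:
            tag = "NOUN"  # default guess for unknown words
        tagged.append((token, tag))
    return tagged


print(toy_pos_tag(["The", "quick", "dog", "chased", "a", "cat"]))
```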
9. Chunking
Chunking builds larger units, such as phrases, from words that have been tagged as verbs, nouns, adjectives, and so on.
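As a sketch, the function below groups already-tagged words into simple noun phrases. The pattern it looks for (an optional determiner, any adjectives, then one or more nouns) and the name `noun_phrase_chunks` are my own assumptions; the tagged input is written by hand so the example stands alone.

```python
def noun_phrase_chunks(tagged):
    """Group (word, tag) pairs into chunks of the form DET? ADJ* NOUN+."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag == "NOUN":
            current.append(word)
            has_noun = True
        elif tag in ("DET", "ADJ") and not has_noun:
            if tag == "DET":
                current = []  # a determiner always starts a fresh chunk
            current.append(word)
        else:
            # Any other tag (or a DET/ADJ after a noun) closes the open chunk
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
            if tag in ("DET", "ADJ"):
                current.append(word)
    if has_noun:
        chunks.append(" ".join(current))
    return chunks


tagged = [("The", "DET"), ("quick", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]
print(noun_phrase_chunks(tagged))
```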
10. Parsing
Parsing passes the text through a set of syntax rules to produce a structured result. The output is then fed as input to other machine learning systems. The syntax rules vary from project to project.
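To give a feel for syntax rules, here is a tiny hand-written parser for a made-up grammar: S -> NP VP, NP -> DET? ADJ* NOUN, VP -> VERB NP. The grammar, the function names, and the tuple-based tree are assumptions chosen for this sketch, since real projects define their own rules.

```python
def parse_np(tagged, i):
    """NP -> DET? ADJ* NOUN. Returns (tree, next_index) or None."""
    children = []
    if i < len(tagged) and tagged[i][1] == "DET":
        children.append(tagged[i])
        i += 1
    while i < len(tagged) and tagged[i][1] == "ADJ":
        children.append(tagged[i])
        i += 1
    if i < len(tagged) and tagged[i][1] == "NOUN":
        children.append(tagged[i])
        return ("NP", children), i + 1
    return None


def parse_vp(tagged, i):
    """VP -> VERB NP. Returns (tree, next_index) or None."""
    if i < len(tagged) and tagged[i][1] == "VERB":
        result = parse_np(tagged, i + 1)
        if result:
            np_tree, j = result
            return ("VP", [tagged[i], np_tree]), j
    return None


def parse_sentence(tagged):
    """S -> NP VP. Returns a tree only if every token is consumed."""
    np_result = parse_np(tagged, 0)
    if not np_result:
        return None
    np_tree, i = np_result
    vp_result = parse_vp(tagged, i)
    if not vp_result:
        return None
    vp_tree, j = vp_result
    return ("S", [np_tree, vp_tree]) if j == len(tagged) else None


tagged = [("The", "DET"), ("quick", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]
tree = parse_sentence(tagged)
print(tree)
```

A sentence that does not fit the rules returns `None`, which is one simple way to flag text the downstream system should not receive.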
Keep reading: in my next post I will add a small project that uses all of these.