Posts

Showing posts with the label Cleaning data

Featured Post

How to Work With Tuple in Python

The tuple is one of Python's built-in data structures; others include the list and the dictionary. The operations you can perform on a tuple are shown here for your reference. Writing a tuple is easy: it is a series of comma-separated values enclosed in parentheses '()'. A tuple is immutable, which means you cannot replace its values after it is created.

#1. How to create a tuple

Code:
my_tuple = (1, 2, 3, 4, 5)
print(my_tuple)

Output:
(1, 2, 3, 4, 5)

#2. How to read tuple values

Code:
print(my_tuple[0])

Output:
1

#3. How to add two tuples

Code:
a = (1, 6, 7, 8)
c = (3, 4, 5, 6, 7, 8)
d = a + c
print(d)

Output:
(1, 6, 7, 8, 3, 4, 5, 6, 7, 8)

#4. How to count tuple values

Here, count does not return the number of values in the tuple; it returns how many times a given value occurs.

Code:
sample = (1, 6, 7, 8, 3, 4, 5, 6, 7, 8)
print(sample
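
The tuple operations covered in this excerpt (creating, indexing, concatenating, counting occurrences, and immutability) can be sketched in one short script. This is only an illustration: the names first, combined, sixes, and is_immutable are hypothetical and do not come from the original post.

```python
# A minimal sketch of the tuple operations described above.
my_tuple = (1, 2, 3, 4, 5)

# Indexing starts at 0, so this reads the first value.
first = my_tuple[0]

# "Adding" two tuples concatenates them into a new tuple.
a = (1, 6, 7, 8)
c = (3, 4, 5, 6, 7, 8)
combined = a + c

# count() returns how many times a given value occurs,
# not how many values the tuple holds (use len() for that).
sixes = combined.count(6)
length = len(combined)

# Tuples are immutable: assigning to an index raises TypeError.
try:
    my_tuple[0] = 99
    is_immutable = False
except TypeError:
    is_immutable = True

print(first, combined, sixes, length, is_immutable)
```

Note that the original tuples are left unchanged by concatenation; a + c builds a new tuple.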

10 Exclusive Steps You Need for Web Scraping

Here are ten Python techniques for cleaning scraped data. Scraped text contains unwanted hidden data, so as part of cleaning it, work through these ten steps.

10 Steps for Web Scraping

Data is the prime input for text analytics projects. After cleaning, you can feed it to machine learning or deep learning systems.

Removing HTML tags
Tokenization
Removing unnecessary tokens and stop-words
Handling contractions
Correcting spelling errors
Stemming
Lemmatization
Tagging
Chunking
Parsing

10 Techniques to Clean Text in Python

1. Removing HTML tags

Unstructured text gathered by web or screen scraping (data from web pages, blogs, and online repositories) contains a lot of noise. HTML tags, JavaScript, and iframe tags typically add little value to understanding and analyzing text. The goal is to remove HTML tags and other noise.

2. Tokenization

Tokens are independent, minimal textual components that have a definite syntax and semantics. A paragraph of text or a text document has
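
As a rough illustration of the first two steps, removing HTML tags and tokenization, the sketch below uses only the Python standard library (html.parser and re). The names TextExtractor, strip_html, and tokenize are hypothetical helpers introduced for this example, and the regular-expression tokenizer is a simplifying assumption; the original post may rely on a different library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of script/style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments with spaces and collapse repeated whitespace.
        return re.sub(r"\s+", " ", " ".join(self._parts)).strip()


def strip_html(raw_html):
    """Return the text of raw_html with tags and scripts removed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


def tokenize(text):
    """Split text into word tokens, keeping contractions like "Don't" whole."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


sample_html = (
    "<html><body><h1>Hello</h1>"
    "<script>var x = 1;</script>"
    "<p>Don't stop scraping.</p></body></html>"
)
clean_text = strip_html(sample_html)
tokens = tokenize(clean_text)
print(clean_text)
print(tokens)
```

Joining fragments with a space keeps text from adjacent block elements apart, at the cost of occasionally splitting a word that is broken up by inline tags; a production pipeline would likely need a more careful rule.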