Showing posts with the label Cleaning data

Featured Post

The Quick and Easy Way to Analyze NumPy Arrays

The quickest and easiest way to analyze a NumPy array is to create it with numpy.array() and then apply NumPy's aggregate functions to it. These functions let you find the sum, mean, standard deviation, max, min, and other useful statistics of the values contained in the array.

Sum

You can find the sum of NumPy arrays using the np.sum() function. For example:

    import numpy as np

    a = np.array([1,2,3,4,5])
    b = np.array([6,7,8,9,10])

    # Passing both arrays in a list adds up every element of a and b
    result = np.sum([a,b])
    print(result)  # Output will be 55

Mean

You can find the mean of a NumPy array using the np.mean() function. This function takes an array (or an array-like object such as a list) as an argument and returns the mean of all the values in it. For example, the mean of [1,2,3,4,5] would be:

    result = np.mean([1,2,3,4,5])
    print(result)  # Output: 3.0

Standard Deviation

To find the standard deviation of a NumPy array, you can use the np.std() function. This function takes in an array as a par
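The statistics named in this excerpt (sum, mean, standard deviation, max, min) can be collected in one pass. The following is a minimal sketch, not the post's own code: the sample values and the `stats` dictionary are illustrative assumptions. Note that np.std() computes the population standard deviation by default (ddof=0).

```python
import numpy as np

# Illustrative sample values (an assumption, not data from the post)
values = np.array([1, 2, 3, 4, 5])

# Collect the common summary statistics in one dictionary
stats = {
    "sum": np.sum(values),
    "mean": np.mean(values),
    "std": np.std(values),  # population standard deviation (ddof=0 by default)
    "max": np.max(values),
    "min": np.min(values),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

If a sample standard deviation is wanted instead, np.std(values, ddof=1) is the usual adjustment.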

10 Exclusive Steps You Need for Web Scraping

Here are ten Python techniques for cleaning scraped data. Scraped text contains unwanted hidden data, so as part of cleaning it, try to handle these ten items.

10 Steps for Web Scraping

Data is the prime input for text analytics projects. After cleaning, you can feed it to machine learning or deep learning systems.

1. Removing HTML tags
2. Tokenization
3. Removing unnecessary tokens and stop-words
4. Handling contractions
5. Correcting spelling errors
6. Stemming
7. Lemmatization
8. Tagging
9. Chunking
10. Parsing

10 Techniques to Clean Text in Python

1. Removing HTML tags

Unstructured text gathered through web or screen scraping (data from web pages, blogs, and online repositories) contains a lot of noise. HTML tags, JavaScript, and iframe tags typically don't add much value to understanding and analyzing text. Our purpose is to remove HTML tags and other noise.

2. Tokenization

Tokens are independent, minimal textual components that have a definite syntax and semantics. A paragraph of text or a text document has
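The first two steps, removing HTML tags and tokenization, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the post's own method: `TextExtractor`, `strip_html`, `tokenize`, and the sample markup are hypothetical names and data introduced here, and the regex tokenizer is a simple stand-in for a full NLP tokenizer.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(raw_html):
    """Return the text of raw_html with tags and script/style bodies removed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def tokenize(text):
    """Split text into word tokens, keeping simple contractions like "It's"."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


# Hypothetical scraped markup, for illustration only
sample = (
    "<html><body><h1>Hello, world!</h1>"
    "<script>var x = 1;</script>"
    "<p>It's scraped text.</p></body></html>"
)

clean = strip_html(sample)
tokens = tokenize(clean)
print(clean)
print(tokens)
```

Real-world pages are often malformed, so a dedicated HTML parsing library and a language-aware tokenizer are usually better choices for production work; the sketch only shows the shape of the two steps.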