Posts

Showing posts with the label Cleaning data

Featured Post

SQL Query: 3 Methods for Calculating Cumulative SUM

SQL provides several constructs for calculating cumulative sums. In this article, we explore three SQL queries that compute a cumulative sum, each using a different construct.

Using Window Functions (e.g., PostgreSQL, SQL Server, Oracle)

    SELECT id, value, SUM(value) OVER (ORDER BY id) AS cumulative_sum
    FROM your_table;

This query uses the SUM() window function with the OVER clause to calculate the cumulative sum of the value column, ordered by the id column.

Using a Self-Join (e.g., MySQL and SQLite versions without window functions)

    SELECT t1.id, t1.value, SUM(t2.value) AS cumulative_sum
    FROM your_table t1
    JOIN your_table t2 ON t1.id >= t2.id
    GROUP BY t1.id, t1.value
    ORDER BY t1.id;

This query uses a self-join to calculate the cumulative sum. It joins the table with itself, matching rows where the id in the first table is greater than or ...
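To see that the window-function query and the self-join query return the same running totals, here is a minimal sketch that runs both against an in-memory SQLite database from Python. The sample rows and the helper name `cumulative_sums` are made up for illustration, and the window-function query assumes the SQLite library bundled with Python is version 3.25 or newer.

```python
import sqlite3


def cumulative_sums(rows):
    """Run both cumulative-sum queries on (id, value) rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany("INSERT INTO your_table (id, value) VALUES (?, ?)", rows)

    # Window function: running total over rows ordered by id.
    window_sql = """
        SELECT id, value, SUM(value) OVER (ORDER BY id) AS cumulative_sum
        FROM your_table
        ORDER BY id
    """

    # Self-join: each row is paired with every row whose id is <= its own.
    self_join_sql = """
        SELECT t1.id, t1.value, SUM(t2.value) AS cumulative_sum
        FROM your_table t1
        JOIN your_table t2 ON t1.id >= t2.id
        GROUP BY t1.id, t1.value
        ORDER BY t1.id
    """

    window_result = conn.execute(window_sql).fetchall()
    self_join_result = conn.execute(self_join_sql).fetchall()
    conn.close()
    return window_result, self_join_result


# Made-up sample data: (id, value)
sample_rows = [(1, 10), (2, 20), (3, 5), (4, 15)]
window_result, self_join_result = cumulative_sums(sample_rows)

for row in window_result:
    print(row)
```

On large tables the self-join compares every row with all earlier rows, so the window function is usually the better choice wherever it is available.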

10 Exclusive Steps You Need for Web Scraping

Here are ten Python techniques for cleaning scraped data. Scraped text contains unwanted hidden data, so as part of cleaning it, work through these ten steps.

10 Steps for Web Scraping

Data is the prime input for text analytics projects. After cleaning, you can feed it to machine learning or deep learning systems.

1. Removing HTML tags
2. Tokenization
3. Removing unnecessary tokens and stop-words
4. Handling contractions
5. Correcting spelling errors
6. Stemming
7. Lemmatization
8. Tagging
9. Chunking
10. Parsing

10 Techniques to Clean Text in Python

1. Removing HTML tags

Unstructured text gathered by web or screen scraping (from web pages, blogs, and online repositories) contains a lot of noise. HTML tags, JavaScript, and iframe tags typically don't add much value to understanding and analyzing text. The goal is to remove HTML tags and other noise.

2. Tokenization

Tokens are independent, minimal textual components with a definite syntax and semantics. A paragraph of text or a text document has ...
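As a rough illustration of the first two steps (removing HTML tags and tokenization), here is a minimal sketch that uses only the Python standard library. The names `strip_html` and `tokenize`, the sample HTML, and the tokenization regex are assumptions made for this example, not the post's own code; real projects often use libraries such as BeautifulSoup or NLTK for these steps.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and iframe content."""

    SKIPPED_TAGS = {"script", "style", "iframe"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def strip_html(raw_html):
    """Step 1: remove HTML tags and embedded scripts (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


def tokenize(text):
    """Step 2: split text into word and punctuation tokens (simple regex)."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


# Made-up sample page
raw_html = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Hello, world!</h1><p>The text isn't clean.</p>"
    '<iframe src="ad.html">fallback</iframe></body></html>'
)

clean_text = strip_html(raw_html)
tokens = tokenize(clean_text)
print(clean_text)
print(tokens)
```

The regex keeps contractions such as "isn't" as a single token, which leaves them intact for a later contraction-handling step.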