Posts

Showing posts with the label Text

HBASE Vs. RDBMS Top Differences You can Unlock Now

In the big data context, HBase has several benefits over an RDBMS. The differences listed below help explain why HBase is popular on the Hadoop (big data) platform. Let us check them one by one.

HBASE Vs. RDBMS Differences

Random Accessing
HBase handles large amounts of data stored in a distributed, column-oriented format and supports random access to it, while an RDBMS is a systematic, structured store that is not designed for random access to data at that scale.

Database Rules
An RDBMS strictly follows Codd's 12 rules, with fixed schemas and row-oriented storage, and also follows ACID properties. HBase follows BASE properties and does not natively implement complex queries. Secondary indexes, complex inner and outer joins, count, sum, sort, group, and paging over table data are all easily accessible in an RDBMS.

Storage
For small to medium storage applications, an RDBMS such as MySQL or PostgreSQL provides the solution, up to the point where size, concurrency, and performance requirements increase.  Codd'

Python Logic to Remove HTML tags from Web data

HTML and XML tags are common in raw data. This post shows how to remove HTML and XML tags using BeautifulSoup. In Python, the first step of text analytics is cleaning, and you can remove HTML tags with the BeautifulSoup parser. When analyzing web data, consider the examples below for your projects.

Python Ideas to Remove HTML tags
How do I remove HTML tags using BeautifulSoup?
Import BeautifulSoup
Python Logic to Remove HTML tags
Before and after executing the code

1. Import BeautifulSoup

    from bs4 import BeautifulSoup

2. Python BeautifulSoup: How to Remove HTML Tags

    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>", "html.parser")
    text = soup.get_text()
    print(text)

3. Before and After Run
Before run: the input before executing the code.
After run: the result after executing the code.

Bottom-lin
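As a minimal sketch of the same idea wrapped in a reusable function, the snippet below assumes the `beautifulsoup4` package is installed. The helper name `strip_html` is hypothetical and not part of the original post, and the choice of a space separator is an assumption made so that text from adjacent tags does not run together.

```python
from bs4 import BeautifulSoup


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML/XML string with tags removed."""
    # "html.parser" is Python's built-in parser; naming it explicitly avoids
    # BeautifulSoup's warning about an unspecified parser.
    soup = BeautifulSoup(markup, "html.parser")
    # separator=" " puts a space between text fragments from different tags;
    # strip=True trims whitespace around each fragment.
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    raw = "<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
    print(strip_html(raw))
```

Calling `get_text()` with no arguments, as in the post, joins fragments with no separator, so a heading and the paragraph after it can end up glued together. Passing a separator is one way to keep word boundaries intact for later tokenization.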

10 Text Cleaning Technics You Need in Python

Ten Python techniques to clean text, shared for your quick reference. Raw text contains unwanted hidden data, and as an analytics engineer you need to remove these hidden, unwanted tags.

10 Technics to Clean Text
Removing HTML tags
Tokenization
Removing unnecessary tokens and stopwords
Handling contractions
Correcting spelling errors
Stemming
Lemmatization
Tagging
Chunking
Parsing

10 Technics to Clean Text in Python
Data is the prime source for text analytics projects, which then feed Machine Learning and Deep Learning systems.

1. Removing HTML tags
Unstructured text contains a lot of noise when you retrieve it from web pages, blogs, and online repositories with techniques like web scraping or screen scraping. HTML tags, JavaScript, and iframe tags typically don't add much value to understanding and analyzing text, so the goal is to remove HTML tags and other noise.

2. Tokenization
Tokens are independent, minimal textual components, and have a definite syntax
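The tokenization and stopword-removal steps listed above can be sketched with the standard library alone. This is an illustrative sketch, not the post's own code: the regular expression, the tiny `STOPWORDS` set, and the function names `tokenize` and `remove_stopwords` are all assumptions. Real projects typically rely on a library such as NLTK or spaCy for both steps.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

# A token is a run of letters/digits, optionally followed by an apostrophe
# suffix, so contractions such as "don't" stay in one piece.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def remove_stopwords(tokens: list[str], stopwords: set[str] = STOPWORDS) -> list[str]:
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    sentence = "Tokens are independent and minimal textual components."
    tokens = tokenize(sentence)
    print(tokens)
    print(remove_stopwords(tokens))
```

Keeping contractions as single tokens at this stage leaves the later "handling contractions" step free to expand them with a lookup table, instead of having to reassemble fragments.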