Featured Post

14 Top Data Pipeline Key Terms Explained

Image
 Here are some key terms commonly used in data pipelines 1. Data Sources Definition: Points where data originates (e.g., databases, APIs, files, IoT devices). Examples: Relational databases (PostgreSQL, MySQL), APIs, cloud storage (S3), streaming data (Kafka), and on-premise systems. 2. Data Ingestion Definition: The process of importing or collecting raw data from various sources into a system for processing or storage. Methods: Batch ingestion, real-time/streaming ingestion. 3. Data Transformation Definition: Modifying, cleaning, or enriching data to make it usable for analysis or storage. Examples: Data cleaning (removing duplicates, fixing missing values). Data enrichment (joining with other data sources). ETL (Extract, Transform, Load). ELT (Extract, Load, Transform). 4. Data Storage Definition: Locations where data is stored after ingestion and transformation. Types: Data Lakes: Store raw, unstructured, or semi-structured data (e.g., S3, Azure Data Lake). Data Warehous...

6 Advantages of Columnar Databases over Traditional RDBMS

In traditional RDBMS, when a data source is accessed by multi users at single time, then database will go into deadlock state.

One of the advantages of a columnar model is that if two or more users want to use a different subset of columns, they do not have to lock out each other.

(Superior benefits for NoSQL Jobs)
        (Superior benefits for NoSQL Jobs)
This design is made easier because of a disk storage method known as RAID (redundant array of independent disks, originally redundant array of inexpensive disks), which combines multiple disk drives into a logical unit. Data is stored in several patterns called levels that have different amounts of redundancy. The idea of the redundancy is that when one drive fails, the other drives can take over. When a replacement disk drive in put in the array, the data is replicated from the other disks in the array and the system is restored.

The following are the various levels of RAID:

RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage but no fault tolerance. It is a starting point for discussion.

In RAID 1 (mirroring without parity or striping) data is written identically to two drives, thereby producing a mirrored set; the read request is serviced by either of the two drives containing the requested data, whichever one involves the least seek time plus rotational latency.

In RAID 10 (mirroring and striping) data is written in stripes across primary disks that have been mirrored to the secondary disks.

In RAID 2 (bit-level striping with dedicated Hamming-code parity) all disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive. This theoretical RAID level is not used in practice.

In RAID 3 (byte-level striping with dedicated parity) all disk spindle rotation is synchronized, and data is striped so each sequential byte is on a different drive.

RAID 4 (block-level striping with dedicated parity) is equivalent to RAID 5 except that all parity data is stored on a single drive. In this arrangement, files may be distributed between multiple drives.

RAID 5, RAID 6, and other patterns exist; many of them are marketing terms more than technology. The goal is to provide fault tolerance of drive failures, up to n disk drive failures or removals from the array. This makes larger RAID arrays practical, especially for high-availability systems. While this is nice for database people, we get more benefit from parallelism for queries.

Comments

Popular posts from this blog

SQL Query: 3 Methods for Calculating Cumulative SUM

Big Data: Top Cloud Computing Interview Questions (1 of 4)

How to Fix datetime Import Error in Python Quickly