Hadoop Top Questions on Architecture

The hadoop.apache.org web site defines Hadoop as "a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models." Quite simply, that's the philosophy: to provide a framework that's simple to use, can be scaled easily, and provides fault tolerance and high availability for production usage.

The idea is to use existing low-cost hardware to build a powerful system that can process petabytes of data very efficiently and quickly.


Hadoop achieves this by storing the data locally on its DataNodes and processing it locally as well. All this is managed by the NameNode, which is the brain of the Hadoop system. Client applications ask the NameNode where the data blocks live and then read or write those blocks directly on the DataNodes.

Hadoop has two main components: the Hadoop Distributed File System (HDFS) and a framework for processing large amounts of data in parallel using the MapReduce paradigm.
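To make the MapReduce paradigm concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes"]))
```

The point of the split is that each map call depends only on its own line and each reduce call only on its own key, which is what lets a framework spread the work across machines.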

HDFS

HDFS is a distributed file system layer that sits on top of the native file system for an operating system. For example, HDFS can be installed on top of ext3, ext4, or XFS file systems for the Ubuntu operating system.

It provides redundant storage for massive amounts of data using cheap, unreliable hardware. At load time, data is distributed across all the nodes. That helps in efficient MapReduce processing. HDFS performs better with a few large files (multi-gigabytes) as compared to a large number of small files, due to the way it is designed.

Files are "write once, read multiple times." Newer versions support appending to existing files, but HDFS is meant for large, streaming reads, not random access. High sustained throughput is favored over low latency.

Files in HDFS are stored as blocks and replicated for redundancy and reliability. By default, each block is stored on three different DataNodes, so three copies of every file are maintained. Also, the block size is much larger than in other file systems. For example, NTFS (for Windows) has a default cluster size of 4 KB and Linux ext3 has a default block size of 4 KB.

Compare that with the HDFS default block size of 64 MB (128 MB in Hadoop 2.x and later).
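The effect of block size and replication can be checked with a little arithmetic. The sketch below assumes the 64 MB block size and three-way replication described above; the helper names `block_count` and `hdfs_footprint` are made up for illustration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = block_count(file_size_bytes, block_size_bytes)
    return blocks, blocks * replication, file_size_bytes * replication


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1 * GB)
    print(f"1 GB file in HDFS: {blocks} blocks, {replicas} replicas, {raw // GB} GB raw")
    print(f"1 GB file with 4 KB blocks: {block_count(1 * GB, 4 * KB)} blocks")
```

A 1 GB file needs only 16 HDFS blocks, compared with 262,144 blocks at 4 KB, which is why the metadata for large files stays small.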

NameNode

NameNode (or the "brain") stores metadata and coordinates access to HDFS. Metadata is kept in the NameNode's RAM for speedy retrieval, which reduces the NameNode's response time when it provides the addresses of data blocks. This configuration provides simple, centralized management, but it also makes the NameNode a single point of failure (SPOF) for HDFS. In previous versions, a Secondary NameNode kept periodic checkpoints of the metadata to help recovery after a NameNode failure. Current versions can run a Hot Standby NameNode in an Active/Passive configuration: the standby takes over all the functions of the NameNode without any user intervention, which eliminates the SPOF and provides NameNode redundancy.
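As a rough illustration of the NameNode's role, the toy model below keeps a file-to-block map in memory and hands out block addresses, while the bytes themselves come from the DataNodes. The classes `ToyNameNode` and `ToyDataNode` and the function `read_file` are hypothetical and do not correspond to any Hadoop API.

```python
class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        # path -> ordered list of (block_id, [names of DataNodes holding it])
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


def read_file(namenode, datanodes, path):
    """Ask the NameNode for addresses, then fetch each block from a replica."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        for name in holders:
            node = datanodes[name]
            if block_id in node.blocks:  # first replica that has the block
                data += node.blocks[block_id]
                break
        else:
            raise IOError(f"no replica available for block {block_id}")
    return data


def build_demo():
    """Two blocks, each replicated on two of three DataNodes."""
    datanodes = {n: ToyDataNode(n) for n in ("dn1", "dn2", "dn3")}
    datanodes["dn1"].blocks["b1"] = b"hello "
    datanodes["dn2"].blocks["b1"] = b"hello "
    datanodes["dn2"].blocks["b2"] = b"world"
    datanodes["dn3"].blocks["b2"] = b"world"
    namenode = ToyNameNode()
    namenode.add_file("/demo.txt", [("b1", ["dn1", "dn2"]), ("b2", ["dn2", "dn3"])])
    return namenode, datanodes


if __name__ == "__main__":
    nn, dns = build_demo()
    del dns["dn1"].blocks["b1"]  # simulate a lost replica on dn1
    print(read_file(nn, dns, "/demo.txt"))
```

Losing one replica does not break the read because another DataNode still holds the block, but losing the `ToyNameNode` object would lose the only map from files to blocks, which is the single point of failure in miniature.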

Since the metadata is stored in NameNode's RAM and each entry for a file (with its block locations) takes some space, a large number of small files will result in a lot of entries and take up more RAM than a small number of entries for large files.

Also, a file smaller than the block size (64 MB by default) is still mapped to a block of its own. It uses only its actual size on disk, but it adds a separate file entry and block entry to the NameNode's metadata; that's the reason it's preferable to use HDFS for large files instead of small files.
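The small-files effect can be estimated with a back-of-the-envelope calculation. The sketch below assumes roughly 150 bytes of NameNode memory per file or block entry, which is a commonly quoted rule of thumb rather than a figure from this post, and a 64 MB block size; the function names are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# Assumption: rough rule of thumb per metadata entry, not a measured value.
BYTES_PER_OBJECT = 150


def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def namenode_objects(total_bytes, file_size_bytes, block_size_bytes=64 * MB):
    """Count metadata entries (one per file plus one per block)."""
    files = ceil_div(total_bytes, file_size_bytes)
    blocks_per_file = max(1, ceil_div(file_size_bytes, block_size_bytes))
    return files * (1 + blocks_per_file)


def estimated_heap_bytes(total_bytes, file_size_bytes):
    """Very rough NameNode memory estimate for storing the data set."""
    return namenode_objects(total_bytes, file_size_bytes) * BYTES_PER_OBJECT


if __name__ == "__main__":
    for label, size in (("1 MB files", 1 * MB), ("1 GB files", 1 * GB)):
        heap = estimated_heap_bytes(1 * TB, size)
        print(f"1 TB stored as {label}: about {heap / MB:.1f} MB of metadata")
```

Under these assumptions the same 1 TB of data needs on the order of a hundred times more NameNode memory when it is stored as 1 MB files than when it is stored as 1 GB files.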
