Showing posts from September, 2015

6 Advantages of Columnar Databases over Traditional RDBMS

In a traditional RDBMS, when many users access the same data source at the same time, they can block one another on locks, and in the worst case the database ends up in a deadlock.
One advantage of a columnar model is that if two or more users want to work with different subsets of columns, they do not have to lock each other out.

This design is made easier because of a disk storage method known as RAID (redundant array of independent disks, originally redundant array of inexpensive disks), which combines multiple disk drives into a logical unit. Data is stored in several patterns called levels that have different amounts of redundancy. The idea of the redundancy is that when one drive fails, the other drives can take over. When a replacement disk drive is put in the array, the data is replicated from the other disks in the array and the system is restored.

The following are the various levels of RAID:

RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional…
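Block-level striping can be pictured with a small Python sketch. This is only an illustration: the function names `stripe` and `read_back` are made up for this example, and a real RAID controller works on disk sectors rather than Python lists.

```python
from typing import List


def stripe(data: bytes, num_disks: int, block_size: int) -> List[List[bytes]]:
    """Split data into fixed-size blocks and deal them round-robin across the disks."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


def read_back(disks: List[List[bytes]]) -> bytes:
    """Reassemble the original data by visiting the disks in the same round-robin order."""
    total_blocks = sum(len(disk) for disk in disks)
    return b"".join(
        disks[i % len(disks)][i // len(disks)] for i in range(total_blocks)
    )


disks = stripe(b"ABCDEFGHIJ", num_disks=2, block_size=2)
print(disks)             # [[b'AB', b'EF', b'IJ'], [b'CD', b'GH']]
print(read_back(disks))  # b'ABCDEFGHIJ'
```

Because every block lives on exactly one disk, losing either list in `disks` loses part of the data, which is what "no redundancy" means here.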

Top features of Apache Avro in Hadoop eco-System

Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.

The Hadoop ecosystem includes a new binary data serialization system: Avro.
Avro provides:

·Rich data structures.
·A compact, fast, binary data format.
·A container file, to store persistent data.
·Remote procedure call (RPC).
·Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
Its functionality is similar to that of other marshaling systems such as Thrift and Protocol Buffers.
The main differentiators of Avro include the following:
Dynamic typing — The Avro implementation always keeps data and its corresponding schema together. As a result, marshaling/unmarshaling operations do not require either code generation or static data types. This also allows…
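The idea of keeping data and schema together can be sketched in plain Python. This is a toy illustration only: it does not use the Avro library or Avro's binary encoding, the helper names `write_container` and `read_container` are made up, and JSON lines stand in for the real container file. The schema itself follows the JSON record form that Avro schemas use.

```python
import io
import json

# A record schema written in the JSON form used by Avro schemas.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}


def write_container(stream, schema, records):
    """Write the schema first, then each record as a bare list of field values."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        values = [record[field["name"]] for field in schema["fields"]]
        stream.write(json.dumps(values) + "\n")


def read_container(stream):
    """Read records back using only the schema stored in the stream itself."""
    schema = json.loads(stream.readline())
    names = [field["name"] for field in schema["fields"]]
    return [dict(zip(names, json.loads(line))) for line in stream if line.strip()]


buffer = io.StringIO()
write_container(buffer, USER_SCHEMA, [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 27}])
buffer.seek(0)
print(read_container(buffer))
```

The reader never sees a generated class or a hard-coded field list; everything it needs arrives with the data.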

AWS -Distributed Key-value storage

Distributed key-value storage, in contrast to object storage, provides structured storage that is somewhat akin to a database but different in important ways in order to provide additional scalability and performance.
Perhaps you've already used a relational database management system — a storage product that's commonly referred to as RDBMS. Its rows of data have one or more keys (hence the name key-value storage) that support manipulation of the data. 
Though RDBMS systems are fantastically useful, they typically face challenges in scaling beyond a single server. Newer distributed key-value storage products are designed from the get-go to support huge amounts of data by spreading across multiple (perhaps thousands of) servers.
Key-value storage systems often make use of redundancy within hardware resources to prevent outages; this concept is important when you're running thousands of servers, because they're bound to suffer hardware breakdowns. Without redundancy, the ent…
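A minimal sketch of spreading keys across servers with redundancy might look like the following. It is an assumption-laden toy, not how any particular AWS service is implemented: the class `ToyKeyValueCluster`, the hash-modulo placement, and the "next node holds the replica" rule are all invented for illustration.

```python
import hashlib


class ToyKeyValueCluster:
    """In-memory stand-in for a cluster: each node is a dict, each key lives on several nodes."""

    def __init__(self, num_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas
        self.down = set()  # indexes of nodes we pretend have failed

    def _owners(self, key):
        # A stable hash picks the first node; replicas go to the following nodes.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for node in self._owners(key):
            self.nodes[node][key] = value

    def get(self, key):
        # Try each replica in turn, skipping nodes that are marked as failed.
        for node in self._owners(key):
            if node not in self.down and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)


cluster = ToyKeyValueCluster()
cluster.put("user:42", "Ann")
cluster.down.add(cluster._owners("user:42")[0])  # simulate a hardware failure
print(cluster.get("user:42"))                    # still answered by the replica
```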

Amazon web services -Object Storage

Object Storage:
Object storage provides the ability to store, well, objects — which are essentially collections of digital bits. Those bits may represent a digital photo, an MRI scan, a structured document such as an XML file — or the video of your cousin's embarrassing attempt to ride a skateboard down the steps at the public library (the one you premiered at his wedding).

Object storage offers the reliable (and highly scalable) storage of collections of bits, but imposes no structure on the bits.

The structure is chosen by the user, who needs to know, for example, whether an object is a photo (which can be edited), or an MRI scan (which requires a special application for viewing it). The user has to know both the format as well as the manipulation methods of the object. The object storage service simply provides reliable storage of the bits.

Difference between Object storage and File storage

Object storage differs from file storage, which you may be more familiar with from using …

2 Awesome differences of SOAP and REST in Web Services

SOAP is based on a document encoding standard known as Extensible Markup Language (XML, for short), and the SOAP service is defined in such a way that users can then leverage XML no matter what the underlying communication network is. For this system to work, though, the data transferred by SOAP (commonly referred to as the payload) also needs to be in XML format.

Notice a pattern here? The push to be comprehensive and flexible (or, to be all things to all people) plus the XML payload requirement meant that SOAP ended up being quite complex, making it a lot of work to use properly. As you might guess, many IT people found SOAP daunting and, consequently, resisted using it.

About a decade ago, a doctoral student defined another web services approach as part of his thesis:

REST - Representational State Transfer, which is far less comprehensive than SOAP, aspires to solve fewer problems. It doesn't address some aspects of SOAP that seemed important but that, in retrospect, made it mor…
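To make the payload difference concrete, here is a small Python sketch that builds the same request both ways. The operation name `GetPrice` and its parameter are hypothetical. The envelope namespace is the SOAP 1.1 one. REST itself does not mandate a payload format, so JSON is used here only as a common choice.

```python
import json
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def soap_payload(operation, params):
    """Wrap the call in an XML Envelope/Body, as SOAP requires."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def rest_payload(params):
    """A REST-style request body; the operation is carried by the URL and HTTP verb."""
    return json.dumps(params)


print(soap_payload("GetPrice", {"item": "book"}))
print(rest_payload({"item": "book"}))
```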

The best differences SQL and NOSQL new database

Why SQL Does Not Work Here:
·Data is not on one machine or even one network.
·Data can be of any type, both public and private.
·The volume of data is so large that you cannot put it in one place.
·It is uncoordinated in time as well as space.
·It is not always the nice, structured data that SQL was meant to handle.

What is CompTIA Cloud+ Certification

The CompTIA Cloud+ certification is an internationally recognized validation of the knowledge required of IT practitioners working in cloud computing environments.

This exam will certify that the successful candidate has the knowledge and skills required to understand standard cloud terminology and methodologies to implement, maintain, and deliver cloud technologies and infrastructures (e.g., server, network, storage, and virtualization technologies); and to understand aspects of IT security and use of industry best practices related to cloud implementations and the application of virtualization.

Cloud+ certified professionals ensure that proper security measures are maintained for cloud systems, storage, and platforms to mitigate risks and threats while ensuring usability. The exam is geared toward IT professionals with 24 to 36 months of experience in IT networking, network storage, or data center administration. It is recommended that CompTIA Cloud+ cand…

Oracle 12C 'Bitmap Index' benefits over B-tree Index

A bitmap index has a significantly different structure from a B-tree index in the leaf node of the index. It stores one string of bits for each possible value of the column being indexed (the number of distinct values is the column's cardinality).

Note: In each string of bits, a bit is set to 1 for every row (tuple) whose indexed column holds that value and to 0 for every other row.
The length of the string of bits is the same as the number of rows in the table being indexed.

In addition to saving a tremendous amount of space compared to traditional indexes, a bitmap index can provide dramatic improvements in response time because Oracle can quickly remove potential rows from a query containing multiple conditions in its WHERE clause long before the table itself needs to be accessed.

Multiple bitmaps can use logical AND and OR operations to determine which rows to access from the table.
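The AND/OR combination can be sketched in a few lines of Python, using one integer per distinct value as the string of bits. This only illustrates the idea and is not how Oracle stores or compresses its bitmaps; the helper names and the sample columns are made up.

```python
from typing import Dict, List


def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Return one bitmap (stored as an int) per distinct value; bit i stands for row i."""
    index: Dict[str, int] = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def matching_rows(bitmap: int) -> List[int]:
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


gender = build_bitmap_index(["M", "F", "F", "M"])
region = build_bitmap_index(["east", "west", "east", "east"])

# WHERE gender = 'F' AND region = 'east'
print(matching_rows(gender["F"] & region["east"]))  # [2]
# WHERE gender = 'F' OR region = 'east'
print(matching_rows(gender["F"] | region["east"]))  # [0, 1, 2, 3]
```

Only after the bitmaps have been combined does the table need to be visited, and then only for the rows that survived.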

Although you can use a bitmap index on any column in a table, it is most efficient when the column…

What is Elastic Nature in Cloud Computing

Natural clouds are indeed elastic, expanding and contracting based on the force of the winds carrying them. The cloud is similarly elastic, expanding and shrinking based on resource usage and cloud tenant resource demands. The physical resources (computing, storage, networking, etc.) deployed within the data center or across data centers and bundled as a single cloud usually do not change that fast.
This elastic nature, therefore, is something that is built into the cloud at the software stack level, not the hardware.
Best cloud computing example: The classic promise of the cloud is to make compute resources available on demand, which means that theoretically, a cloud should be able to scale as a business grows and shrink as the demand diminishes. Consider Black Friday, for example. There's a spike in inbound traffic, which translates into more memory consumption, increased network density, and increased compute resource utilization. If had, let&#…

Essential features of Hadoop Data joins (1 of 2)

Limitation of map side joining: A record being processed by a mapper may need to be joined with a record that is not easily accessible (or even located) by that mapper. This is the main limitation.

What facilitates a map side join:

Hadoop's org.apache.hadoop.mapred.join package contains helper classes to facilitate this map side join.

What is joining data in Hadoop:

You will often need to analyze data from multiple sources, and in that scenario Hadoop uses data joining. In the database world, combining two or more tables is called a join. In Hadoop, joining data involves several different approaches:

·Reduce side join
·Replicated joins using Distributed cache
·Semijoin: reduce side join with map side filtering

What is the functionality of a MapReduce job:
The traditional MapReduce job reads a set of input data, performs some transformations in the map phase, sorts the results, performs another transformation in the reduce phase, and writes a set of output data. The sorting stage requires data to be tran…
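A reduce side join can be simulated in plain Python to show where each step happens. This is a single-process sketch, not Hadoop code: the function names, the tagging scheme, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(source, records, key_field):
    """Emit (join key, (source tag, record)) so the reducer can tell the inputs apart."""
    for record in records:
        yield record[key_field], (source, record)


def shuffle(pairs):
    """Group the tagged records by key and sort the keys, as the framework would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return dict(sorted(groups.items()))


def reduce_phase(tagged_records, left, right):
    """Pair every left-side record with every right-side record that shares the key."""
    lefts = [rec for src, rec in tagged_records if src == left]
    rights = [rec for src, rec in tagged_records if src == right]
    for l_rec, r_rec in product(lefts, rights):
        yield {**l_rec, **r_rec}


def reduce_side_join(left_rows, right_rows, key_field):
    pairs = list(map_phase("left", left_rows, key_field))
    pairs += list(map_phase("right", right_rows, key_field))
    joined = []
    for tagged in shuffle(pairs).values():
        joined.extend(reduce_phase(tagged, "left", "right"))
    return joined


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}, {"id": 3, "item": "cup"}]
print(reduce_side_join(customers, orders, "id"))
```

Every record from both inputs passes through the shuffle, including customer 2 and order 3, which end up matching nothing.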

How to verify SSH is installed in Hadoop Cluster

The following commands check whether SSH is installed on your Hadoop cluster.

[hadoop-user@master]$ which ssh
[hadoop-user@master]$ which sshd
[hadoop-user@master]$ which ssh-keygen

If each command prints the path of the program, SSH is installed. If you do not get such a response, SSH is not installed on your cluster.

If you receive an error message such as

/usr/bin/which: no ssh in (/usr/bin: /usr/sbin....)

you need to install OpenSSH, either via a Linux package manager or by downloading the source directly.

Note: This is usually done by System Admin.
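The same check can be scripted, for example when many nodes have to be verified. The sketch below uses Python's standard `shutil.which`, which searches the current user's PATH in the same way as the `which` command. The helper names are made up, and on some systems `sshd` lives in a directory that is only on root's PATH, so a "missing" result is a hint rather than proof.

```python
import shutil

REQUIRED_TOOLS = ("ssh", "sshd", "ssh-keygen")


def find_ssh_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or to None when it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(found):
    """Return the names of the tools that were not found."""
    return [tool for tool, path in found.items() if path is None]


found = find_ssh_tools()
for tool, path in found.items():
    print(f"{tool}: {path or 'not found'}")
if missing_tools(found):
    print("Install OpenSSH or ask your system administrator.")
```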

Hadoop File system 'help' command

As a Hadoop developer, it is sometimes difficult to remember all the Hadoop commands. Typing the command below and pressing Enter shows all of them.

hadoop fs

This lists all the Hadoop file system commands.

How Hadoop HDFS commands are developed:

Basically, Hadoop file system commands are modeled on their UNIX counterparts.

If you want to see the description of an individual command, use the Hadoop help command, as below.

hadoop fs -help ls

hadoop fs -help ls

Deleting a File in Hadoop HDFS:

The command below deletes a file from the Hadoop cluster.

hadoop fs -rm example.txt
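If these commands need to be run from a script, a thin Python wrapper is one option. This is a sketch under assumptions: the helper names are invented, and it only runs anything when a `hadoop` executable is actually on the PATH.

```python
import shutil
import subprocess


def hadoop_fs_command(*args):
    """Build the argument list for a 'hadoop fs' call, e.g. ('-rm', 'example.txt')."""
    return ["hadoop", "fs", *args]


def run_hadoop_fs(*args):
    """Run the command and return the completed process with captured output."""
    if shutil.which("hadoop") is None:
        raise FileNotFoundError("hadoop executable not found on PATH")
    return subprocess.run(
        hadoop_fs_command(*args), capture_output=True, text=True, check=False
    )


print(hadoop_fs_command("-help", "ls"))       # ['hadoop', 'fs', '-help', 'ls']
print(hadoop_fs_command("-rm", "example.txt"))  # ['hadoop', 'fs', '-rm', 'example.txt']
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.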

Why Amazon Web services AWS Cloud computing is so popular

You may be forgiven if you're puzzled about how Amazon, which started out as an online bookstore, has become the leading cloud computing provider.
Amazon's cloud computing services started in three stages:
·S3 (Simple Storage Service)
·SQS (Simple Queue Service)
·EC2 (Elastic Compute Cloud)
Amazon Web Services was officially revealed to the world on March 13, 2006. On that day, AWS offered the Simple Storage Service, its first service. (As you may imagine, Simple Storage Service was soon shortened to S3.) The idea behind S3 was simple: It could offer the concept of object storage over the web, a setup where anyone could put an object (essentially, any bunch of bytes) into S3. Those bytes may comprise a digital photo or a file backup or a software package or a video or audio recording or a spreadsheet file or, well, you get the idea. S3 was relatively limited when it first started out. Though objects could, admittedly, be written or read from anywhere, they could be store…

Top 100 Hadoop Complex interview questions (Part 4 of 4)

What is MapReduce? It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.
What are ‘maps’ and ‘reduces’? ‘Maps’ and ‘Reduces’ are two phases of solving a query in HDFS. ‘Map’ is responsible for reading data from the input location and, based on the input type, generating a key-value pair, that is, an intermediate output on the local machine. ‘Reducer’ is responsible for processing the intermediate output received from the mapper and generating the final output.
What are the four basic parameters of a mapper? The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent input parameters and the second two represent intermediate output parameters.
What are the four basic parameters of a reducer? The four basic parameters of a reducer are Text, IntWritable, Text, IntWritable. The first two represent intermediate output parameters and the second two represent final output param…
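The mapper and reducer parameters listed above line up with the classic word count job. The following is a plain-Python simulation, not Hadoop API code: Python ints and strings stand in for LongWritable, Text and IntWritable, and the `run_job` driver is a made-up replacement for the framework's shuffle and sort.

```python
from collections import defaultdict


def mapper(offset, line):
    """Input: (LongWritable byte offset, Text line). Output: (Text word, IntWritable 1)."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Input: (Text word, IntWritable counts). Output: (Text word, IntWritable total)."""
    yield word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: call the mapper, group by key, then call the reducer."""
    intermediate = defaultdict(list)
    offset = 0
    for line in lines:
        for word, one in mapper(offset, line):
            intermediate[word].append(one)
        offset += len(line) + 1  # position of the next line in the file
    output = {}
    for word in sorted(intermediate):
        for key, total in reducer(word, intermediate[word]):
            output[key] = total
    return output


print(run_job(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```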

Essential features of Cloudera Impala

Cloudera Impala is a query engine that runs on Apache Hadoop.

The project was announced in October 2012 with a public beta test distribution.

The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.

Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.

Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and …

Top 100 Hadoop Complex Interview Questions (Part 3 of 4)

What are the features of Stand alone (local) mode?

In stand-alone mode there are no daemons; everything runs on a single JVM. It has no DFS and utilizes the local file system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.

What are the features of Pseudo mode?

Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on the same machine.

Can we call VMs pseudos?

No, VMs are not pseudos, because a VM is something different and pseudo mode is very specific to Hadoop.

What are the features of Fully Distributed mode?
Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host on which the Namenode is running, another host on which a datanode is running, and then there are machines on which the task tracker is running. We have separate masters and se…

What 'SAP HANA' stands for: Best definition

What is SAP HANA?
HANA stands for High-Performance Analytic Appliance. SAP HANA is a combination of hardware and software, and is therefore an appliance.

SAP HANA supports column- and row-level storage. We can store and perform analytics on a huge amount of real-time, non-aggregated transactional data. Hence, HANA acts as both a database and a warehousing tool, which helps in making decisions at the right time. 
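The difference between row and column storage can be shown with a small Python sketch. It is a conceptual illustration only and says nothing about HANA's internal data structures; the table, the field names and the helper functions are invented.

```python
# Row layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.5},
    {"order_id": 3, "region": "east", "amount": 42.0},
]


def to_columns(rows):
    """Column layout: all values of one field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, field):
    """Has to walk every full record even though only one field is needed."""
    return sum(row[field] for row in rows)


def total_from_columns(columns, field):
    """Reads a single contiguous list and ignores the other columns."""
    return sum(columns[field])


columns = to_columns(rows)
print(columns["amount"])                       # [120.0, 80.5, 42.0]
print(total_from_rows(rows, "amount"))         # 242.5
print(total_from_columns(columns, "amount"))   # 242.5
```

Row layout suits reading or writing one whole record at a time, while column layout suits analytics that scan a few fields across many records.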

Challenges in Traditional RDBMS?
There are a few challenges in traditional databases, such as latency, the cost involved, and complexity in accessing databases.

What is the architecture of a traditional RDBMS?

Presentation Layer: This is the top-most layer and allows users to manipulate data so that they can input it for querying. This data input from users is passed on to the database layer through the application layer and the results are passed back to the application layer to implement business logics. The presentation layer can be anything—the …

Top 100 Hadoop Complex Interview Questions (Part 2 of 4)

1. If a datanode is full, how is it identified? When data is stored in a datanode, the metadata of that data is stored in the Namenode. So the Namenode will identify if the datanode is full.
2. If the number of datanodes increases, do we need to upgrade the Namenode? While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arises.
3. Are the job tracker and task trackers present on separate machines? Yes, the job tracker and task trackers are present on different machines. The reason is that the job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
4. When we send data to a node, do we allow settling-in time before sending more data to that node? Yes, we do.
5. Does Hadoop always require digital data to process? Yes. Hadoop always…