Posts

Showing posts from September, 2015

Featured Post

SQL Interview Success: Unlocking the Top 5 Frequently Asked Queries

Here are the top five commonly asked SQL queries in interviews. You can expect these in Data Analyst or Data Engineer interviews.

Top SQL Queries for Interviews

01. Joins
The commonly asked question provides two tables and asks how many rows various join types will return, and what the result set looks like.

Table1 (id): 1, 1, 2, 3
Table2 (id): 1, 3, 1, NULL

Inner join: 5 rows will return. The result will be:
1  1
1  1
1  1
1  1
3  3

02. Substring and Concat
Here, we need to write an SQL query that makes the first letter upper case and the remaining letters lower case.

Table1 (ename): raJu, venKat, kRIshna

Solution:
SELECT CONCAT(UPPER(SUBSTRING(ename, 1, 1)), LOWER(SUBSTRING(ename, 2))) AS capitalized_name
FROM Table1;

03. Case statement
SQL Query:
SELECT Code1, Code2,
    CASE
        WHEN Code1 = 'A' AND Code2 = 'AA' THEN "A" | "A…
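The CASE branch above is cut off mid-expression. As a rough sketch of how a complete CASE expression of this shape reads, the branch results ('Match', 'No match') and the alias are placeholders, not the original post's values:

SELECT Code1, Code2,
       CASE
           WHEN Code1 = 'A' AND Code2 = 'AA' THEN 'Match'   -- both codes agree
           WHEN Code1 = 'B' AND Code2 = 'BB' THEN 'Match'
           ELSE 'No match'                                  -- any other combination
       END AS match_flag
FROM Table1;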

6 Advantages of Columnar Databases over Traditional RDBMS

In a traditional RDBMS, when a data source is accessed by multiple users at the same time, the database can end up in a deadlock state. One of the advantages of the columnar model is that if two or more users want to use a different subset of columns, they do not have to lock each other out.

This design is made easier by a disk storage method known as RAID (redundant array of independent disks, originally redundant array of inexpensive disks), which combines multiple disk drives into a logical unit. Data is stored in several patterns called levels that have different amounts of redundancy. The idea of the redundancy is that when one drive fails, the other drives can take over. When a replacement disk drive is put in the array, the data is replicated from the other disks in the array and the system is restored. The following are the various levels of RAID: RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy…

Top features of Apache Avro in Hadoop eco-System

Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages. The Hadoop ecosystem includes a new binary data serialization system: Avro. Avro provides:

· Rich data structures.
· A compact, fast, binary data format.
· A container file, to store persistent data.
· Remote procedure call (RPC).
· Simple integration with dynamic languages.

Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages. Avro's functionality is similar to that of other marshaling systems such as Thrift, Protocol Buffers, and so on. The main differentiators of Avro include the following:

Dynamic typing: The Avro implementation always keeps data and its corresponding schema together. As a result…

RDBMS Vs Key-value: Top Four Differences

This post tells you the differences between an RDBMS and distributed key-value storage. An RDBMS is quite different from key-value storage.

RDBMS (Relational Database)
You have already used a relational database management system, a storage product commonly referred to as an RDBMS. It handles structured data. RDBMS systems are fantastically useful for handling moderate volumes of data. The BIG challenge is in scaling beyond a single server. You can't maintain redundant data in an RDBMS: all the data lives on a single server, and the entire database runs on that server. So when the server is down, the database may not be available for normal business operations. Outages and server failures are common risks in this model of database.

Key-Value Database
Key-value storage systems often make use of redundancy within hardware resources to prevent outages. This concept is important when you're running thousands of servers, because they're bound to suffer hardware breakdowns…

Amazon Web Services - Object Storage

Object Storage: Object storage provides the ability to store, well, objects, which are essentially collections of digital bits. Those bits may represent a digital photo, an MRI scan, a structured document such as an XML file, or the video of your cousin's embarrassing attempt to ride a skateboard down the steps at the public library (the one you premiered at his wedding).

Object storage offers reliable (and highly scalable) storage of collections of bits, but imposes no structure on the bits. The structure is chosen by the user, who needs to know, for example, whether an object is a photo (which can be edited) or an MRI scan (which requires a special application for viewing it). The user has to know both the format and the manipulation methods of the object. The object storage service simply provides reliable storage of the bits.

Difference between Object storage and File storage
Object storage differs from file storage, which you may be more familiar with from using…

SOAP Vs REST top differences you need to know

What is SOAP?
SOAP is based on a document encoding standard known as Extensible Markup Language (XML, for short), and the SOAP service is defined in such a way that users can then leverage XML no matter what the underlying communication network is. For this system to work, though, the data transferred by SOAP (commonly referred to as the payload) also needs to be in XML format. Notice a pattern here? The push to be comprehensive and flexible (or, to be all things to all people), plus the XML payload requirement, meant that SOAP ended up being quite complex, making it a lot of work to use properly. As you might guess, many IT people found SOAP daunting and, consequently, resisted using it.

About a decade ago, a doctoral student defined another web services approach as part of his thesis: REST (Representational State Transfer). REST, which is far less comprehensive than SOAP, aspires to solve fewer problems. It doesn't address some aspects of SOAP that seemed important but that…

SQL Vs NOSQL real differences to read today

SQL and NoSQL are two different technologies used with different kinds of databases. NoSQL is most popular for big data analytics, whereas SQL is popular with relational databases.

SQL Vs NOSQL Top Differences

SQL
SQL is the structured query language.
It was the first commercial language used in RDBMS.
The SQL language is divided into multiple sub-elements, as the sketch below illustrates.

NoSQL
Data is not on one machine or even one network.
Data can be of any type: public data and private data.
The volume of data is so huge that you cannot put it in one place.
It is uncoordinated in time as well as space.
It is not always the nice, structured data that SQL was meant to handle.

Also Read: RDBMS Vs NoSQL Databases top differences
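To make "sub-elements" concrete: SQL statements fall into groups such as DDL (data definition), DML (data manipulation), and DCL (data control). A minimal sketch follows; the table and column names are made up for the example.

-- DDL: define a table
CREATE TABLE employees (id INT, ename VARCHAR(50));

-- DML: add and query rows
INSERT INTO employees (id, ename) VALUES (1, 'Raju');
SELECT ename FROM employees WHERE id = 1;

-- DCL: control access
GRANT SELECT ON employees TO analyst_role;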

What is CompTIA Cloud+ Certification

What is CompTIA Cloud+ Certification: The CompTIA Cloud+ certification is an internationally recognized validation of the knowledge required of IT practitioners working in cloud computing environments. This exam certifies that the successful candidate has the knowledge and skills required to understand standard cloud terminology and methodologies; to implement, maintain, and deliver cloud technologies and infrastructures (e.g., server, network, storage, and virtualization technologies); and to understand aspects of IT security and the use of industry best practices related to cloud implementations and the application of virtualization.

Related: Cloud Computing Jobs

Cloud+ certified professionals ensure that proper security measures are maintained for cloud systems, storage, and platforms to mitigate risks and threats while ensuring usability. The exam is geared toward IT professionals with 24 to 36 months of experience in IT networking, network storage, or data center administration…

Oracle 12C 'Bitmap Index' benefits over B-tree Index

Oracle 12c 'Bitmap Index' benefits over B-tree Index: A bitmap index has a significantly different structure from a B-tree index in the leaf node of the index. It stores one string of bits for each possible value (the cardinality) of the column being indexed.

Note: "one string of bits" means that for each possible value of the indexed column, every row is assigned a '1' bit (if it holds that value) or a '0' bit (if it does not), and those bits together form a string.

The length of the string of bits is the same as the number of rows in the table being indexed. In addition to saving a tremendous amount of space compared to traditional indexes, a bitmap index can provide dramatic improvements in response time, because Oracle can quickly eliminate potential rows from a query containing multiple WHERE clauses long before the table itself needs to be accessed. Multiple bitmaps can be combined with logical AND and OR operations to determine which rows to access from the table, as the sketch below shows. Although you can…
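A minimal sketch of the idea in Oracle SQL; the employees table and its low-cardinality columns are assumptions for the example, not from the original post.

-- One bitmap index per low-cardinality column
CREATE BITMAP INDEX emp_gender_idx ON employees (gender);
CREATE BITMAP INDEX emp_region_idx ON employees (region);

-- Oracle can AND the two bitmaps together to find matching rows
-- before ever touching the table blocks
SELECT COUNT(*)
FROM employees
WHERE gender = 'F'
  AND region = 'WEST';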

What is Elastic Nature in Cloud Computing

Natural clouds are indeed elastic, expanding and contracting based on the force of the winds carrying them. The cloud is similarly elastic, expanding and shrinking based on resource usage and cloud tenant resource demands. The physical resources (computing, storage, networking, etc.) deployed within the data center or across data centers and bundled as a single cloud usually do not change that fast. This elastic nature, therefore, is something that is built into the cloud at the software stack level, not the hardware.

Best cloud computing example: The classic promise of the cloud is to make compute resources available on demand, which means that, theoretically, a cloud should be able to scale as a business grows and shrink as the demand diminishes. Consider, for example, Amazon.com during Black Friday. There's a spike in inbound traffic, which translates into more memory consumption, increased network density, and increased compute resource utilization. If Amazon.com had…

Essential features of Hadoop Data joins (1 of 2)

Limitation of map-side joining: A record being processed by a mapper may be joined with a record not easily accessible (or even located) by that mapper. This is the main limitation.

Who facilitates a map-side join: Hadoop's org.apache.hadoop.mapred.join package contains helper classes to facilitate this map-side join.

What is joining data in Hadoop: You will come across scenarios where you need to analyze data from multiple sources; in such scenarios Hadoop uses data joining. In the database world, combining two or more tables is called a join (the SQL sketch below shows the equivalent operation). In Hadoop, joining data involves different approaches:

Reduce-side join
Replicated join using a distributed cache
Semi-join: reduce-side join with map-side filtering

What is the functionality of a MapReduce job: The traditional MapReduce job reads a set of input data, performs some transformations in the map phase, sorts the results, performs another transformation in the reduce phase, and writes a set of output data. The…
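For reference, this is the database-world join that the Hadoop approaches above reproduce over files; the customers and orders tables are illustrative, not from the original post.

SELECT c.id, c.name, o.order_id
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- A reduce-side join achieves the same result by tagging records from
-- both inputs with their join key (the customer id) and letting the
-- shuffle bring matching records together at one reducer.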

How to Quickly Verify SSH Is Installed on a Hadoop Cluster

The commands below help you check whether SSH is installed on your Hadoop cluster.

[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/sbin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen

If you do not get a proper response like the one above, SSH is not installed on your cluster.

Resolution: If you receive an error message such as

/usr/bin/which: no ssh in (/usr/bin:/usr/sbin...)

you need to install OpenSSH (www.openssh.com) via your Linux package manager, or by downloading the source directly.

Note: This is usually done by the System Admin.

How to Use Help Command in HDFS

Sometimes, as a Hadoop developer, it is difficult to remember all the Hadoop commands. The help command is useful for recalling the correct syntax.

How to List All HDFS Commands

hadoop fs

Entering this will list all Hadoop fs commands.

Help Command in HDFS
Hadoop commands are a flavor of UNIX. If you want to see a description of each command, you can use the Hadoop help command:

hadoop fs -help ls

Deleting a File in Hadoop HDFS
The command below deletes a file from the Hadoop cluster:

hadoop fs -rm example.txt

Why Amazon Web Services (AWS) Cloud Computing Is So Popular

Amazon started its cloud computing services in three stages:

S3 (Simple Storage Service)
SQS (Simple Queue Service)
EC2 (Elastic Compute Cloud)

Amazon Web Services was officially revealed to the world on March 13, 2006. On that day, AWS offered the Simple Storage Service, its first service. (As you may imagine, Simple Storage Service was soon shortened to S3.) The idea behind S3 was simple: it could offer the concept of object storage over the web, a setup where anyone could put an object, essentially any bunch of bytes, into S3. Those bytes may comprise a digital photo or a file backup or a software package or a video or audio recording or a spreadsheet file or, well, you get the idea.

S3 was relatively limited when it first started out. Though objects could, admittedly, be written or read from anywhere, they could be stored in only one region: the United States. Moreover, objects could be no larger than 5 gigabytes, not tiny by any means, but certainly smaller than many…

Top 100 Hadoop Complex Interview Questions (Part 4 of 4)

The Hadoop framework is most popular in data analytics and data-related projects. Here is my 4th set of questions for you to read quickly.

1). What is MapReduce?
Ans). It is a framework, or a programming model, that is used for processing large data sets over clusters of computers using distributed programming.

2). What are 'maps' and 'reduces'?
Ans). 'Maps' and 'reduces' are two phases of solving a query in HDFS. 'Map' is responsible for reading data from the input location and, based on the input type, generating a key-value pair, that is, an intermediate output, on the local machine. 'Reducer' is responsible for processing the intermediate output received from the mapper and generating the final output.

3). What are the four basic parameters of a mapper?
Ans). The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input parameters and the second two represent the intermediate output parameters.

4). What are the four basic parame…

Cloudera Impala top features useful for developers

Cloudera Impala is a query engine that runs on Apache Hadoop. The project was announced in October 2012 with a public beta release, and its popular use is in data analytics. Here are the key features, useful for interviews.

Impala
The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, allowing users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other Hadoop software.

Impala Applications
Impala is designed for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries…
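A minimal sketch of the kind of low-latency SQL you would issue through Impala; the sales table and its columns are invented for this example.

-- Interactive aggregation over data already stored in HDFS
SELECT region,
       COUNT(*) AS orders,
       SUM(amount) AS revenue
FROM sales
WHERE sale_year = 2015
GROUP BY region
ORDER BY revenue DESC;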

Top 100 Hadoop Complex Interview Questions (Part 3 of 4)

These are complex Hadoop interview questions. This is my 3rd set of questions, useful for your interviews (3 of 4).

1). What are the features of Standalone (local) mode?
Ans). In stand-alone mode there are no daemons; everything runs on a single JVM. It has no DFS and utilizes the local file system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.

2). What are the features of Pseudo mode?
Ans). The pseudo mode is used both for development and in the QA environment. In pseudo mode, all the daemons run on the same machine.

3). Can we call VMs pseudos?
Ans). No, VMs are not pseudos, because a VM is something different and pseudo mode is very specific to Hadoop.

4). What are the features of Fully Distributed mode?
Ans). The fully distributed mode is used in the production environment, where we have 'n' number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There i…

Tutorial: SAP HANA Basics for Beginners

What is SAP HANA?
HANA stands for High-Performance Analytic Appliance. SAP HANA is a combination of hardware and software, and is therefore an appliance. SAP HANA supports both column-based and row-based storage, as the sketch below shows. We can store and perform analytics on a huge amount of real-time, non-aggregated transactional data. Hence, HANA acts as both a database and a warehousing tool, which helps in making decisions at the right time.

Challenges in a Traditional RDBMS
There are a few challenges in traditional databases, such as latency, the cost involved, and complexity in accessing databases.

Related: SAP HANA jobs and career options

What is the architecture of a traditional RDBMS?
Presentation Layer: This is the top-most layer; it allows users to manipulate data and input it for querying. This data input from users is passed on to the database layer through the application layer, and the results are passed back to the application layer to implement business logic. The presentation layer…
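To make the column- versus row-storage point concrete, SAP HANA lets you choose the store per table in SQL. A minimal sketch; the table and column names are invented for the example.

-- Column store: suited to scanning and aggregating many rows
CREATE COLUMN TABLE sales_facts (
    id     INT,
    region NVARCHAR(20),
    amount DECIMAL(10, 2)
);

-- Row store: suited to frequent single-row transactional access
CREATE ROW TABLE session_state (
    session_id INT,
    last_seen  TIMESTAMP
);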

Top 100 Hadoop Complex Interview Questions (Part 2 of 4)

I am giving a series of Hadoop interview questions. This is my 2nd set of questions. You can get quick benefits by reading these questions from start to end.

1). If a DataNode is full, how is that identified?
Ans). When data is stored in a DataNode, the metadata of that data is stored in the NameNode. So the NameNode will identify whether the DataNode is full.

2). If DataNodes increase, then do we need to upgrade the NameNode?
Ans). While installing the Hadoop system, the NameNode is sized based on the size of the cluster. Most of the time, we do not need to upgrade the NameNode, because it does not store the actual data, just the metadata, so such a requirement rarely arises.

3). Are the JobTracker and TaskTrackers present on separate machines?
Ans). Yes, the JobTracker and TaskTrackers are present on different machines. The reason is that the JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

4). When we send a data…

Top 100 Hadoop Complex Interview Questions (Part 1 of 4)

The list below contains complex interview questions as part of the Hadoop tutorial (part 1 of 4); you can go through these questions quickly.

1. What is BIG DATA?
Ans). Big Data is an assortment of such huge and complex data that it becomes very tedious to capture, store, process, retrieve, and analyze it with the help of on-hand database management tools or traditional data processing techniques.

2. Can you give some examples of Big Data?
Ans). There are many real-life examples of Big Data! Facebook is generating 500+ terabytes of data per day; NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day; a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are day-to-day examples of Big Data!

3. Can you give a detailed overview of the Big Data being generated by Facebook?
Ans). As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On average,…

How to Set Up a Hadoop Cluster: Top Ideas

This post explains setting up a Hadoop cluster on the CentOS operating system, so you can install CentOS either on your laptop or in a virtual machine.

Hadoop Cluster Setup Process

9-Step Process to Set Up a Hadoop Cluster

Step 1: Installing Sun Java on Linux. Commands to execute:
sudo apt-add-repository ppa:flexiondotorg/java
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo update-java-alternatives -s java-6-sun

Step 2: Create a Hadoop user. Commands to execute:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Step 3: Install the SSH server if it is not already present. Commands:
$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 4: Installing Hadoop. Commands:
$ wget http://www.eng.lsu.edu/mirrors/apache/hadoop/core/hadoop-0.22.0/hadoop-0.22.0.tar.gz
$ cd /home/hduser
$ tar xzf hadoop-0.…