The Ultimate Cheat Sheet On Hadoop

Question 1

You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?

A.      Writable
B.      WritableComparable
C.      InputFormat
D.      OutputFormat
E.       Combiner
F.       Partitioner

Ans:    e

Question 2

Where is the Hive metastore stored by default?
A.      In HDFS.
B.      On the client machine, as a flat file.
C.      On the client machine, in a Derby database.
D.      In the lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modified.
Ans: c

Question 3

 The NameNode uses RAM for the following purpose:

A.      To store the contents in HDFS.
B.      To store the filenames, list of blocks and other meta information.
C.      To store log that keeps track of changes in HDFS.
D.      To manage distributed read and write locks on files in HDFS.
Ans: b

Question 4

What is true about reduce-side joining?

A.      It requires a lot of in-memory processing.
B.      The amount of data written to the local disk of the DataNode running the reduce task increases.
C.      The reduce task generates more output data than input data.
D.      It requires declaring a custom partitioner and group comparator in the JobConf object.

Ans: a

Question 5

Consider the below query:

SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;

Is the output result stored in HDFS?

A.      Yes, inside newTable
B.      Yes, inside shakespeare.
C.      No, not at all.
D.      Maybe, depends on the permission given to the client

   Ans:   a

Question 6

One of the business analysts in your organization has strong expertise in C programming. He wants to clean and model the business data stored in HDFS. Which of the following is best suited for him?

A.      Hive
B.      Pig
D.      Oozie
E.      Hadoop Streaming

Ans: e

Question 7

Which process describes the life cycle of a mapper?

A.      The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B.      The TaskTracker spawns a new mapper process to process all records of a single InputSplit.
C.      The TaskTracker spawns a new mapper process to process each key-value pair.
D.      The JobTracker spawns a new mapper process to process all records of a single input file.

Ans: b
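
The lifecycle of a mapper can be pictured with a plain-Python stand-in; this is not Hadoop code. The names `ToyMapper` and `run_map_task` are hypothetical, and the sketch assumes the classic sequence of one configure() call, one map() call per record, and one close() call, all inside a single task that handles one input split.

```python
from typing import Iterable, List, Tuple


class ToyMapper:
    """Hypothetical stand-in for a mapper; not the Hadoop API."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def configure(self) -> None:
        # Called once, before any record of the split is seen.
        self.calls.append("configure")

    def map(self, key: int, value: str) -> List[Tuple[str, int]]:
        # Called once per record; may emit zero or more pairs.
        self.calls.append("map")
        return [(word, 1) for word in value.split()]

    def close(self) -> None:
        # Called once, after the last record of the split.
        self.calls.append("close")


def run_map_task(
    split: Iterable[Tuple[int, str]]
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Simulate one task: one mapper instance handles every record of one split."""
    mapper = ToyMapper()
    mapper.configure()
    output: List[Tuple[str, int]] = []
    for key, value in split:
        output.extend(mapper.map(key, value))
    mapper.close()
    return output, mapper.calls


# Three records in one split; the last record is empty and emits nothing.
split = [(0, "to be or"), (9, "not to be"), (19, "")]
pairs, calls = run_map_task(split)
```

Here `calls` records one configure, three map calls and one close, while `pairs` holds six emitted pairs, so the number of emitted pairs is not tied to the number of map() calls.
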
Question 8

How does the NameNode detect that a DataNode has failed?

A.      The NameNode does not need to know that DataNode has failed.
B.      When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode as failed.
C.      The NameNode pings the datanode. If the DataNode does not respond, the NameNode considers the DataNode failed.
D.      When HDFS starts up, the NameNode tries to communicate with the DataNode and considers the DataNodes failed if it does not respond.

Ans: b
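
Heartbeat-based failure detection can be sketched in a few lines of Python. This is a toy model, not HDFS code: `ToyNameNode`, `receive_heartbeat`, `dead_datanodes` and the `timeout_seconds` parameter are all hypothetical names, and the 30-second timeout is chosen only for illustration.

```python
from typing import Dict, List


class ToyNameNode:
    """Hypothetical model of heartbeat-based failure detection; not HDFS code."""

    def __init__(self, timeout_seconds: float) -> None:
        # timeout_seconds is an illustrative parameter, not a real HDFS setting name.
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Dict[str, float] = {}

    def receive_heartbeat(self, datanode_id: str, now: float) -> None:
        # DataNodes push heartbeats; the NameNode only records the arrival time.
        self.last_heartbeat[datanode_id] = now

    def dead_datanodes(self, now: float) -> List[str]:
        # A node is considered failed once its last heartbeat is too old.
        return sorted(
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


namenode = ToyNameNode(timeout_seconds=30.0)
namenode.receive_heartbeat("dn1", now=0.0)
namenode.receive_heartbeat("dn2", now=0.0)
namenode.receive_heartbeat("dn1", now=25.0)  # dn2 stays silent
failed = namenode.dead_datanodes(now=40.0)
```

The NameNode in this sketch never contacts a DataNode itself; it only notices that the heartbeats from `dn2` have stopped arriving.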

Question 9

Two files need to be joined on a common column. Which technique is faster and why?

A.      Reduce-side joining is faster because it receives the records sorted by key.
B.      Reduce-side joining is faster because it uses secondary sort.
C.      Map-side joining is faster because it caches the data from one file in memory.
D.      Map-side joining is faster because it writes the intermediate data to the local file system.

Ans: c
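
The idea behind a map-side join can be shown with an in-memory hash lookup. The sketch below is plain Python rather than a MapReduce job; `load_lookup`, `map_side_join` and the sample data are hypothetical, and it assumes the smaller dataset fits in memory.

```python
from typing import Dict, Iterable, Iterator, Tuple


def load_lookup(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Cache the small dataset in memory, keyed by the join column."""
    return {key: value for key, value in rows}


def map_side_join(
    big_rows: Iterable[Tuple[str, int]], lookup: Dict[str, str]
) -> Iterator[Tuple[str, str, int]]:
    """Join each record of the large dataset against the in-memory table.

    No shuffle or sort is needed: every record is resolved inside the map step.
    """
    for key, amount in big_rows:
        if key in lookup:  # inner join: drop records without a match
            yield key, lookup[key], amount


states = load_lookup([("CA", "California"), ("NY", "New York")])
orders = [("NY", 10), ("TX", 7), ("CA", 3)]
joined = list(map_side_join(orders, states))
```

Because each record is matched as it is read, nothing has to be sent to a reducer, which is where the speed advantage comes from.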

Question 10

You want to run two different jobs that may use the same lookup data (for example, US state codes). While submitting the first job, you use the distributed cache to copy the lookup data file to each data node. Both jobs have a mapper configure method where the distributed file is retrieved programmatically and its values are cached in a hash map. Both jobs use ToolRunner so that the file for the distributed cache can be provided at the command prompt. You run the first job with the data file passed to the distributed cache. When the job is complete, you fire the second job without passing the lookup file to the distributed cache. What is the consequence? (Select one)

A.      The first job runs but the second job fails, because the distributed cache persists only while the job is running. After the job completes, the distributed cache is removed.
B.      The first and second jobs complete without any problem, because once distributed cache files are set they are permanently copied.
C.      The first and second jobs complete successfully if the number of reducers is set to zero, because the distributed cache works only with map-only jobs.
D.      Both jobs succeed if they are chained using ChainMapper or ChainReducer, because the distributed cache works only with ChainMapper or ChainReducer.

Ans: a

Question 11

You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?

A.      Run all the nodes in your production cluster as virtual machines on your development workstation.
B.      Run the hadoop command with the -jt local and the -fs file:/// options.
C.      Run the DataNode, TaskTracker, JobTracker and NameNode daemons on a single machine.
D.      Run simpldoop, Apache open source software for simulating Hadoop cluster.
Ans: c

Question 12

 MapReduce is well-suited for all of the following EXCEPT? (Choose one)

A.    Text mining on large collections of unstructured documents.
B.    Analysis of large amounts of web logs (queries, clicks etc.).
C.   Online transaction processing (OLTP) for an e-commerce Website.
D.   Graph mining on a large social network (e.g. Facebook friend’s network).

Ans: c

Question 13

Your cluster has 10 Datanodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?

A.      About 3 TB
B.      About 5 TB
C.      About 10TB
D.      About 11TB

Ans: a
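
The arithmetic behind this question can be checked with a short calculation. The sketch assumes the default HDFS replication factor of 3 and ignores any non-HDFS overhead; `usable_capacity_tb` is a hypothetical helper name.

```python
def usable_capacity_tb(datanodes: int, disk_tb: float, replication: int) -> float:
    """Raw capacity divided by the number of copies kept of each block."""
    raw_tb = datanodes * disk_tb
    return raw_tb / replication


# 10 DataNodes x 1 TB each, with every block stored 3 times.
capacity = usable_capacity_tb(datanodes=10, disk_tb=1.0, replication=3)
print(f"{capacity:.2f} TB")
```

With 10 TB of raw disk and three copies of every block, roughly a third of the raw space is available for distinct data.
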
Question 14

Combiners increase the efficiency of a MapReduce program because:

A.      They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
B.      They provide an optimization and reduce the total number of computations that are needed to execute an algorithm by a factor of n, where n is the number of reducers.
C.      They aggregate map output locally in each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
D.      They aggregate intermediate map output to a small number of nearby (i.e. rack local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.

 Ans:   c
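
The effect of a combiner can be illustrated with a small word-count simulation. This is plain Python, not the Hadoop API: `map_words`, `combine` and the sample input are hypothetical, and each inner list of `mapper_inputs` stands for the data handled by one mapper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every word, as a word-count mapper would."""
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum counts per key on the mapper's own machine, before the shuffle."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Each inner list stands for the input handled by one mapper.
mapper_inputs = [["a b a", "a c"], ["b b", "b a"]]

shuffled_without = 0  # pairs sent over the network with no combiner
shuffled_with = 0     # pairs sent over the network after local combining
final_counts: Dict[str, int] = {}
for lines in mapper_inputs:
    local = [pair for line in lines for pair in map_words(line)]
    combined = combine(local)
    shuffled_without += len(local)
    shuffled_with += len(combined)
    for word, count in combined:  # the reducer sums the partial counts
        final_counts[word] = final_counts.get(word, 0) + count
```

In this toy run the mappers emit 9 pairs in total but only 5 cross the network after combining, and the final counts are unchanged.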

Question 15

When is the reduce method first called in a MapReduce Job?

A.      Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only and reduce-only jobs.
B.      Reducers start copying the intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
C.      Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
D.      Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.

Ans: c


Question 16

Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
A.      TaskTracker
B.      NameNode
C.      DataNode
D.      JobTracker
E.       Secondary Namenode.

Ans: a

Question 17

What is the maximum number of key-value pairs that a mapper can emit?

A.      It is equal to the number of lines in the input files.
B.      It is equal to the number of times the map() method is called in the mapper task.
C.      There is no such restriction. It depends on the use case and logic.
D.      1000

Ans: c


Question 18

What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

A.      You will not be able to compress your intermediate data.
B.      You will no longer be able to take advantage of a Combiner.
C.      The output files may not be in global sorted order.
D.      There is no problem.

Ans: c
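
Hash partitioning with more than one reducer can be simulated as follows. The partition formula mirrors the one HashPartitioner uses, (hashCode & Integer.MAX_VALUE) % numReduceTasks. Everything else is an assumption made for illustration: the keys are treated as Java Strings (Hadoop's Text type computes its hash differently), and `java_string_hash` and `hash_partition` are hypothetical helper names.

```python
from typing import Dict, List


def java_string_hash(text: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer.

    Valid for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Same formula as HashPartitioner: (hashCode & Integer.MAX_VALUE) % numReduceTasks."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


keys = ["d", "a", "c", "b"]
partitions: Dict[int, List[str]] = {0: [], 1: []}
for key in keys:
    partitions[hash_partition(key, 2)].append(key)

# Each reducer sees its own keys in sorted order and writes one output file.
outputs = [sorted(partitions[r]) for r in sorted(partitions)]
concatenated = [key for part in outputs for key in part]
```

Each reducer's output is sorted on its own (`["b", "d"]` and `["a", "c"]`), but reading the files one after another gives b, d, a, c, which is not a global sort order.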

Question 19

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?

A. Combiner
B. Mapper
C. Reducer
D. Reducer
E. Combiner

Question 20

During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?

A.      Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B.      Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C.      Keys are presented to a reducer in random order; values for a given key are not sorted.
D.      Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Ans: a
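
What a reducer receives after the shuffle and sort can be sketched in plain Python. `shuffle_and_sort` is a hypothetical helper, not a Hadoop function; it assumes a single reducer and keeps values in arrival order to stand in for the fact that no value ordering is guaranteed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def shuffle_and_sort(map_output: List[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key and order the groups by key only.

    Values keep their arrival order, which in a real job depends on which
    map outputs are fetched first; no ordering among them is guaranteed.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)
    return [(key, groups[key]) for key in sorted(groups)]


map_output = [("pear", 3), ("apple", 9), ("pear", 1), ("apple", 2), ("fig", 5)]
reducer_input = shuffle_and_sort(map_output)
```

The keys come out as apple, fig, pear, while the values for apple stay as 9 then 2, that is, not sorted. Ordering the values as well would need a secondary sort.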


