The Ultimate Cheat Sheet On Hadoop

Question 1

You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?

A.      Writable
B.      WritableComparable
C.      InputFormat
D.      OutputFormat
E.       Combiner
F.       Partitioner

Ans:    e

Question 2

Where is the Hive metastore stored by default?

A.      In HDFS
B.      On the client machine, as a flat file.
C.      On the client machine, in a Derby database.
D.      In the lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modified.

Ans:   c

Question 3

 The NameNode uses RAM for the following purpose:

A.      To store the contents in HDFS.
B.      To store the filenames, list of blocks and other meta information.
C.      To store log that keeps track of changes in HDFS.
D.      To manage distributed read and write locks on files in HDFS.
Ans: b

Question 4

What is true about reduce-side joining?

A.      It requires a lot of in-memory processing.
B.      The amount of data written to the local disk of the DataNode running the reduce task increases.
C.      The reduce task generates more output data than input data.
D.      It requires declaring a custom partitioner and group comparator in the JobConf object.

Ans:    a

Question 5

Consider the below query:

SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;

Is the output result stored in HDFS?

A.      Yes, inside newTable
B.      Yes, inside shakespeare.
C.      No, not at all.
D.      Maybe, depends on the permission given to the client

   Ans:   a

Question 6

One of the business analysts in your organization has strong expertise in C programming. He wants to clean and model the business data stored in HDFS. Which of the following is best suited for him?

A.      HIVE
B.      PIG
D.      OOZIE
E.       HadoopStreaming

Ans:     e

Question 7

Which process describes the life cycle of a mapper?

A.      The JobTracker calls the TaskTracker’s configure() method, then its map() method, and finally its close() method.
B.      The TaskTracker spawns a new mapper process to process all records of a single InputSplit.
C.      The TaskTracker spawns a new mapper process to process each key-value pair.
D.      The JobTracker spawns a new mapper process to process all records of a single input file.

Ans:  b

Question 8

How does the NameNode detect that a DataNode has failed?

A.      The NameNode does not need to know that DataNode has failed.
B.      When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode as failed.
C.      The NameNode pings the datanode. If the DataNode does not respond, the NameNode considers the DataNode failed.
D.      When HDFS starts up, the NameNode tries to communicate with the DataNode and considers the DataNodes failed if it does not respond.

Ans: b
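
The heartbeat mechanism described in option B can be illustrated with a small simulation. This is a minimal sketch, not HDFS code: the `HeartbeatMonitor` class, its method names, and the 30-second timeout are hypothetical choices made for illustration, and a real cluster derives its timeout from configuration.

```python
class HeartbeatMonitor:
    """Toy model of a NameNode tracking DataNode heartbeats (hypothetical names)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its most recent heartbeat

    def record_heartbeat(self, node_id, now):
        # DataNodes push heartbeats; the monitor never pings them.
        self.last_seen[node_id] = now

    def failed_nodes(self, now):
        # A node counts as failed once its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout, for illustration
monitor.record_heartbeat("dn1", now=0)
monitor.record_heartbeat("dn2", now=0)
monitor.record_heartbeat("dn1", now=25)  # dn2 stays silent after time 0
print(monitor.failed_nodes(now=40))
```

The point of the sketch is the direction of communication: the failure decision is based only on missing heartbeats, not on the NameNode contacting the DataNode.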

Question 9

Two files need to be joined over a common column. Which technique is faster and why?

A.      Reduce-side joining is faster because it receives the records sorted by key.
B.      Reduce-side joining is faster because it uses secondary sort.
C.      Map-side joining is faster because it caches the data from one file in memory.
D.      Map-side joining is faster because it writes the intermediate data to the local file system.

Ans:   c
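
The idea behind a map-side join, where one file is small enough to be cached in memory, can be sketched in plain Python. This is an illustrative model only, not the Hadoop API: `map_side_join` and the sample records are hypothetical.

```python
def map_side_join(small_records, large_records):
    """Join two lists of (key, value) pairs by caching the small side in memory."""
    # Build the in-memory lookup table once, as a mapper would before reading input.
    lookup = {}
    for key, value in small_records:
        lookup.setdefault(key, []).append(value)

    # Stream the large side; each record is joined without any shuffle or sort.
    joined = []
    for key, value in large_records:
        for small_value in lookup.get(key, []):
            joined.append((key, small_value, value))
    return joined


# Hypothetical sample data: a small lookup file and a larger fact file.
states = [("CA", "California"), ("NY", "New York")]
orders = [("NY", 120), ("CA", 75), ("TX", 40), ("NY", 15)]
result = map_side_join(states, orders)
print(result)
```

Because the join happens while the large file is being read, no intermediate data has to be sent to reducers, which is where the speed advantage comes from.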

Question 10

You want to run two different jobs that may use the same lookup data (for example, US state codes). While submitting the first job, you used the distributed cache to copy the lookup data file to each data node. Both jobs have a mapper configure method in which the distributed file is retrieved programmatically and its values are cached in a hash map. Both jobs use ToolRunner so that the file for the distributed cache can be provided at the command prompt. You run the first job with the data file passed to the distributed cache. When that job is complete, you start the second job without passing the lookup file to the distributed cache. What is the consequence? (Select one)

A.      The first job runs but the second job fails. This is because the distributed cache persists only as long as the job is not complete. After the job is complete, the distributed cache is removed.
B.      The first and second jobs complete without any problem, because once distributed cache files are set they are copied permanently.
C.      The first and second jobs complete successfully if the number of reducers is set to zero, because the distributed cache works only with map-only jobs.
D.      Both jobs are successful if they are chained using ChainMapper or ChainReducer, because the distributed cache works only with ChainMapper or ChainReducer.

Ans:   a

Question 11

You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?

A.      Run all the nodes in your production cluster as virtual machines on your development workstation.
B.      Run the hadoop command with the -jt local and the -fs file:/// options.
C.      Run the DataNode, TaskTracker, JobTracker, and NameNode daemons on a single machine.
D.      Run simpldoop, Apache open source software for simulating Hadoop cluster.

Ans: c

Question 12

 MapReduce is well-suited for all of the following EXCEPT? (Choose one)

A.    Text mining on large collections of unstructured documents.
B.    Analysis of large amounts of web logs (queries, clicks etc.).
C.   Online transaction processing (OLTP) for an e-commerce Website.
D.   Graph mining on a large social network (e.g. Facebook friend’s network).

Ans: c

Question 13

Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?

A.      About 3 TB
B.      About 5 TB
C.      About 10TB
D.      About 11TB

Ans: a
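
The arithmetic behind this question is short enough to check directly. The sketch below assumes the default HDFS replication factor of 3 and ignores any overhead beyond replication. `usable_capacity_tb` is a hypothetical helper name.

```python
def usable_capacity_tb(datanodes, disk_tb_per_node, replication_factor):
    """Usable HDFS capacity when every block is stored replication_factor times."""
    raw_tb = datanodes * disk_tb_per_node
    return raw_tb / replication_factor


# 10 DataNodes x 1 TB each, with the default replication factor of 3.
capacity = usable_capacity_tb(datanodes=10, disk_tb_per_node=1, replication_factor=3)
print(f"{capacity:.2f} TB")  # prints "3.33 TB"
```

With 10 TB of raw disk and three copies of every block, roughly a third of the raw space is available for unique data, which matches "about 3 TB".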

Question 14

Combiners increase the efficiency of a MapReduce program because:

A.      They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
B.      They provide an optimization and reduce the total number of computations that are needed to execute an algorithm by a factor of n, where n are the number of reducers.
C.      They aggregate map output locally in each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
D.      They aggregate intermediate map output to a small number of nearby (i.e. rack local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.

 Ans:   c
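
The effect of local aggregation can be seen in a toy word count. This is a plain Python simulation, not Hadoop code: `map_words`, `combine`, and the two input lines are hypothetical, and each line stands in for the input of one mapper.

```python
from collections import Counter


def map_words(line):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sum counts per word locally; the same logic also serves as the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


# Each string stands in for the input of one mapper (hypothetical data).
mapper_inputs = ["to be or not to be", "to do or not to do"]

# Pairs that would cross the network without and with a combiner.
without_combiner = [pair for line in mapper_inputs for pair in map_words(line)]
with_combiner = [pair for line in mapper_inputs for pair in combine(map_words(line))]

print(len(without_combiner), len(with_combiner))  # prints "12 8"
```

The final totals are the same either way. Only the number of pairs shuffled to the reducers changes, and the saving grows with the amount of key repetition inside each mapper's output.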

Question 15

When is the reduce method first called in a MapReduce Job?

A.      Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only and reduce-only jobs.
B.      Reducers start copying the intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
C.      Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
D.      Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.

Ans: c


Question 16

Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
A.      TaskTracker
B.      NameNode
C.      DataNode
D.      JobTracker
E.       Secondary Namenode.

Ans: a

Question 17

What is the maximum number of key-value pairs that a mapper can emit?

A.       It's equal to the number of lines in the input files.
B.       It's equal to the number of times the map() method is called in the mapper task.
C.      There is no such restriction. It depends on the use case and logic.
D.       1000

Ans: c


Question 18

What is the disadvantage of using multiple reducers with the default HashPartitioner to distribute your workload across your cluster?

A.      You will not be able to compress your intermediate data.
B.      You will no longer be able to take advantage of a Combiner.
C.      The output files may not be in globally sorted order.
D.      There is no problem.

Ans:  c
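
Why hash partitioning breaks global ordering can be shown with a small simulation. This is a sketch under simplifying assumptions, not the Hadoop implementation: keys are non-negative integers, and `key % num_reducers` stands in for the hash-then-modulo step. `hash_partition` and the sample keys are hypothetical.

```python
def hash_partition(keys, num_reducers):
    """Assign each key to a reducer by hash, then sort within each reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        # Stand-in for hashing the key and taking it modulo the reducer count.
        partitions[key % num_reducers].append(key)
    # Each reducer sees its own keys in sorted order.
    return [sorted(partition) for partition in partitions]


keys = [7, 2, 9, 4, 1, 10, 3, 8, 6, 5]
part_files = hash_partition(keys, num_reducers=3)
concatenated = [key for part in part_files for key in part]

print(part_files)    # prints "[[3, 6, 9], [1, 4, 7, 10], [2, 5, 8]]"
print(concatenated)  # prints "[3, 6, 9, 1, 4, 7, 10, 2, 5, 8]"
```

Every output file is sorted on its own, but reading the files one after another does not give a globally sorted sequence, because neighbouring keys land on different reducers.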

Question 19

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?

A. Combiner
B. Mapper
C. Reducer
D. Reducer
E. Combiner

Question 20

During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?

A.      Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B.      Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C.      Keys are presented to a reducer in random order; values for a given key are not sorted.
D.      Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Ans: a
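
The behavior described in option A, sorted keys with values in no guaranteed order, can be modelled in a few lines. This is a simplified simulation rather than the real shuffle: `shuffle_and_sort` is a hypothetical helper, and values are simply kept in arrival order.

```python
def shuffle_and_sort(map_outputs):
    """Group values by key and present keys in sorted order, values unsorted."""
    grouped = {}
    for key, value in map_outputs:
        # Values are appended in arrival order; nothing sorts them.
        grouped.setdefault(key, []).append(value)
    return [(key, grouped[key]) for key in sorted(grouped)]


# Hypothetical intermediate pairs emitted by mappers.
map_outputs = [("pear", 3), ("apple", 9), ("pear", 1), ("apple", 2), ("fig", 5)]
reducer_input = shuffle_and_sort(map_outputs)
print(reducer_input)  # prints "[('apple', [9, 2]), ('fig', [5]), ('pear', [3, 1])]"
```

If a job needs its values in a particular order, it has to arrange that itself, for example with a secondary sort.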


