
The Ultimate Cheat Sheet On Hadoop

This Hadoop cheat sheet lists the top 20 frequently asked questions to test your Hadoop knowledge. Try to find your own answers first, then compare them with the answers given here.
Question #1 
You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?
  • A. Writable
  • B. WritableComparable
  • C. InputFormat
  • D. OutputFormat
  • E. Combiner
  • F. Partitioner
Ans: e
Question #2
Where is the Hive metastore stored by default?
  • A. In HDFS
  • B. On the client machine, in the form of a flat file.
  • C. On the client machine, in a Derby database.
  • D. In lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modified.
Ans: c
Question #3
The NameNode uses RAM for the following purpose:
  • A. To store the contents in HDFS.
  • B. To store the filenames, list of blocks and other meta information.
  • C. To store log that keeps track of changes in HDFS.
  • D. To manage distributed read and write locks on files in HDFS.
Ans: b
Question #4
What is true about reduce-side joining?
  • A. It requires a lot of in-memory processing.
  • B. The amount of data written to the local disk of the DataNode running the reduce task increases.
  • C. The reduce task generates more output data than input data.
  • D. It requires declaring a custom partitioner and group comparator in the JobConf object.
Ans: a 
Question #5
Consider the below query:
INSERT OVERWRITE TABLE newTable
SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;
Is the output result stored in HDFS?
  • A. Yes, inside newTable
  • B. Yes, inside shakespeare.
  • C. No, not at all.
  • D. Maybe, depends on the permission given to the client
Ans: a
Question #6
One of the business analysts in your organization has very good expertise in C coding. He wants to clean and model the business data stored in HDFS. Which of the following is best suited for him?
  • A. HIVE
  • B. PIG
  • C. MAPREDUCE
  • D. OOZIE
  • E. HadoopStreaming
Ans: e
Question #7
Which process describes the life cycle of a mapper?
  • A. The JobTracker calls the TaskTracker’s configure() method, then its map() method and finally its close() method.
  • B. The TaskTracker spawns a new mapper process to process all records of a single InputSplit.
  • C. The TaskTracker spawns a new mapper process to process each key-value pair.
  • D. The JobTracker spawns a new mapper process to process all records of a single input file.
Ans: b
Question #8
How does the NameNode detect that a DataNode has failed?
  • A. The NameNode does not need to know that DataNode has failed.
  • B. When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode as failed.
  • C. The NameNode pings the DataNode. If the DataNode does not respond, the NameNode considers the DataNode failed.
  • D. When HDFS starts up, the NameNode tries to communicate with each DataNode and considers a DataNode failed if it does not respond.
Ans: b
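The heartbeat idea can be sketched in a few lines. This is a plain-Python simulation, not HDFS code: the `HeartbeatMonitor` class and the 30-second timeout are hypothetical names and values chosen only for illustration.

```python
# Plain-Python sketch of heartbeat-based failure detection (not the HDFS implementation).
# TIMEOUT_SECONDS is a hypothetical value chosen for illustration.
TIMEOUT_SECONDS = 30.0


class HeartbeatMonitor:
    def __init__(self, timeout=TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of the last heartbeat received

    def record_heartbeat(self, node_id, now):
        # Nodes push heartbeats to the monitor; the monitor never pings them.
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        # A node counts as failed once its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor()
monitor.record_heartbeat("dn1", now=0.0)
monitor.record_heartbeat("dn2", now=0.0)
monitor.record_heartbeat("dn1", now=25.0)  # dn2 stays silent after time 0
print(monitor.dead_nodes(now=40.0))  # ['dn2']
```

The design point is that the monitoring side only compares timestamps; it does not need to open a connection to any node to decide that one has gone quiet.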
Question #9
Two files need to be joined over a common column. Which technique is faster and why?
  • A. Reduce-side joining is faster as it receives the records sorted by keys.
  • B. Reduce-side joining is faster as it uses secondary sort.
  • C. Map-side joining is faster as it caches the data from one file in memory.
  • D. Map-side joining is faster as it writes the intermediate data to the local file system.
Ans: c
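A map-side join that caches the smaller file in memory can be sketched as follows. This is a plain-Python illustration under the assumption that one dataset fits in memory; `map_side_join` and the sample data are hypothetical and no Hadoop API is used.

```python
# Plain-Python sketch of a map-side (in-memory lookup) join; not Hadoop code.
def map_side_join(small_rows, large_rows):
    # Load the small dataset into a hash map once, as a mapper would at setup time.
    lookup = dict(small_rows)
    # Stream the large dataset and join each record locally, with no shuffle step.
    for key, value in large_rows:
        if key in lookup:
            yield key, value, lookup[key]


states = [("CA", "California"), ("NY", "New York")]        # small lookup file
orders = [("CA", 120), ("TX", 75), ("NY", 40), ("CA", 15)]  # large file

joined = list(map_side_join(states, orders))
print(joined)
```

Because every record is joined where it is read, nothing has to be sorted or sent to a reducer, which is where the speed advantage comes from.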
Question #10
You want to run two different jobs that may use the same lookup data (for example, US state codes). While submitting the first job, you used the distributed cache to copy the lookup data file to each data node. Both jobs have a mapper configure method in which the distributed file is retrieved programmatically and its values are cached in a hash map. Both jobs use ToolRunner so that the file for the distributed cache can be provided at the command prompt. You run the first job with the data file passed to the distributed cache. When the job is complete, you start the second job without passing the lookup file to the distributed cache. What is the consequence? (Select one)
  • A. The first job runs but the second job fails. This is because the distributed cache persists only while the job is running. After the job is complete, the distributed cache is removed.
  • B. Both jobs complete without any problem, because distributed cache files, once set, are permanently copied.
  • C. Both jobs complete successfully if the number of reducers is set to zero, because the distributed cache works only with map-only jobs.
  • D. Both jobs are successful if they are chained using ChainMapper or ChainReducer, because the distributed cache works only with ChainMapper or ChainReducer.
Ans: a
Question #11
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
  • A. Run all the nodes in your production cluster as virtual machines on your development workstation.
  • B. Run the hadoop command with the -jt local and the -fs file:/// options.
  • C. Run the DataNode, TaskTracker, JobTracker and NameNode daemons on a single machine.
  • D. Run simpldoop, Apache open source software for simulating Hadoop cluster.
Ans: c
Question #12
MapReduce is well-suited for all of the following EXCEPT? (Choose one)
  • A. Text mining on large collections of unstructured documents.
  • B. Analysis of large amounts of web logs (queries, clicks etc.).
  • C. Online transaction processing (OLTP) for an e-commerce Website.
  • D. Graph mining on a large social network (e.g. Facebook friend’s network).
Ans: c
Question #13
Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?
  • A. About 3 TB
  • B. About 5 TB
  • C. About 10 TB
  • D. About 11 TB
Ans: a
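The capacity arithmetic can be checked with a short calculation. It assumes the HDFS default replication factor of 3 and ignores any space used by the operating system or temporary files; the variable names are chosen for illustration.

```python
# Usable HDFS capacity = raw capacity / replication factor.
datanodes = 10
disk_tb_per_node = 1.0
replication_factor = 3  # HDFS default (dfs.replication)

raw_tb = datanodes * disk_tb_per_node
usable_tb = raw_tb / replication_factor

print(f"raw: {raw_tb} TB, usable: {usable_tb:.2f} TB")  # raw: 10.0 TB, usable: 3.33 TB
```

Every block is stored three times, so 10 TB of raw disk holds only about 3.3 TB of distinct data.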
Question #15
Combiners increase the efficiency of a MapReduce program because:
  • A. They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
  • B. They provide an optimization and reduce the total number of computations that are needed to execute an algorithm by a factor of n, where n is the number of reducers.
  • C. They aggregate map output locally in each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
  • D. They aggregate intermediate map output to a small number of nearby (i.e. rack local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
Ans: c
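Local aggregation by a combiner can be illustrated with a word count. This is a plain-Python sketch rather than Hadoop code; `map_words` and `combine` are hypothetical helper names.

```python
# Plain-Python sketch of a combiner in a word count; not Hadoop code.
from collections import Counter


def map_words(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Combiner: sum the counts per key on the mapper's own machine,
    # before anything is sent over the network.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining")  # 6 pairs before combining
print(len(combined), "pairs after combining")        # 4 pairs after combining
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The saving grows with the amount of key repetition inside each mapper's output; with no repeated keys a combiner sends exactly as many pairs as the mapper produced.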
Question #16
When is the reduce method first called in a MapReduce Job?
  • A. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only and reduce-only jobs.
  • B. Reducers start copying the intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
  • C. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
  • D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
Ans: c
Question #17
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
  • A. TaskTracker
  • B. NameNode
  • C. DataNode
  • D. JobTracker
  • E. Secondary Namenode
Ans: d
Question #18
What is the maximum number of key-value pairs that a mapper can emit?
  • A. It is equivalent to the number of lines in the input files.
  • B. It is equivalent to the number of times the map() method is called in the mapper task.
  • C. There is no such restriction. It depends on the use case and logic.
  • D. 1000
Ans: c
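A single call to a map function may emit zero, one, or many pairs, depending only on its logic. The sketch below shows this in plain Python; `map_line` is a hypothetical function and not part of any Hadoop API.

```python
# Plain-Python sketch: the number of emitted pairs is decided by the map logic.
def map_line(line):
    # Emit one (word, 1) pair per word, so an empty line emits nothing.
    for word in line.split():
        yield word, 1


lines = ["", "hadoop", "map reduce shuffle sort"]
emitted_per_call = [len(list(map_line(line))) for line in lines]
print(emitted_per_call)  # [0, 1, 4]
```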
Question #19
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
  • A. You will not be able to compress your intermediate data.
  • B. You will no longer be able to take advantage of a Combiner.
  • C. The output files may not be in global sorted order.
  • D. There is no problem.
Ans: c
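The effect of hash partitioning on output order can be simulated in plain Python. Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the `partition` function below is a hypothetical stand-in that uses a simple byte sum so the result is reproducible, and the keys are sample data.

```python
# Plain-Python sketch of hash partitioning across two reducers; not Hadoop code.
def partition(key, num_reducers):
    # Stand-in hash: sum of the key's bytes, modulo the number of reducers.
    return sum(key.encode("utf-8")) % num_reducers


keys = ["apple", "banana", "cherry", "date", "fig", "grape"]
num_reducers = 2

# Each reducer receives its keys in sorted order and writes one output file.
parts = [sorted(k for k in keys if partition(k, num_reducers) == r)
         for r in range(num_reducers)]
concatenated = [k for part in parts for k in part]

print(parts)         # [['apple', 'date', 'fig'], ['banana', 'cherry', 'grape']]
print(concatenated)  # sorted within each part, but not across parts
```

Each part is sorted on its own, yet reading the parts one after another does not give a globally sorted list, because the hash spreads neighbouring keys across different reducers.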
Question #20
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
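  • A. Combiner<Text, IntWritable, Text, IntWritable>
  • B. Mapper<Text, IntWritable, Text, IntWritable>
  • C. Reducer<Text, Text, IntWritable, IntWritable>
  • D. Reducer<Text, IntWritable, Text, IntWritable>
  • E. Combiner<Text, Text, IntWritable, IntWritable>
Ans: d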
Question #21 (Bonus Question)
During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?
  • A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
  • B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
  • C. Keys are presented to a reducer in random order; values for a given key are not sorted.
  • D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Ans: a
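What a reducer sees after the shuffle can be sketched in plain Python. The `shuffle` function is a hypothetical simulation, not Hadoop code: it sorts the keys but leaves each key's values in arrival order.

```python
# Plain-Python sketch of the grouping a reducer receives; not Hadoop code.
from collections import defaultdict


def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # values keep arrival order; they are not sorted
    # Keys are handed to the reducer in sorted order.
    return [(key, groups[key]) for key in sorted(groups)]


pairs = [("b", 3), ("a", 9), ("b", 1), ("a", 2)]
reducer_input = shuffle(pairs)
print(reducer_input)  # [('a', [9, 2]), ('b', [3, 1])]
```

If a job needs the values of each key in a particular order, it has to arrange that itself, for example with a secondary sort.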
