The Ultimate Cheat Sheet On Hadoop

The cheat sheet below collects 20 frequently asked questions to test your Hadoop knowledge. Try to work out your own answers first, then compare them with the answers given here.

Question #1 

You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?

A. Writable
B. WritableComparable
C. InputFormat
D. OutputFormat
E. Combiner
F. Partitioner
Ans: e

Question #2 

Where is the Hive metastore stored by default?

B. On the client machine, in the form of a flat file.
C. On the client machine, in a Derby database.
D. In lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modified.
Ans: c

Question #3
The NameNode uses RAM for the following purpose:

A. To store the contents in HDFS.
B. To store the filenames, list of blocks and other meta information.
C. To store log that keeps track of changes in HDFS.
D. To manage distributed read and write locks on files in HDFS.
Ans: b

Question #4
What is true about reduce-side joining?

A. It requires a lot of in-memory processing.
B. The amount of data written to the local disk of the DataNode running the reduce task increases.
C. The reduce task generates more output data than input data.
D. It requires declaring a custom partitioner and group comparator in the JobConf object.
Ans: a

Question #5 

Consider the below query:
SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;
Is the output result stored in HDFS?

A. Yes, inside newTable
B. Yes, inside shakespeare.
C. No, not at all.
D. Maybe, depends on the permission given to the client
Ans: a

Question #6 

One of the business analysts in your organization has very good expertise in C coding. He wants to clean and model the business data stored in HDFS. Which of the following is best suited for him?

E. HadoopStreaming
Ans: c 

Question #7 

Which process describes the life cycle of a mapper?

A. The JobTracker calls the TaskTracker’s configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new mapper process to process all records of a single InputSplit.
C. The TaskTracker spawns a new mapper process to process each key-value pair.
D. The JobTracker spawns a new mapper process to process all records of a single input file.
Ans: b
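
To make the lifecycle in option B concrete, here is a minimal sketch in plain Python: one mapper object handles every record of one input split, with a configure call first, one map call per record, and a close call at the end. The names `LoggingMapper` and `run_map_task` are hypothetical stand-ins for what the framework does and are not part of Hadoop's API.

```python
class LoggingMapper:
    """Hypothetical mapper that records the order of its lifecycle calls."""

    def __init__(self):
        self.calls = []

    def configure(self):
        # Called once, before any record of the split is processed.
        self.calls.append("configure")

    def map(self, key, value):
        # Called once per record; emits (word, 1) pairs.
        self.calls.append("map")
        return [(word, 1) for word in value.split()]

    def close(self):
        # Called once, after the last record of the split.
        self.calls.append("close")


def run_map_task(split):
    """Simulate one map task: a single mapper instance handles one whole split."""
    mapper = LoggingMapper()
    mapper.configure()
    output = []
    for offset, line in enumerate(split):
        output.extend(mapper.map(offset, line))
    mapper.close()
    return mapper.calls, output


calls, output = run_map_task(["hello hadoop", "hello hive"])
print(calls)
print(output)
```

The same mapper object sees both records, which is the point of the question: a mapper is not created per key-value pair.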

Question #8 

How does the NameNode detect that a DataNode has failed?

A. The NameNode does not need to know that a DataNode has failed.
B. When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode as failed.
C. The NameNode pings the DataNode. If the DataNode does not respond, the NameNode considers the DataNode failed.
D. When HDFS starts up, the NameNode tries to communicate with each DataNode and considers a DataNode failed if it does not respond.
Ans: b
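
The heartbeat idea in option B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_dead_datanodes`, the node names, the timestamps and the timeout value are all made up for the example and do not reflect the NameNode's actual code or configuration.

```python
def find_dead_datanodes(last_heartbeat, now, timeout_seconds):
    """Return the nodes whose most recent heartbeat is older than the timeout."""
    return sorted(
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )


# Hypothetical timestamps in seconds; the timeout is illustrative only.
last_heartbeat = {
    "datanode-1": 1000.0,
    "datanode-2": 400.0,
    "datanode-3": 995.0,
}
dead = find_dead_datanodes(last_heartbeat, now=1003.0, timeout_seconds=600.0)
print(dead)
```

The NameNode never has to poll: it only tracks when each DataNode last reported in, and silence past the timeout marks the node as failed.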

Question #9 

Two files need to be joined over a common column. Which technique is faster and why?

A. Reduce-side joining is faster as it receives the records sorted by keys.
B. Reduce-side joining is faster as it uses secondary sort.
C. Map-side joining is faster as it caches the data from one file in memory.
D. Map-side joining is faster as it writes the intermediate data to the local file system.
Ans: c
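
As a rough sketch of the map-side join described in option C, the Python below caches the smaller dataset in a dictionary and streams the larger one past it, so no shuffle to a reducer is needed. The function name `map_side_join` and the sample data are assumptions made for illustration.

```python
def map_side_join(small_rows, large_rows):
    """Join two datasets on their first field by caching the small one in memory."""
    # Built once, much like a mapper loading a lookup file before processing records.
    lookup = {key: value for key, value in small_rows}
    for key, value in large_rows:
        if key in lookup:
            # Inner join: rows without a match in the lookup are dropped.
            yield key, value, lookup[key]


states = [("CA", "California"), ("NY", "New York")]
orders = [("NY", 120), ("TX", 75), ("CA", 300), ("NY", 40)]
joined = list(map_side_join(states, orders))
print(joined)
```

This only works when one side of the join is small enough to fit in memory on every mapper.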

Question #10 

You want to run two different jobs that may use the same lookup data (for example, US state codes). When submitting the first job, you used the distributed cache to copy the lookup data file to each data node. Both jobs have a mapper configure method in which the distributed file is retrieved programmatically and its values are cached in a hash map. Both jobs use ToolRunner so that the file for the distributed cache can be provided at the command prompt. You run the first job with the data file passed to the distributed cache. When that job is complete, you launch the second job without passing the lookup file to the distributed cache. What is the consequence? (Select one)

A. The first job runs but the second job fails, because the distributed cache persists only as long as the job is not complete. After the job is complete, the distributed cache is removed.
B. Both jobs complete without any problem, because files placed in the distributed cache are copied permanently.
C. Both jobs complete successfully if the number of reducers is set to zero, because the distributed cache works only with map-only jobs.
D. Both jobs are successful if they are chained using ChainMapper or ChainReducer, because the distributed cache works only with ChainMapper or ChainReducer.
Ans: a

Question #11 

You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
A. Run all the nodes in your production cluster as virtual machines on your development workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, JobTracker and NameNode daemons on a single machine.
D. Run simpldoop, Apache open source software for simulating Hadoop cluster.
Ans: c

Question #12 

MapReduce is well-suited for all of the following EXCEPT? (Choose one)

A. Text mining on large collections of unstructured documents.
B. Analysis of large amounts of web logs (queries, clicks etc.).
C. Online transaction processing (OLTP) for an e-commerce Website.
D. Graph mining on a large social network (e.g. the Facebook friends network).
Ans: c

Question #13 

Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?

A. About 3 TB
B. About 5 TB
C. About 10 TB
D. About 11 TB
Ans: a
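
The arithmetic behind this question is short enough to check in code. The sketch below assumes the default HDFS replication factor of 3, so every block is stored three times and the usable capacity is the raw capacity divided by three. The function name `usable_capacity_tb` is made up for the example.

```python
def usable_capacity_tb(datanodes, disk_tb_per_node, replication_factor=3):
    """Raw disk space divided by the number of copies kept of each block."""
    raw_tb = datanodes * disk_tb_per_node
    return raw_tb / replication_factor


capacity = usable_capacity_tb(datanodes=10, disk_tb_per_node=1)
print(f"About {capacity:.1f} TB")
```

With 10 TB of raw disk and three copies of everything, roughly 3 TB of distinct data fits in the cluster.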

Question #14 

Combiners increase the efficiency of a MapReduce program because:

A. They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
B. They provide an optimization and reduce the total number of computations that are needed to execute an algorithm by a factor of n, where n is the number of reducers.
C. They aggregate map output locally in each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
D. They aggregate intermediate map output to a small number of nearby (i.e. rack local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
Ans: c
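
A small word-count simulation shows why local aggregation helps. This is a sketch in plain Python, not Hadoop code: `map_words` and `combine` are hypothetical helpers, and each inner list in `mapper_inputs` stands for the input of one mapper.

```python
from collections import Counter


def map_words(lines):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum the counts per word locally, before anything crosses the network."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Each inner list is the input of one (simulated) mapper.
mapper_inputs = [
    ["to be or not to be"],
    ["to do or not to do"],
]

without_combiner = [map_words(lines) for lines in mapper_inputs]
with_combiner = [combine(pairs) for pairs in without_combiner]

shuffled_without = sum(len(pairs) for pairs in without_combiner)
shuffled_with = sum(len(pairs) for pairs in with_combiner)
print(shuffled_without, shuffled_with)
```

Each mapper's output is combined on its own, so fewer pairs have to be shuffled to the reducers while the final totals stay the same.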

Question #15 
When is the reduce method first called in a MapReduce Job?

A. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only and reduce-only jobs.
B. Reducers start copying the intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
C. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
Ans: c

Question #16 
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.

A. TaskTracker
B. NameNode
C. DataNode
D. JobTracker
E. Secondary NameNode
Ans: a

Question #17

What is the maximum number of key-value pairs that a mapper can emit?

A. It is equal to the number of lines in the input files.
B. It is equal to the number of times the map() method is called in the mapper task.
C. There is no such restriction. It depends on the use case and logic.
D. 1000
Ans: c

Question #18
What is the disadvantage of using multiple reducers with the default HashPartitioner to distribute your workload across your cluster?

A. You will not be able to compress your intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. The output files may not be in global sorted order.
D. There is no problem.
Ans: c
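
The effect on sort order is easy to see in a simulation. The sketch below is plain Python under stated assumptions: `simple_hash` is a deterministic stand-in for a key's hash code (it is not Java's `hashCode`), and `partition_and_sort` is a hypothetical helper that assigns keys to reducers by hash modulo the reducer count and sorts each reducer's keys.

```python
def simple_hash(key):
    """Deterministic stand-in for a key's hash code (not Java's hashCode)."""
    return sum(ord(ch) for ch in key)


def partition_and_sort(keys, num_reducers):
    """Assign each key to a reducer by hash, then sort within each reducer."""
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        parts[simple_hash(key) % num_reducers].append(key)
    return [sorted(part) for part in parts]


keys = ["fig", "banana", "apple", "elderberry", "cherry", "date"]
parts = partition_and_sort(keys, num_reducers=2)
concatenated = [key for part in parts for key in part]

for index, part in enumerate(parts):
    print(f"part-{index}: {part}")
print("globally sorted:", concatenated == sorted(keys))
```

Every output file is sorted on its own, but reading the files one after another does not give a globally sorted result, because the hash decides which reducer a key goes to.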

Question #19

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?

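A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Ans: d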

Question #20

(Bonus Question)
During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?

A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Ans: a
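
A tiny simulation of the shuffle step illustrates the behaviour this question asks about. It is a sketch in plain Python, with a hypothetical `shuffle` helper: keys are grouped and handed over in sorted order, while the values of each key are left in whatever order they arrived. In a real cluster that arrival order is not something to rely on.

```python
from collections import defaultdict


def shuffle(map_output):
    """Group values by key and return the groups in sorted key order."""
    grouped = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)  # values are never sorted, only collected
    return [(key, grouped[key]) for key in sorted(grouped)]


map_output = [("pear", 3), ("apple", 9), ("pear", 1), ("apple", 2), ("fig", 5)]
reducer_input = shuffle(map_output)
print(reducer_input)
```

If the values also need to be ordered, that takes extra work such as a secondary sort; the standard shuffle only sorts keys.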
