The Ultimate Cheat Sheet On Hadoop

This Hadoop cheat sheet collects 20 frequently asked questions (including a bonus question) to test your Hadoop knowledge. Try answering each question yourself first, then compare with the answers given here.
Question #1 
You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?
  • A. Writable
  • B. WritableComparable
  • C. InputFormat
  • D. OutputFormat
  • E. Combiner
  • F. Partitioner
Ans: E
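
For intuition, here is a minimal word-count-style sketch (class names are illustrative, not from the question) showing how a combiner pre-aggregates map output locally before the shuffle, so repeated keys cross the network once per mapper instead of once per record:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each key. The same class can serve as both the
// combiner (local pre-aggregation on each mapper) and the final reducer,
// because addition is associative and commutative.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

In the driver, job.setCombinerClass(IntSumReducer.class) enables the local aggregation; with skewed keys, 500 million intermediate pairs can collapse to one pair per distinct key per mapper before anything crosses the network.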
Question #2
Where is the Hive metastore stored by default?
  • A. In HDFS
  • B. On the client machine in the form of a flat file.
  • C. On the client machine in an embedded Derby database.
  • D. In the lib directory of HADOOP_HOME; requires HADOOP_CLASSPATH to be modified.
Ans: C
Question #3
The NameNode uses RAM for which of the following purposes?
  • A. To store the contents of files in HDFS.
  • B. To store the file names, lists of blocks, and other metadata.
  • C. To store a log that keeps track of changes in HDFS.
  • D. To manage distributed read and write locks on files in HDFS.
Ans: B
Question #4
Which statement is true about reduce-side joins?
  • A. They require a lot of in-memory processing.
  • B. The amount of data written to the local disk of the node running the reduce task increases.
  • C. The reduce task generates more output data than input data.
  • D. They require declaring a custom partitioner and group comparator in the JobConf object.
Ans: B (every tagged record for a join key passes through the shuffle and is merged on the reducer's local disk; unlike a map-side join, no in-memory caching is required)
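For context, here is a minimal reduce-side join sketch (the table layout and class names are assumptions for illustration). Each mapper tags its records with their source, every tagged record travels through the shuffle, and the reducer joins all records sharing a key, which is why the reducers' local disks see more data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Tags each customer record (id,name) with "C" so the reducer can tell sources apart.
class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    context.write(new Text(fields[0]), new Text("C\t" + fields[1]));
  }
}

// Tags each order record (customerId,amount) with "O".
class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    context.write(new Text(fields[0]), new Text("O\t" + fields[1]));
  }
}

// All records for one key arrive together; buffer one side, then emit joined rows.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String customer = null;
    List<String> orders = new ArrayList<>();
    for (Text v : values) {
      String[] tagged = v.toString().split("\t", 2);
      if (tagged[0].equals("C")) {
        customer = tagged[1];
      } else {
        orders.add(tagged[1]);
      }
    }
    if (customer != null) {
      for (String amount : orders) {
        context.write(key, new Text(customer + "\t" + amount));
      }
    }
  }
}

The two inputs would be wired up with MultipleInputs.addInputPath(...) in the driver.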
Question #5
Consider the below query:
INSERT OVERWRITE TABLE newTable
SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;
Is the output result stored in HDFS?
  • A. Yes, inside newTable
  • B. Yes, inside shakespeare.
  • C. No, not at all.
  • D. Maybe, depends on the permission given to the client
Ans: A
Question #6
One of the business analysts in your organization has strong expertise in C coding. He wants to clean and model the business data stored in HDFS. Which of the following is best suited for him?
  • A. HIVE
  • B. PIG
  • C. MAPREDUCE
  • D. OOZIE
  • E. Hadoop Streaming
Ans: E (Hadoop Streaming lets any executable that reads stdin and writes stdout, including a C program, serve as the mapper and reducer; native MapReduce would require Java)
Question #7
Which process describes the life cycle of a mapper?
  • A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
  • B. The TaskTracker spawns a new mapper process to process all records of a single InputSplit.
  • C. The TaskTracker spawns a new mapper process to process each key-value pair.
  • D. The JobTracker spawns a new mapper process to process all records of a single input file.
Ans: B (one mapper task handles all records in one InputSplit; a new process is not spawned per key-value pair)
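In the newer org.apache.hadoop.mapreduce API the same lifecycle is visible directly: setup() runs once per mapper task before any records from its InputSplit, map() runs once per record, and cleanup() runs once at the end. A minimal sketch:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) {
    // Called once per mapper task, before the first record of its InputSplit.
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once for every record in this task's InputSplit.
    context.write(value, new LongWritable(1L));
  }

  @Override
  protected void cleanup(Context context) {
    // Called once after the last record has been processed.
  }
}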
Question #8
How does the NameNode detect that a DataNode has failed?
  • A. The NameNode does not need to know that a DataNode has failed.
  • B. When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode failed.
  • C. The NameNode pings the DataNode. If the DataNode does not respond, the NameNode considers the DataNode failed.
  • D. When HDFS starts up, the NameNode tries to communicate with the DataNodes and considers any DataNode failed if it does not respond.
Ans: B
Question #9
Two files need to be joined on a common column. Which technique is faster, and why?
  • A. The reduce-side join is faster, as it receives the records sorted by keys.
  • B. The reduce-side join is faster, as it uses secondary sort.
  • C. The map-side join is faster, as it caches the data from one file in memory.
  • D. The map-side join is faster, as it writes the intermediate data to the local file system.
Ans: C (a map-side join caches the smaller file in memory on every mapper and so avoids the shuffle and sort entirely)
Question #10
You want to run two different jobs that may use the same lookup data (for example, US state codes). While submitting the first job you used the distributed cache to copy the lookup data file to each data node. Both jobs have a mapper configure() method in which the distributed file is retrieved programmatically and its values are cached in a hash map. Both jobs use ToolRunner so that the file for the distributed cache can be provided at the command prompt. You run the first job with the data file passed to the distributed cache. When that job is complete, you fire the second job without passing the lookup file to the distributed cache. What is the consequence? (Select one)
  • A. The first job runs but the second job fails. This is because the distributed cache is persistent only while its job runs; after the job completes, the cached files are removed.
  • B. The first and second jobs complete without any problem, because distributed cache files, once set, are permanently copied.
  • C. The first and second jobs will complete successfully if the number of reducers is set to zero, because the distributed cache works only with map-only jobs.
  • D. Both jobs are successful if they are chained using ChainMapper or ChainReducer, because the distributed cache only works with ChainMapper or ChainReducer.
Ans: A (distributed cache files are localized per job and cleaned up when the job finishes, so the second job cannot retrieve the file)
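The pattern the question describes looks roughly like this (the file name statecodes.txt and the field layout are assumptions for illustration). With ToolRunner, a file passed on the command line as -files statecodes.txt is localized for that job only and symlinked into each task's working directory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> stateCodes = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Reads the cache file localized for this job. Because cached files are
    // removed once the job that shipped them finishes, a second job that does
    // not pass the file again will fail right here.
    try (BufferedReader reader = new BufferedReader(new FileReader("statecodes.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t");
        stateCodes.put(parts[0], parts[1]); // e.g. "CA" -> "California"
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String code = value.toString().trim();
    context.write(new Text(code), new Text(stateCodes.getOrDefault(code, "UNKNOWN")));
  }
}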
Question #11
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
  • A. Run all the nodes in your production cluster as virtual machines on your development workstation.
  • B. Run the hadoop command with the -jt local and the -fs file:/// options.
  • C. Run the DataNode, TaskTracker, JobTracker and NameNode daemons on a single machine.
  • D. Run simpldoop, Apache open source software for simulating Hadoop cluster.
Ans: C
Question #12
MapReduce is well-suited for all of the following EXCEPT? (Choose one)
  • A. Text mining on large collections of unstructured documents.
  • B. Analysis of large amounts of web logs (queries, clicks etc.).
  • C. Online transaction processing (OLTP) for an e-commerce Website.
  • D. Graph mining on a large social network (e.g. Facebook friend’s network).
Ans: C (OLTP requires low-latency reads and writes of individual records, which batch-oriented MapReduce does not provide)
Question #13
Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?
  • A. About 3 TB
  • B. About 5 TB
  • C. About 10 TB
  • D. About 11 TB
Ans: A (10 nodes × 1 TB = 10 TB of raw disk; with the default replication factor of 3, usable HDFS capacity is about 10 ÷ 3 ≈ 3.3 TB)
Question #14
Combiners increase the efficiency of a MapReduce program because:
  • A. They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
  • B. They provide an optimization and reduce the total number of computations needed to execute an algorithm by a factor of n, where n is the number of reducers.
  • C. They aggregate map output locally in each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
  • D. They aggregate intermediate map output to a small number of nearby (i.e. rack local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
Ans: C
Question #15
When is the reduce method first called in a MapReduce Job?
  • A. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only and reduce-only jobs.
  • B. Reducers start copying the intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
  • C. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
  • D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
Ans: C
Question #16
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
  • A. TaskTracker
  • B. NameNode
  • C. DataNode
  • D. JobTracker
  • E. Secondary Namenode
Ans: D
Question #17
What is the maximum limit on the number of key-value pairs a mapper can emit?
  • A. It is equal to the number of lines in the input files.
  • B. It is equal to the number of times the map() method is called in the mapper task.
  • C. There is no such restriction. It depends on the use case and logic.
  • D. 1000
Ans: C (a single map() call may emit zero, one, or many pairs; there is no fixed limit)
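A quick illustration of why there is no fixed limit: this standard word-count mapper emits one pair per token, so a single map() call may produce zero, one, or many key-value pairs depending on the input line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // One map() call per input line, but one emitted pair per word:
    // a 10-word line produces 10 pairs, an empty line produces none.
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}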
Question #18
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across the cluster?
  • A. You will not be able to compress your intermediate data.
  • B. You will no longer be able to take advantage of a Combiner.
  • C. The output files may not be in globally sorted order.
  • D. There is no problem.
Ans: C (each reducer writes its own sorted output file, but there is no total order across the files)
Question #19
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
  • A. Combiner<Text, IntWritable, Text, IntWritable>
  • B. Mapper<Text, IntWritable, Text, IntWritable>
  • C. Reducer<Text, Text, IntWritable, IntWritable>
  • D. Reducer<Text, IntWritable, Text, IntWritable>
  • E. Combiner<Text, Text, IntWritable, IntWritable>
Ans: D (Hadoop has no Combiner interface; a combiner is a Reducer whose input key/value types match the map output types, as in the example under Question #1)
Question #20 (Bonus Question)
During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?
  • A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
  • B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
  • C. Keys are presented to a reducer in random order; values for a given key are not sorted.
  • D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Ans: A (the framework sorts keys before invoking reduce(), but the order of the values for a given key is undefined unless you implement a secondary sort)
