The Ultimate Cheat Sheet On Hadoop

This Hadoop cheat sheet collects the top 20 frequently asked questions for testing your Hadoop knowledge. Try working out your own answers first, then compare them with the answers given here.

Question #1 

You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?

A. Writable
B. WritableComparable
C. InputFormat
D. OutputFormat
E. Combiner
F. Partitioner
Ans: e
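
Why a combiner helps can be sketched with a small simulation. The snippet below is plain Python, not the Hadoop API: the word-count job, the two input splits, and the function names are all assumptions made for illustration. It counts how many key-value pairs would cross the network with and without a mapper-side aggregation step.

```python
from collections import Counter, defaultdict


def map_phase(records):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for record in records for word in record.split()]


def combine(pairs):
    """Sum values per key on the mapper side, which is the combiner's role."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_phase(shuffled_pairs):
    """Sum values per key after the shuffle, as the reducer would."""
    totals = defaultdict(int)
    for key, value in shuffled_pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical input splits (one per mapper) with a skewed key "a".
splits = [
    ["a a a b", "a a c"],
    ["a b b", "a a a a"],
]

# Pairs that would be shuffled to the reducers in each setup.
without_combiner = [pair for split in splits for pair in map_phase(split)]
with_combiner = [pair for split in splits for pair in combine(map_phase(split))]

print("pairs shuffled without combiner:", len(without_combiner))
print("pairs shuffled with combiner:", len(with_combiner))
print("final counts:", reduce_phase(with_combiner))
```

Both setups produce the same final counts, but the combined version shuffles far fewer pairs. This only works because summation is associative and commutative; in Hadoop itself the combiner is typically a Reducer class registered with `job.setCombinerClass(...)`, and the framework may run it zero or more times.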

Question #2 

Where is the Hive metastore stored by default?

B. In client machine in the form of a flat file.
C. In client machine in a Derby database
D. In lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modified.
Ans: c


Social Analytics - How Marketers Will Use

Of all the windows through which a business can peer into an audience, social analytics seems the most enticing. The breadth of subjects, the range of observations, and, above all, the ability to connect them and draw inferences make it hugely exciting for anyone interested in understanding and influencing past, present and potential customers, employees, or even investors.

As individuals leave traces of their activities - personal, social and professional - on the internet, they allow an unprecedented view into their lives, thoughts, influences and preferences. Social analytics attempts to draw useful understanding and inferences from these traces, which could be relevant to marketers, salespeople, HR managers, product designers, investors and so on. Social tools like Facebook, Twitter, LinkedIn, WhatsApp, and many more host the social activities of a great many people, so they generate a huge amount of data about people's preferences, behaviour and sentiments. Like any data, it is amenable to analysis to gain useful insights.

The challenge comes from the sheer volume, velocity, and variety of this data. It is very difficult to ensure that the analysis is relevant and reliable. Besides the daunting technical intricacies of setting up the appropriate analytics, choosing information sources, filtering the right data, and interpreting and aggregating it are all susceptible to errors and biases. For example, some social activities are relatively easy to access (like activity on Twitter, or public updates on Facebook), but many are not. Some types of data (like text, or location) are easy to search and interpret, but many (like pictures) are not. A good analysis model must therefore judiciously compensate for the nature of the sources included, and it can at times be very difficult to assess whether the analysis is useful or just meaningless mumbo-jumbo.

