Posts

Showing posts with the label Hadoop

Featured Post

SQL Interview Success: Unlocking the Top 5 Frequently Asked Queries

Here are the top five commonly asked SQL queries in interviews. You can expect these in Data Analyst or Data Engineer interviews.

Top SQL Queries for Interviews

01. Joins

The commonly asked question gives you two tables and asks how many rows various join types will return, and what the result looks like.

Table1 (id): 1, 1, 2, 3
Table2 (id): 1, 3, 1, NULL

Inner join
----------
5 rows will return. The result will be:

1  1
1  1
1  1
1  1
3  3

02. Substring and Concat

Here, we need to write an SQL query that makes the first letter upper case and the remaining letters lower case.

Table1 (ename): raJu, venKat, kRIshna

Solution:

SELECT CONCAT(UPPER(SUBSTRING(ename, 1, 1)), LOWER(SUBSTRING(ename, 2))) AS capitalized_name
FROM Table1;

03. Case statement

SQL Query:

SELECT Code1, Code2,
  CASE
    WHEN Code1 = 'A' AND Code2 = 'AA' THEN …

Essential features of Hadoop Data joins (1 of 2)

Limitation of map-side joining: a record being processed by a mapper may need to be joined with a record that is not easily accessible (or even locatable) by that mapper. This is the main limitation.

What facilitates map-side joins: Hadoop's org.apache.hadoop.mapred.join package contains helper classes to facilitate map-side joins.

What is joining data in Hadoop: you will come across scenarios where you need to analyze data from multiple sources, and in these scenarios Hadoop joins the data. In the database world, combining two or more tables is called a join. In Hadoop, joining data involves different approaches:

- Reduce-side join
- Replicated join using a distributed cache (see the sketch below)
- Semi-join: a reduce-side join with map-side filtering

What is the functionality of a MapReduce job: the traditional MapReduce job reads a set of input data, performs some transformations in the map phase, sorts the results, performs another transformation in the reduce phase, and writes a set of output data. The…
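To make the replicated join concrete, here is a minimal Java sketch (not from the original post) of a map-side join mapper. It assumes the driver shipped a small lookup table to every node through the distributed cache, e.g. job.addCacheFile(new URI("/meta/lookup.csv#lookup.csv")); the file name and its comma-separated layout are hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Replicated (map-side) join: the small table is loaded into memory in
// setup(), so each input record can be joined without a reduce phase.
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "lookup.csv" is the local symlink created by the #fragment in addCacheFile().
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                smallTable.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", 2);
        String match = smallTable.get(fields[0]);
        if (match != null) {
            // Emit the joined record; unmatched rows are dropped (inner join).
            context.write(new Text(fields[0]), new Text(fields[1] + "," + match));
        }
    }
}

This pattern avoids the shuffle cost of a reduce-side join, but it only works while the smaller table fits in each mapper's memory.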

How to Verify SSH Is Installed on a Hadoop Cluster Quickly

The commands below help you check whether SSH is installed on your Hadoop cluster.

[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/sbin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen

If you do not get responses like the above, SSH is not installed on your cluster.

Resolution: if you receive an error message such as

/usr/bin/which: no ssh in (/usr/bin:/usr/sbin...)

you need to install OpenSSH (www.openssh.com) via a Linux package manager, or by downloading the source directly. Note: this is usually done by a system admin.

How to Use Help Command in HDFS

Sometimes, as a Hadoop developer, it is difficult to remember all the Hadoop commands. The help command is useful for finding the correct syntax.

How to List All HDFS Commands

hadoop fs

Entering this lists all Hadoop file system commands.

Help Command in HDFS

Hadoop commands have the flavor of UNIX. If you want to see a description of each command, use the Hadoop help command:

hadoop fs -help ls

Deleting a File in Hadoop HDFS

The command below deletes a file from the Hadoop cluster:

hadoop fs -rm example.txt

How to Set Up a Hadoop Cluster: Top Ideas

Hadoop cluster setup on the CentOS operating system is explained in this post. You can install CentOS either on your laptop or in a virtual machine.

Hadoop Cluster Setup Process

9 Steps to Set Up a Hadoop Cluster

Step 1: Install Sun Java on Linux. Commands:

sudo apt-add-repository ppa:flexiondotorg/java
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo update-java-alternatives -s java-6-sun

Step 2: Create a Hadoop user. Commands:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Step 3: Install an SSH server if not already present. Commands:

$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 4: Install Hadoop. Commands:

$ wget http://www.eng.lsu.edu/mirrors/apache/hadoop/core/hadoop-0.22.0/hadoop-0.22.0.tar.gz
$ cd /home/hduser
$ tar xzf hadoop-0.…

Hadoop Big Data: a Quick Story for Dummies

Mike Olson is one of the fundamental brains behind Hadoop's development, yet even he is watching the new breed of "Big Data" software used inside Google. Olson is the CEO of Cloudera, a Silicon Valley startup that deals in Hadoop, an open source software platform based on the technology that turned Google into the most dominant force on the web. Hadoop is expected to fuel an $813 million software market by the year 2016. Even so, Olson says it is already old news. Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing massive amounts of data across thousands of very inexpensive servers, and the other detailed MapReduce, which pooled the processing power inside all of those servers and crunched all that…

Hadoop HDFS Comics to Understand Quickly

The HDFS file system in Hadoop stores the data supplied as input, and its fault-tolerant design avoids data loss. The real story of HDFS fault tolerance is given in a comic book, so you can understand it in less time.

What is HDFS in Hadoop

HDFS is optimized for high-streaming read performance, and this comes at the expense of random seek performance. This means that if an application is reading from HDFS, it should avoid (or at least minimize) the number of seeks. Sequential reads are the preferred way to access HDFS files. HDFS supports only a limited set of operations on files: writes, deletes, appends, and reads, but not updates. It assumes that data will be written to HDFS once and then read multiple times. HDFS does not provide a mechanism for local caching of data. The overhead of caching is large enough that data should simply be re-read from the source, which is not a problem for applications that mostly do sequential reads of large data f…
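To make the sequential-read pattern concrete, here is a minimal Java sketch (not from the original post) that streams an HDFS file front to back through the FileSystem API; the client is assumed to be configured via the usual core-site.xml, and the path comes from the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Streams an HDFS file to stdout sequentially -- the access pattern HDFS is optimized for.
public class HdfsSequentialRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            // Copy the whole stream front to back; no seeks are issued.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}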

5 Essential features of HBASE Storage Architecture

Many analytics programmers are confused about HBase. The question is: if we have HDFS, why do we need HBase? This post covers how HBase and HDFS are related in the Hadoop big data framework. HBase is a distributed, versioned, column-oriented, multidimensional storage system, designed for high performance and high availability. To leverage HBase successfully, you first must understand how it is implemented and how it works. HBase is an open source implementation of Google's BigTable architecture. Similar to traditional relational database management systems (RDBMSs), data in HBase is organized in tables. Unlike RDBMSs, however, HBase supports a very loose schema definition and does not provide any joins, query language, or SQL. Although HBase does not support real-time joins and queries, batch joins and/or queries via MapReduce can be easily implemented. In fact, they are well supported by higher-level s…
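As a small illustration of HBase's table-oriented, SQL-free access model, here is a hedged Java sketch using the standard client to write and read one cell; the table name "users" and column family "info" are assumptions, and the table is presumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
            table.put(put);

            // Read it back by row key -- a point lookup, not a join or an SQL query.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}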

Apache HIVE Top Features

Apache Hive supports the analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes. By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases like MySQL can optionally be used. Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE. Other features of Hive include:

- Indexing to provide acceleration, with an index type including compaction and bitmap index as of 0.10; more index types are planned.
- Different storage types such as plain text, RCFile, HBase, ORC, and others.
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
- Operating on compressed data stored in the H…
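As an illustrative sketch of running HiveQL from Java over JDBC (not from the original post; the HiveServer2 URL, credentials, and the page_views table are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 endpoint; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Create a table stored as ORC, one of the file formats Hive supports.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (url STRING, hits INT) STORED AS ORC");
            try (ResultSet rs = stmt.executeQuery("SELECT url, hits FROM page_views LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
                }
            }
        }
    }
}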

What is a Cluster - in the Age of Big Data and Analytics

A cluster is local in that all of its component subsystems are supervised within a single administrative domain, usually residing in a single room and managed as a single computer system. The constituent computer nodes are commercial off-the-shelf (COTS), are capable of full independent operation as is, and are of a type ordinarily employed individually for standalone mainstream workloads and applications. (Cluster in Hadoop: career options) The nodes may incorporate a single microprocessor or multiple microprocessors in a symmetric multiprocessor (SMP) configuration. The interconnection network employs COTS local area network (LAN) or system area network (SAN) technology that may be a hierarchy of, or multiple separate, network structures. A cluster network is dedicated to the integration of the cluster compute nodes and is separate from the cluster's external (worldly) environment. A cluster may be employed in many modes, including but not limited to: high capability or…

12 Top Hadoop Security Interview Questions

Here are interview questions on Hadoop security, useful to learn for your data science projects and for interviews.

12 Hadoop Security Interview Questions

1. How does Hadoop security work?
2. How do you enforce access control to your data?
3. How can you control who is authorized to access, modify, and stop Hadoop MapReduce jobs?
4. How do you get your (insert application here) to integrate with Hadoop security controls?
5. How do you enforce authentication for users on all types of Hadoop clients (for example, web consoles and processes)?
6. How can you ensure that rogue services don't impersonate real services (for example, rogue TaskTrackers and tasks, unauthorized processes presenting block IDs to DataNodes to get access to data blocks, and so on)?
7. Can you tie your organization's Lightweight Directory Access Protocol (LDAP) directory and user groups into Hadoop's permissions structure?
8. Can you encrypt data in transit in Hadoop?
9. Can your data be encrypted at rest on HDFS?
10. How can…

Big data benefits in the education field - a data-driven approach

Netflix can predict what movie you should watch next, and Amazon can tell what book you'll want to buy. With Big Data learning analytics, new online education platforms can predict which learning modules students will respond better to, and help get students back on track before they drop out. (Big data Hadoop career) That's important, given that the United States has the highest college dropout rate of any OECD (Organisation for Economic Co-operation and Development) country, with just 46% of college entrants completing their degree programs. In 2012, the United States ranked 17th in reading, 20th in science, and 27th in math in a study of 34 OECD countries. The country's rankings have declined relative to previous years. Many students cite the high cost of education as the reason they drop out. At private for-profit schools, 78% of attendees fail to graduate after six years, compared with a dropout rate of 45% for students in public colleges, according to a study by…

The most helpful HDFS file system commands (3 of 4)

dus: hadoop fs -dus PATH
Reports the sum of the file sizes in aggregate rather than individually.

expunge: hadoop fs -expunge
Empties the trash. If the trash feature is enabled, when a file is deleted it is first moved into the temporary .Trash/ folder. The file is permanently deleted from the .Trash/ folder only after a user-configurable delay.

get: hadoop fs -get [-ignorecrc] [-crc] SRC LOCALDST
Copies files to the local file system.

Use of Solr in the age of Big data

Lucene is a search library, whereas Solr is the web application built on top of Lucene that simplifies the use of the underlying search features. Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication. It powers the search and navigation features of many of the world's largest Internet sites.

Integration of Big Data into data warehousing involves:

- Data layer
- Technology layer
- Analytics layer
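To show what a basic full-text query looks like in code, here is a hedged Java sketch using the SolrJ client; the Solr URL, the collection name "articles", and the id/title fields are all assumptions.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws Exception {
        // Core URL is an assumption; adjust to your Solr installation.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
            SolrQuery query = new SolrQuery("title:hadoop");  // full-text query on a hypothetical field
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("title"));
            }
        }
    }
}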

15 Awesome Features That Should Be Present in a Big Data System

This post gives useful points on the features of a big data system. Without the right features, you will miss the benefits you can get from big data. What about traditional BI tools? Read on...

Traditional tools can quickly become overwhelmed by the large volume of big data. Latency (the time it takes to access the data) is as important a consideration as volume. There is a small difference: you might need to run an ad hoc query against the large data set, or a predefined report. A large data storage system is not a data warehouse, however, and it may not respond to queries in a few seconds. It is, rather, the organization-wide repository that stores all of its data and is the system that feeds into the data warehouses for management reporting. Big data needs to be considered in terms of how the data will be manipulated. The size of the data set will impact data capture, movement, storag…

Here's a Quick Guide to Hadoop Security

Here is a look at security topics and tools in Hadoop. These are security matters that everyone needs to take care of while working with a Hadoop cluster.

Hadoop Security

We live in a very insecure world. From your home's front door to your all-important virtual keys, your passwords, everything needs to be secured. The same goes for Big Data systems, where humongous amounts of data are processed, transformed, and stored. So you need security for the data. Imagine your company spent a couple of million dollars installing a Hadoop cluster to gather and analyze your customers' spending habits for a product category using a Big Data solution. Here, a lack of data security leads to customer apprehension.

Security Concerns

Because that solution was not secure, your competitor got access to that data, and your sales dropped 20% for that product category. How did the system allow unauthorized access to data? Wasn't there any authentication mechanism in place? Why were there no alerts? Th…

Microsoft HDInsight for Hadoop Cluster

HDInsight is Microsoft's implementation of a Big Data solution with Apache Hadoop at its core. HDInsight is 100 percent compatible with Apache Hadoop and is built on open source components in conjunction with Hortonworks, a company focused on getting Hadoop adopted on the Windows platform.

HDInsight: Microsoft's Initiative

Basically, Microsoft has taken the open source Hadoop project, added the functionality needed to make it compatible with Windows (because Hadoop is based on Linux), and submitted the project back to the community. All of the components are retested in typical scenarios to ensure that they work together correctly and that there are no versioning or compatibility issues.

Features

Microsoft's Hadoop-based distribution brings the robustness, manageability, and simplicity of Windows to the Hadoop environment. The focus is on hardening security through integration with Active Directory, thus making it enterprise-ready, and simplifying manageability through int…

Advanced Oozie for Software developers (Part 1 of 3)

Introduction to Oozie

Places are points of interest in specific locations that may be important to some people. Those locations are additionally associated with data that explains what is interesting or important about them.

How do people gather data? These are typically locations where people come for entertainment, interaction, services, education, and other types of social activities. Examples of places include restaurants, museums, theaters, stadiums, hotels, landmarks, and so on. Many companies gather data about places and use this data in their applications. In the telecommunications industry, probes are small packages of information sent from mobile devices. The majority of smartphones send probes regularly when the device is active and is running a geographical application (such as maps, navigation, traffic reports, and so on). The probe frequency varies for different providers (from 5 seconds to 30 seconds). Probes are normally directed to phone carriers su…

Hadoop: How to find which file is healthy

Hadoop provides a file system health check utility called "fsck". It checks the health of all the files under a given path, including all the files under '/' (the root).

bin/hadoop fsck /        - checks the health of all files
bin/hadoop fsck /test/   - checks the health of files under the path

By default, the fsck utility does not do anything about under-replicated and over-replicated blocks; Hadoop heals those blocks itself.

How to find which file is healthy

fsck prints a dot for each healthy file, and prints a message for each file that is not healthy, as well as for under-replicated blocks, over-replicated blocks, mis-replicated blocks, and corrupted blocks.

How to delete corrupted files

bin/hadoop fsck / -delete   - deletes any corrupted files found
bin/hadoop fsck / -m…

Top Hive interview Questions for quick read (1 of 2)

Selected interview questions on Hive, a technology used in the Hadoop ecosystem.

1) What are the major activities in the Hadoop ecosystem?
Within the Hadoop ecosystem, HDFS can load and store massive quantities of data in an efficient and reliable manner. It can also serve that same data back up to client applications, such as MapReduce jobs, for processing and data analysis.

2) What is the role of Hive in the Hadoop ecosystem?
Hive, often considered the Hadoop data warehouse platform, got its start at Facebook as their analysts struggled to deal with the massive quantities of data produced by the social network. Requiring analysts to learn and write MapReduce jobs was neither productive nor practical.

3) What is Hive in Hadoop?
Facebook developed a data warehouse-like layer of abstraction based on tables. The tables function merely as metadata, and the table schema is projected onto the data, instead of actually moving potentially massive set…

Big Data: Top Hadoop Interview Questions (4 of 5)

1) What is a MapReduce program?
- You need to give the actual steps in the program.
- You have to write scripts and code.

2) What is MapReduce?
- MapReduce is a data processing model.
- It is a combination of two parts: Mappers and Reducers.

3) What happens in the mapping phase?
It takes the input data and feeds each data element to the mapper.

4) What is the function of the Reducer?
The reducer processes all the outputs from the mappers and arrives at a final result.

5) What kind of input does MapReduce require?
It should be structured in the form of (key, value) pairs.

6) What is HDFS?
HDFS is a file system designed for large-scale data processing under frameworks such as MapReduce.

7) Is HDFS like UNIX?
No, but HDFS commands work similarly to UNIX commands.

8) What is a simple file command?
hadoop fs -ls

9) How do you copy data into the HDFS file system?
Use hadoop fs -put to copy a file into HDFS from the local system.

10) What is the default working directory in HDFS?
/user/$USER, where $USER is your log…
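To tie questions 1 through 5 together, here is the standard word-count example in Java (a generic illustration, not taken from this post) showing a Mapper and a Reducer exchanging (key, value) pairs:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapping phase: each input line is fed to the mapper, which emits (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for one word arrive together and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}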