Showing posts with the label Big data

Featured Post

Best Practices for Handling Duplicate Elements in Python Lists

Here are three useful ways to remove duplicates from a list. These come in handy in data analytics work.

01. Using a Set
Convert the list into a set, which automatically removes duplicates because a set holds only unique elements, then convert the set back to a list.
Solution:
original_list = [2, 4, 6, 2, 8, 6, 10]
unique_list = list(set(original_list))

02. Using a Loop
Iterate through the original list and append elements to a new list only if they haven't been added before.
Solution:
original_list = [2, 4, 6, 2, 8, 6, 10]
unique_list = []
for item in original_list:
    if item not in unique_list:
        unique_list.append(item)

03. Using List Comprehension
Build the new list with a list comprehension that appends only the elements not already present in it.
Solution:
original_list = [2, 4, 6, 2, 8, 6, 10]
unique_list = []
[unique_list.append(item) for item in original_list if item not in unique_list]

All three methods will result in unique elements in the list.
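As a quick runnable check of the three approaches above (note that the set-based version does not preserve the original order, while the other two keep first-seen order):

```python
# Minimal demo of the three de-duplication approaches.
original_list = [2, 4, 6, 2, 8, 6, 10]

# 01. Set: concise, but element order is not guaranteed.
via_set = list(set(original_list))

# 02. Loop: preserves first-seen order.
via_loop = []
for item in original_list:
    if item not in via_loop:
        via_loop.append(item)

# 03. List comprehension (side-effect style): same order as the loop.
via_comp = []
[via_comp.append(item) for item in original_list if item not in via_comp]

print(sorted(via_set))  # [2, 4, 6, 8, 10]
print(via_loop)         # [2, 4, 6, 8, 10]
print(via_comp)         # [2, 4, 6, 8, 10]
```

For large lists, the loop and comprehension versions are O(n^2) because of the `in` membership test on a list; the set version is the fast one.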

5 Python File Modes You Need

Here are the top five Python file modes explained. Their purpose is to control how a file is opened for reading and writing. There are occasions when you need to deal with data that is present in files, and you need to give the correct file mode to handle them in Python.

Here's an example of how you can use a file mode:

filename = input('Enter a filename: ')
f1 = open(filename, 'mode')

1- Python file mode w
To open a file for writing, use 'w' mode. The beauty of this mode is that if the file does not exist, it creates one. Its purpose is writing only; if you try to read, you will get an error.

2- Python file mode w+
To open a file for both reading and writing, use 'w+' mode. For instance, if you open a file with w+ and try to read it right after writing, it displays nothing. The reason is that after writing, the cursor position points at the end of the file.

3- Python file mode a
It appends the records at the
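A minimal sketch of the modes discussed above, using a temporary file so it is safe to run anywhere (the filename is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w': opens for writing and creates the file if it does not exist.
with open(path, "w") as f:
    f.write("first line\n")

# 'w+': opens for reading and writing, but truncates the file; after
# writing, the cursor sits at the end, so seek(0) is needed to read back.
with open(path, "w+") as f:
    f.write("overwritten\n")
    f.seek(0)              # move the cursor back to the start
    content = f.read()

# 'a': appends records at the end without truncating existing content.
with open(path, "a") as f:
    f.write("appended\n")

with open(path) as f:
    final = f.read()

print(content)  # overwritten
print(final)    # overwritten + appended lines
```

Without the seek(0) call, the read after writing in 'w+' mode returns an empty string, which is exactly the "displays blank" behavior described above.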

Top Niche Skills You Need for Big Data Career

Recruiters look for the following niche skills in Big Data Analytics professionals:
Creativity
Analytical Skills
Business Analysis Skills
Business Intelligence Skills
IT Technical Skills
Related: Hot IT Skills in 2016

Big data: Quiz-2 Hadoop Top Interview Questions

I hope you enjoyed my previous post. This is the second set of questions, exclusively for Big Data engineers. Read QUIZ-1.

Q.1) You have submitted a job on an input file which has 400 input splits in HDFS. How many map tasks will run?
A. At most 400.
B. At least 400.
C. Between 400 and 1200.
D. Between 100 and 400.
Ans: C

Q.2) What is not true about LocalJobRunner mode? Choose two.
A. It requires the JobTracker to be up and running.
B. It runs Mapper and Reducer in one single process.
C. It stores output in the local file system.
D. It allows use of the Distributed Cache.
Ans: A, D

Q.3) What command will you use to run a driver named "SalesAnalysis" whose compiled code is available in a jar file "SalesAnalytics.jar", with input data in directory "/sales/data" and output in directory "/sales/analytics"?
A. hadoopfs –jar SalesAnalytics.jar SalesAnalysis -input /sales/data -output /sales/analysis
B. hadoopfs jar SalesAnalytics.jar

Hadoop Bigdata a Quick Story for Dummies

Mike Olson is one of the fundamental brains behind the Hadoop movement. Yet even he keeps an eye on the new breed of "Big Data" software used inside Google. Olson runs a company that specializes in some of the world's hottest software: he's the CEO of Cloudera, a Silicon Valley startup that deals in Hadoop, an open source software platform built on the technology that turned Google into the most dominant force on the web. Hadoop is expected to fuel an $813 million software market by the year 2016. But even Olson says it's already old news. Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing enormous amounts of data across thousands of very inexpensive servers, and the other detailed MapReduce, which pooled the processing power inside all of those servers and crunched all that

Apache HIVE Top Features

Apache Hive supports the analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL while keeping full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes. By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used. Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE. Other features of Hive include:
Indexing to provide acceleration, with index types including compaction and bitmap index as of 0.10; further index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Operating on compressed data stored in the H

What is Cluster- In the age of Big data and Analytics

A cluster is local in that all of its component subsystems are supervised within a single administrative domain, usually residing in a single room and managed as a single computer system. The constituent computer nodes are commercial-off-the-shelf (COTS), are capable of full independent operation as is, and are of a type ordinarily employed individually for standalone mainstream workloads and applications. The nodes may incorporate a single microprocessor or multiple microprocessors in a symmetric multiprocessor (SMP) configuration. The interconnection network employs COTS local area network (LAN) or systems area network (SAN) technology that may be a hierarchy of or multiple separate network structures. A cluster network is dedicated to the integration of the cluster compute nodes and is separate from the cluster's external (worldly) environment. A cluster may be employed in many modes including but not limited to: high capability or

Big data benefits in Education field- A data driven approach

Netflix can predict what movie you should watch next, and Amazon can tell what book you'll want to buy. With Big Data learning analytics, new online education platforms can predict which learning modules students will respond better to and help get students back on track before they drop out. That's important given that the United States has the highest college dropout rate of any OECD (Organisation for Economic Co-operation and Development) country, with just 46% of college entrants completing their degree programs. In 2012, the United States ranked 17th in reading, 20th in science, and 27th in math in a study of 34 OECD countries. The country's rankings have declined relative to previous years. Many students cite the high cost of education as the reason they drop out. At private for-profit schools, 78% of attendees fail to graduate after six years, compared with a dropout rate of 45% for students in public colleges, according to a study by

5 Top Data warehousing Skills in the age of Big data

A data warehouse is a home for "secondhand" data that originates in either other corporate applications, such as the one your company uses to fill customer orders for its products, or some data source external to your company, such as a public database that contains sales information gathered from all your competitors.

What is data warehousing?
If your company's data warehouse were advertised as a used car, for example, it might be described this way: "Contains late-model, previously owned data, all of which has undergone a 25-point quality check and is offered to you with a brand-new warranty to guarantee hassle-free ownership." Most organizations build a data warehouse in a relatively straightforward manner: The data warehousing team selects a focus area, such as tracking and reporting the company's product sales activity against that of its competitors. The team in charge of building the da

8 Top key points in Apache Cassandra in the age of Big data

Decentralized: Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node holds different data), but there is no master, as any node can service any request.

Supports replication and multi-data-center replication: Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, for redundancy, for failover (a process by which a system automatically transfers control to a duplicate system when it detects a fault or failure), and for disaster recovery.

Scalability: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Fault
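The masterless placement described above rests on consistent hashing: each node owns a range of a token ring, and a key's replicas go to the next distinct nodes around the ring, so any node can compute where data lives. A toy sketch of the idea (not Cassandra's actual implementation; node names are made up):

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy consistent-hash ring: each node owns a token range, and a key's
    replicas are placed on the next `rf` distinct nodes around the ring."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        # Derive each node's token from a hash of its name (real systems
        # assign tokens explicitly or use many virtual nodes per machine).
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key):
        tokens = [t for t, _ in self.ring]
        # First owner is the node whose token follows the key's hash.
        i = bisect_right(tokens, self._hash(key)) % len(self.ring)
        owners = []
        while len(owners) < min(self.rf, len(self.ring)):
            node = self.ring[i][1]
            if node not in owners:
                owners.append(node)
            i = (i + 1) % len(self.ring)
        return owners

ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], rf=3)
print(ring.replicas("customer:42"))  # three distinct nodes, deterministically
```

Because placement is a pure function of the key, there is no lookup master to fail: every node reaches the same answer independently.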

Use of Solr in the age of Big data

Lucene is a search library, whereas Solr is the web application built on top of Lucene, which simplifies the use of the underlying search features. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication. It powers the search and navigation features of many of the world's largest Internet sites.

Integration of Big Data into data warehousing:
Data Layer
Technology Layer
Analytics Layer
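As a minimal sketch of talking to Solr's HTTP search API from Python: the host and core name below are assumptions, while q, fl, rows, facet, and facet.field are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, core, params):
    """Build a Solr /select URL; the search itself would be an HTTP GET."""
    return "{}/{}/select?{}".format(base_url, core, urlencode(params))

url = build_solr_query(
    "http://localhost:8983/solr",    # assumed local Solr instance
    "products",                      # hypothetical core name
    {
        "q": 'title:"big data"',     # full-text query
        "fl": "id,title,price",      # fields to return
        "rows": 10,                  # page size
        "facet": "true",             # enable faceted search
        "facet.field": "category",   # facet on a hypothetical field
    },
)
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) returns the results.
```

Building the query string separately from issuing the request keeps the example runnable without a live Solr server.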

15 Awesome Features Should Present in Big Data System

I have given useful points here on the features of a big data system. Without the right features, you will miss the benefits you can get from big data. Traditional BI tools can quickly become overwhelmed by the large volume of big data. Latency, the time it takes to access the data, is as important a consideration as volume. Suppose you need to run an ad hoc query or a predefined report against the large data set. A large data storage system is not a data warehouse, however, and it may not respond to queries in a few seconds. It is, rather, the organization-wide repository that stores all of its data and is the system that feeds into the data warehouses for management reporting. Big data needs to be considered in terms of how the data will be manipulated. The size of the data set will impact data capture, movement, storag

Big Data: Mobility Solutions for the Retail Industry

When it comes to retail solutions, mobility has multiple dimensions. Depending on the type of services and the beneficiaries, mobility in the retail industry can be grouped into the following two categories:
Mobility solutions for retailers (enterprise solutions)
Store inventory management solutions - Each retail chain or store is unique in its operations. Depending on the type of the products sold or services offered (apparel retailer, footwear retailer, grocery retailer, electronic goods retailer, pharmacy retailer, general merchandise retailer, etc.) and the general demand for a product, retailers may stock a large volume or a limited quantity of products, and it is essential to keep track of the inventory details for each product for on-time reordering to avoid any possible out-of-stock situations. AIDC technologies such as barcode labels and RFID tags can be easily combined with mobilit

Social Media and Mobile Technology for Health care

The ubiquity of mobile phone accessibility around the world is increasing. Worldwide, the number of mobile phones in use grew from fewer than 1 billion in 2000 to around 6 billion in 2012. Recent estimates conclude that over 75% of the world's population have access to a mobile phone (World Bank, 2012). Globally, there has been a rapid rise in the use of smartphones by consumers, with over 1 billion smartphone subscribers; approximately 30% of smartphone users are likely to use wellness apps by 2015 (Bjornland, Goh, Haanæs, Kainu, & Kennedy, 2012), with more than 30 billion mobile applications downloaded in 2011 (World Bank, 2012). Along with this increase in penetration, there has been a significant increase in the development and deployment of mobile software applications across multiple computing platforms (e.g. smartphones, tablets and laptops). The most pop

How to achieve Virtualization in cloud computing real ideas

In order to run applications on a Cloud, one needs a flexible middleware that eases the development and the deployment process. Middleware Approach to Deploy Application on Cloud GridGain provides a middleware that aims to develop and run applications on both public and private Clouds without any changes in the application code.  It is also possible to write dedicated applications based on the map/reduce programming model. Although GridGain provides a mechanism to seamlessly deploy applications on a grid or a Cloud, it does not support the deployment of the infrastructure itself. It does, however, provide protocols to discover running GridGain nodes and organize them into topologies (Local Grid, Global Grid, etc.) to run applications on only a subset of all nodes. Elastic Grid infrastructure provides dynamic allocation, deployment, and management of Java applications through the Cloud.  It also offers a Cloud virtualization layer that abstracts specific Cloud computing provide

Hadoop: How to find which file is healthy

Hadoop provides a file system health check utility called "fsck". Basically, it checks the health of all the files under a path, including all the files under '/' (the root).

bin/hadoop fsck / - checks the health of all the files
bin/hadoop fsck /test/ - checks the health of the files under the path

By default, the fsck utility cannot do anything about under-replicated and over-replicated blocks; Hadoop itself heals the blocks.

How to find which file is healthy
It prints a dot for each healthy file. It prints a message for each file that is not healthy, as well as for under-replicated blocks, over-replicated blocks, mis-replicated blocks, and corrupted blocks.

How to delete corrupted blocks
bin/hadoop fsck -delete block-names - deletes all corrupted blocks
bin/hadoop fsck -m
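As a sketch, an fsck run can be checked from a script by looking at its summary status; this assumes the usual HEALTHY/CORRUPT status line that fsck prints at the end of its report, and it factors the parsing out so it can be exercised without a running cluster:

```python
import subprocess

def parse_fsck_status(output):
    # fsck ends its report with a status line containing HEALTHY or CORRUPT.
    if "HEALTHY" in output:
        return True
    if "CORRUPT" in output:
        return False
    raise ValueError("no fsck status line found")

def fsck_is_healthy(path="/", hadoop="bin/hadoop"):
    """Run `hadoop fsck <path>` and report whether the summary says HEALTHY.
    Requires a running HDFS cluster, so it is not invoked in this demo."""
    result = subprocess.run([hadoop, "fsck", path],
                            capture_output=True, text=True)
    return parse_fsck_status(result.stdout)

# Example status lines as fsck would report them (illustrative strings):
print(parse_fsck_status("The filesystem under path '/' is HEALTHY"))  # True
print(parse_fsck_status("The filesystem under path '/' is CORRUPT"))  # False
```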

Understand Data power why quality everyone wants

Information and data quality is a new service area for data-intensive companies. I have seen Data Quality teams not only on analytics projects but on mainframe projects as well.

How incorrect data impacts us
Information quality problems and their impact are all around us: A customer does not receive an order because of incorrect shipping information. Products are sold below cost because of wrong discount rates. A manufacturing line is stopped because parts were not ordered, the result of inaccurate inventory information. A well-known U.S. senator is stopped at an airport (twice) because his name is on a government "Do not fly" list. Many communities cannot run an election with results that people trust. Financial reform has created new legislation such as Sarbanes-Oxley. Incorrect data leads to many problems. The role of data science is to use quality data for effective decisions.

What is information?
Information is not simply data, strings of numbers, lis

Big Data: Top Hadoop Interview Questions (4 of 5)

1) What is a MapReduce program?
- You need to give the actual steps in this program.
- You have to write scripts and code.

2) What is MapReduce?
- MapReduce is a data processing model.
- It is a combination of two parts: Mappers and Reducers.

3) What happens in the Mapping phase?
It takes the input data and feeds each data element into the mapper.

4) What is the function of the Reducer?
The reducer processes all outputs from the mapper and arrives at a final result.

5) What kind of input is required for MapReduce?
It should be structured in the form of (key, value) pairs.

6) What is HDFS?
HDFS is a file system designed for large-scale data processing under frameworks such as MapReduce.

7) Is HDFS like UNIX?
No, but commands in HDFS work similarly to UNIX.

8) What is a simple file command?
hadoop fs -ls

9) How do you copy data into the HDFS file system?
Copy a file into HDFS from the local system.

10) What is the default working directory in HDFS?
/user/$USER
$USER ==> Your log
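The (key, value) flow in questions 2 through 5 can be sketched in plain Python: a toy word count where the mapper emits (word, 1) pairs and the reducer sums them. This illustrates the model only, not Hadoop's actual API:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: feed each data element in and emit (key, value) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: process all mapper outputs and arrive at a final result.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big jobs", "big data tools"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = reducer(pairs)
print(counts)  # {'big': 3, 'data': 2, 'jobs': 1, 'tools': 1}
```

In real Hadoop, the framework also shuffles and sorts the pairs by key between the two phases, so each reducer sees all values for a given key together.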

Big Data: IBM InfoSphere BigInsights Basics

I am explaining here why you need IBM InfoSphere. You all know what a file system is in Hadoop. Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure. In simpler terms, just imagine that you've got dozens, or even hundreds (or thousands!) of individual computers racked and networked together. Each computer (often referred to as a node in Hadoop-speak) has its own processors and a dozen or so 2TB or 3TB hard disk drives. All of these nodes are running software that unifies them into a single cluster, where, instead of seeing the individual computers, you see an extremely large volume where you can store your data. The beauty of this Hadoop system is that you can store anything in this space: millions of digital image scans of mortgage contracts, days and weeks of security camera footage, trillions of sensor-generated log records, or all of the operator transcription notes from a call center

Big Data: Top Cloud Computing Interview Questions (1 of 4)

The following are frequently asked interview questions on cloud computing:

1) What is the difference between Cloud and Grid?
Grid:
- Information service
- Security service
- Data management
- Execution management
Cloud:
- Maintains up-to-date information of resources
- Creates VMs according to user requirements
- Application deployment
- User management

2) What are the different cloud standards?
- Interoperability standards
- Security standards
- Portability standards
- Governance and risk standards

3) What are the two different sub-systems in cloud computing?
- Management sub-system
- Resource sub-system

4) What is cloud computing?
The promise of cloud computing is ubiquitous access to a broad set of applications and services, which are delivered over the network to multiple customers.

5) Why do we need a specialized network for cloud services?
The public Internet is the simplest choice for delivering cloud-based services. In this model, the cloud provider simply purchases Inter

Big Data: Top NoSQL Interview Questions (2 of 5)

1) What is the most important characteristic of NoSQL?
High availability.

2) What are the different types of NoSQL databases?
Key-value stores
Column stores
Graph stores
Document stores

3) What is Oracle NoSQL Database?
Oracle NoSQL Database is a distributed key-value database designed to provide highly reliable, scalable, and available data storage across a configurable set of systems.

4) What is the DB engine used in Oracle NoSQL Database?
Oracle NoSQL Database uses Oracle Berkeley DB Java Edition as the underlying data storage engine.

5) How does Oracle NoSQL Database distribute data?
Oracle NoSQL Database is a shared-nothing system designed to run and scale on commodity hardware. Key-value pairs are hash partitioned across server groups known as shards. At any point in time, a single key-value pair is always associated with a unique shard in the system.

6) What are the unique features of Oracle NoSQL?
Oracle NoSQL Database leverages the high availability features in Berkeley DB in order to provide res
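The hash partitioning described in question 5 can be sketched like this: hashing a key selects its shard, so a given key is always associated with the same shard. Shard count and key names are made up, and real systems use more sophisticated placement than a simple modulo:

```python
import hashlib

NUM_SHARDS = 3

def shard_for(key, num_shards=NUM_SHARDS):
    """Hash-partition a key to a shard: the same key always maps to the
    same shard as long as the shard count is fixed."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# One key-value map per shard (each would live on its own server group).
store = [dict() for _ in range(NUM_SHARDS)]

def put(key, value):
    store[shard_for(key)][key] = value

def get(key):
    return store[shard_for(key)].get(key)

put("user:1", "alice")
put("user:2", "bob")
print(get("user:1"))  # alice
print(shard_for("user:1") == shard_for("user:1"))  # True: stable placement
```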