

Showing posts from August, 2015

How IBM Cloudscape supports online and offline Backups

IBM Cloudscape is a Java-enabled Object Relational Database Management System (ORDBMS). It supports complex SQL statements and transaction management, and provides multi-user support.

You can create two types of backups of the Cloudscape database, online backup and offline backup. You can make backups of a database and the log file of the database. When restoring the database, you need to restore the log file and the database.

The picture below shows the structure of a Cloudscape system.

You need to understand the structure and contents of a Cloudscape system in order to make backups and restore Cloudscape databases.

A Cloudscape system consists of a system directory, one or more Cloudscape databases, and the system properties and configuration settings.

A system directory contains a properties file and an information and error log file called db2j.log. Each database in a Cloudscape system is stored in a subdirectory, which has t…
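An offline backup amounts to copying the database subdirectory and the log while the database is shut down, and restoring is copying them back. A minimal sketch in Python, using an invented directory layout (the names here are illustrative, not Cloudscape's actual on-disk format):

```python
import shutil
import tempfile
from pathlib import Path

def backup_database(system_dir: Path, db_name: str, backup_dir: Path) -> Path:
    """Copy a database subdirectory and the system log into backup_dir.

    Offline-backup sketch: assumes the database is shut down, so the
    files on disk are consistent and a plain file copy is safe.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Copy the whole database subdirectory.
    shutil.copytree(system_dir / db_name, backup_dir / db_name)
    # Copy the system-wide information/error log alongside it.
    log = system_dir / "db2j.log"
    if log.exists():
        shutil.copy2(log, backup_dir / "db2j.log")
    return backup_dir

# Demo with a throwaway directory layout.
root = Path(tempfile.mkdtemp())
(root / "sys" / "salesdb").mkdir(parents=True)
(root / "sys" / "salesdb" / "seg0").write_text("table data")
(root / "sys" / "db2j.log").write_text("log entries")
dest = backup_database(root / "sys", "salesdb", root / "backup")
restored = (dest / "salesdb" / "seg0").read_text()
```

Restoring is the same copy in the opposite direction, followed by restoring the log so the database can roll forward.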

How PhoneGap is useful to create cross-platform mobile Applications (1 of 2)

There are many smartphone platforms on the market: Android, iPhone, BlackBerry, Nokia, Windows Phone 7, and WebOS. Newer platforms are on the rise as well, such as Samsung's Bada and MeeGo.

The sheer number of development platforms for mobile applications may seem overwhelming. This is the first of many points you must keep in mind when dealing with mobile application development.

In the year 2000, we saw a similar situation in the desktop world. We had Microsoft Windows, Apple's Mac, and various versions of Linux and UNIX. At that time, it was difficult to build products that would run on all these platforms. The resulting fragmentation was often solved via in-house frameworks built in C++, with Operating System (OS)-specific modules abstracted away. Fortunately, Sun's Java came to the rescue and provided us with a common platform on which to build. With Java's write-once-run-anywhere approach, building desktop products became a breeze.


Top key points about Matlab Software Package (1 of 2)

Matlab is probably the world's most successful commercial numerical analysis software package, and its name is derived from the term "matrix laboratory." It provides an interactive development tool for scientific and engineering problems and more generally for those areas where significant numeric computations have to be performed.

The package can be used to evaluate single statements directly or a list of statements called a script can be prepared. Once named and saved, a script can be executed as an entity. The package was originally based on software produced by the LINPACK and EISPACK projects but currently includes LAPACK and BLAS libraries which represent the current "state-of-the-art" numerical software for matrix computations. Matlab provides the user with

Easy manipulation of matrix structures
A vast number of powerful built-in routines that are constantly growing and developing
Powerful two- and three-dimensional graphing facilities
A scripting system tha…

Udemy Top demanding IT Skills and Jobs

IT and Software
CCNP wireless certification
Ethical hacking
Internet of Things
ITIL foundation certification
Network+ 2015

Construct 3
Hadoop / Big Data/MapReduce/Apache hive & pig
Laravel 5
React Native
Unreal Engine 4

Business Intelligence
CPA exam
Project Management PMP
Tableau 9.0
Business Analysis
Big Data for Business

How Hadoop distributed file system HDFS really works

What is HDFS in Hadoop - HDFS is optimized to support high-streaming read performance, and this comes at the expense of random seek performance. This means that if an application is reading from HDFS, it should avoid (or at least minimize) the number of seeks. Sequential reads are the preferred way to access HDFS files.

HDFS supports only a limited set of operations on files — writes, deletes, appends, and reads, but not updates. It assumes that the data will be written to the HDFS once, and then read multiple times.

HDFS does not provide a mechanism for local caching of data. The overhead of caching is large enough that data should simply be re-read from the source, which is not a problem for applications that are mostly doing sequential reads of large-sized data files.
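The preferred access pattern described above can be illustrated with plain Python file I/O: one open, a stream of contiguous fixed-size reads, no seeks. This is a local-filesystem sketch of the pattern, not the HDFS client API:

```python
import os
import tempfile

def sequential_read(path, chunk_size=64 * 1024):
    """Read a file front to back in fixed-size chunks.

    This is the access pattern HDFS is optimized for: sequential
    streaming reads rather than random seeks.
    """
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)

# Demo on a small temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 200_000)
data = sequential_read(path)
os.remove(path)
```

An application that instead hops around the file with seek() pays the full seek penalty on every hop, which is exactly what HDFS readers should avoid.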

Access complete HDFS details here.

5 Essential features of HBASE Storage Architecture

Many analytics programmers are confused about HBase: if we already have HDFS, why do we need HBase? This post covers how HBase and HDFS are related in the Hadoop big data framework.
HBase is a distributed, versioned, column-oriented, multidimensional storage system, designed for high performance and high availability. To be able to successfully leverage HBase, you first must understand how it is implemented and how it works.

HBase is an open source implementation of Google's BigTable architecture. Similar to traditional relational database management systems (RDBMSs), data in HBase is organized in tables. Unlike RDBMSs, however, HBase supports a very loose schema definition, and does not provide any joins, query language, or SQL.

Although HBase does not support real-time joins and queries, batch joins and/or queries via MapReduce can be easily implemented. In fact, they are well-supported by higher-level systems such as Pig and Hiv…
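The "very loose schema" can be pictured as a sorted, multidimensional map: row key, then column family, then column qualifier, then timestamp, mapping to a value. A toy model in Python dicts (this is an illustration of the data model only, not the real HBase client API; all names are invented):

```python
from collections import defaultdict

class ToyHBaseTable:
    """Toy model of HBase's map-of-maps layout; not the real client API."""

    def __init__(self):
        # row key -> column family -> qualifier -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        self._rows[row][family].setdefault(qualifier, {})[ts] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, like a default read."""
        versions = self._rows[row][family].get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

t = ToyHBaseTable()
t.put("user1", "info", "name", "Alice", ts=1)
t.put("user1", "info", "name", "Alicia", ts=2)  # newer version wins
# Rows need not share columns: no fixed schema, no joins, no SQL.
t.put("user2", "info", "email", "bob@example.com", ts=1)
latest = t.get("user1", "info", "name")
```

Note how the two rows carry different columns: that is the loose schema, and why joins and a query language are not part of the core model.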

5 Essential features of SAS Visual Analytics in the age of big data

SAS Visual Analytics is an easy-to-use, web-based product that leverages SAS high-performance analytic technologies and empowers organizations to explore huge volumes of data very quickly in order to see patterns and trends, and to identify opportunities for further analysis.
SAS Visual Data Builder enables users to summarize data, join data, and enhance the predictive power of their data. Users can prepare data for exploration and mining quickly and easily.  The highly visual, drag and drop data interface of SAS Visual Analytics Explorer combined with the speed of the SAS LASR Analytic Server accelerates analytic computations and enables organizations to derive value from massive amounts of data. This creates an unprecedented ability to solve difficult problems, improve business performance, and mitigate risk rapidly and confidently. SAS Visual Analytics Designer enables users to quickly create reports or dashboards, which can be viewed on a mobile device or on the web.

SAS Visual Ana…

Top Apache HIVE excellent built-in features for Big data

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem.

It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes.

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.

Currently, there are four file formats supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE.

Other features of Hive include:

Indexing to provide acceleration, with an index type including compaction and bitmap index as of 0.10; further index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Operating on compressed data stored in the Hadoop environment, set of…

What is Cluster- In the age of Big data and Analytics

A cluster is local in that all of its component subsystems are supervised within a single administrative domain, usually residing in a single room and managed as a single computer system.

The constituent computer nodes are commercial-off-the-shelf (COTS), are capable of full independent operation as is, and are of a type ordinarily employed individually for standalone mainstream workloads and applications.

The nodes may incorporate a single microprocessor or multiple microprocessors in a symmetric multiprocessor (SMP) configuration.

The interconnection network employs COTS local area network (LAN) or systems area network (SAN) technology that may be a hierarchy of or multiple separate network structures. A cluster network is dedicated to the integration of the cluster compute nodes and is separate from the cluster's external (worldly) environment.

A cluster may be employed in many modes including but not limited to: high capability or sustained performance on a single problem, hig…

What are the frequently asked questions on Hadoop security

How does Hadoop security work?
How do you enforce access control to your data?
How can you control who is authorized to access, modify, and stop Hadoop MapReduce jobs?
How do you get your (insert application here) to integrate with Hadoop security controls?
How do you enforce authentication for users on all types of Hadoop clients (for example, web consoles and processes)?
How can you ensure that rogue services don't impersonate real services (for example, rogue TaskTrackers and tasks, unauthorized processes presenting block IDs to DataNodes to get access to data blocks, and so on)?
Can you tie in your organization's Lightweight Directory Access Protocol (LDAP) directory and user groups to Hadoop's permissions structure?
Can you encrypt data in transit in Hadoop?
Can your data be encrypted at rest on HDFS?
How can you apply consistent security controls to your Hadoop cluster?
What are the best practices for security in Hadoop today?
Are there proposed changes to Hadoop's security…

Top 5 Key points on Virtual Private Cloud

What is virtual private cloud: The concept of a virtual private cloud (VPC) has emerged recently. In a typical approach, a VPC connects an organization's information technology (IT) resources to a dynamically allocated subset of a cloud provider's resources via a virtual private network (VPN).

Organizational IT controls are then applied to the collective resources to meet required service levels. As a result, in addition to improved TCO, the model promises organizations direct control of security, reliability and other attributes they have been accustomed to with conventional, internal data centers.

The VPC concept is both fundamental and transformational. First, it proposes a distinct abstraction of public resources combined with internal resources that provides equivalent functionality and assurance to a physical collection of resources operated for a single organization, wherein the public resources may be shared with many other organizations that are also simultaneously be…

Big data benefits in Education field- A data driven approach

Netflix can predict what movie you should watch next and Amazon can tell what book you'll want to buy.

With Big Data learning analytics, new online education platforms can predict which learning modules students will respond better to and help get students back on track before they drop out.

That's important given that the United States has the highest college dropout rate of any OECD (Organisation for Economic Co-operation and Development) country, with just 46% of college entrants completing their degree programs. In 2012, the United States ranked 17th in reading, 20th in science, and 27th in math in a study of 34 OECD countries. The country's rankings have declined relative to previous years.

Many students cite the high cost of education as the reason they drop out. At private for-profit schools, 78% of attendees fail to graduate after six years compared with a dropout rate of 45% for students in public colleges, according to a study by the Pew Research Center.

Among 18 …

5 Top Data warehousing Skills in the age of Big data

A data warehouse is a home for "secondhand" data that originates in either other corporate applications, such as the one your company uses to fill customer orders for its products, or some data source external to your company, such as a public database that contains sales information gathered from all your competitors.

What is Data warehousing

If your company's data warehouse were advertised as a used car, for example, it may be described this way: "Contains late-model, previously owned data, all of which has undergone a 25-point quality check and is offered to you with a brand-new warranty to guarantee hassle-free ownership."

Most organizations build a data warehouse in a relatively straightforward manner:
The data warehousing team selects a focus area, such as tracking and reporting the company's product sales activity against that of its competitors.
The team in charge of building the data warehouse assigns a group of business users and other key individua…

8 Top key points in Apache Cassandra in the age of Big data

Decentralized: Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node holds different data), but there is no master, as any node can service any request.

Supports replication and multi-data-center replication: Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, for redundancy, for failover (a procedure by which a system automatically transfers control to a duplicate system when it detects a fault or failure), and for disaster recovery.


Scalability: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Fault-tolerant: Data is automatically replicated to …
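The "no master" property rests on every node being able to compute which nodes own a key. A minimal consistent-hashing sketch of that idea (a simplification for illustration, not Cassandra's actual partitioner; node names are invented):

```python
import hashlib
from bisect import bisect

class ToyRing:
    """Toy token ring: each node owns the keys whose hash falls in its arc."""

    def __init__(self, nodes):
        # Place each node at a deterministic position on a hash ring.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def owner(self, key):
        """Any node can run this lookup -- there is no coordinator master."""
        tokens = [t for t, _ in self._ring]
        i = bisect(tokens, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

    def replicas(self, key, n=2):
        """Walk the ring clockwise to pick the next n replica nodes."""
        tokens = [t for t, _ in self._ring]
        i = bisect(tokens, self._hash(key)) % len(self._ring)
        return [self._ring[(i + k) % len(self._ring)][1] for k in range(n)]

ring = ToyRing(["node-a", "node-b", "node-c"])
owner = ring.owner("sensor-42")
reps = ring.replicas("sensor-42", n=2)
```

Because the mapping is deterministic, every node computes the same owner for a key, which is what lets any node accept any request without a single point of failure.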

5 Top features of Sqoop in the age of Big data

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import.
Imports can also be used to populate tables in Apache Hive or HBase. Exports can be used to put data from Hadoop into a relational database.

Sqoop became a top-level Apache Software Foundation project in March 2012.
Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.
Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of Sqoop.

5 Top features of Columnar Databases (1 of 2 )

The traditional RDBMS - Since the days of punch cards and magnetic tapes, files have been physically contiguous bytes that are accessed from start (open file) to finish (end-of-file flag = TRUE).

Yes, the storage could be split up on a disk and the assorted data pages connected by pointer chains, but it is still the same model. Then the file is broken into records (more physically contiguous bytes), and records are broken into fields (still more physically contiguous bytes).

A file is processed record by record (read/fetch next) or sequentially navigated in terms of a physical storage location (go to end of file, go back/forward n records, follow a pointer chain, etc.). There is no parallelism in this model. There is also an assumption of a physical ordering of the records within the file and an ordering of fields within the records.
A lot of time and resources have been spent sorting records to make this access practical; you did not do random access on a magnetic tape and you co…
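The contrast with a columnar layout is easy to show in a few lines: the same table stored as a list of records versus a dict of columns. A single-column aggregate over the columnar form touches only that column's storage (the data here is made up):

```python
# Row store: records are contiguous; a one-column aggregate still
# walks every record, field by field.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]
row_total = sum(r["sales"] for r in rows)

# Column store: each column is contiguous; the aggregate reads only
# the "sales" column and never touches "id" or "region".
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}
col_total = sum(columns["sales"])
```

Both layouts give the same answer; the difference is which bytes have to be read to get it, which is the whole argument for columnar databases in analytic workloads.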

5 Top features of MongoDB (1 of 5)

The most important of the philosophies that underpin MongoDB is the notion that one size does not fit all. For many years, traditional SQL databases (MongoDB is a document-oriented database) have been used for storing content of all types. It didn't matter whether the data was a good fit for the relational model (which is used in all RDBMS databases, such as MySQL, PostgreSQL, SQLite, Oracle, MS SQL Server, and so on). The data was stuffed in there, anyway.

Part of the reason for this is that, generally speaking, it's much easier (and more secure) to read from and write to a database than it is to write to a file system. If you pick up any book that teaches PHP (such as PHP for Absolute Beginners by Jason Lengstorf (Apress, 2009)), you'll probably find that almost right away the database is used to store information, not the file system.

It's just so much easier to do things that way. And while using a database as a storage bin works, developers always have to work agains…
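The document model behind that philosophy is simple to show: a record is one nested document rather than rows spread over joined tables. A sketch with plain Python dicts and JSON (pymongo is not used here; the data is invented):

```python
import json

# In a relational schema this would be three joined tables
# (posts, tags, comments). As a document it is one self-contained record.
post = {
    "title": "5 Top features of MongoDB",
    "tags": ["mongodb", "nosql"],
    "comments": [
        {"author": "alice", "text": "Nice overview"},
        {"author": "bob", "text": "What about indexes?"},
    ],
}

# Documents serialize naturally to JSON (MongoDB actually stores BSON,
# a binary cousin of JSON), and round-trip without a fixed schema.
encoded = json.dumps(post)
decoded = json.loads(encoded)
n_comments = len(decoded["comments"])
```

When the data genuinely is document-shaped, this avoids the object-relational mismatch the excerpt alludes to; when it is relational, an RDBMS remains the better fit, which is exactly the "one size does not fit all" point.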

The best helpful hdfs file system commands (3 of 4)


hadoop fs -dus PATH

dus reports the sum of the file sizes in aggregate rather than individually.


hadoop fs -expunge

Empties the trash. If the trash feature is enabled, when a file is deleted, it is first moved into a temporary .Trash folder. The file will be permanently deleted from the .Trash folder only after a user-configurable delay.

get

hadoop fs -get [-ignorecrc] [-crc] SRC LOCALDST

Copies files to the local file system.

All about Data Vault Business Intelligence system

Data Vault 2.0 (DV2) is a system of business intelligence that includes: modeling, methodology, architecture, and implementation best practices. The components, also known as the pillars of DV2 are identified as follows:

DV2 Modeling (changes to the model for performance and scalability)DV2 Methodology (following Scrum and agile best practices)DV2 Architecture (including NoSQL systems and Big Data systems)DV2 Implementation (pattern-based, automation, generation Capability Maturity Model Integration [CMMI] level 5) There are many special aspects of Data Vault, including the modeling style for the enterprise data warehouse. The methodology takes commonsense lessons from software development best practices such as CMMI, Six Sigma, total quality management (TQM), Lean initiatives, and cycle-time reduction and applies these notions for repeatability, consistency, automation, and error reduction.

Each of these components plays a key role in the overall success of an enterprise data warehou…

UNIX Shell script to check whether a number is even or odd

Unix shell script to check whether a given number is even or odd.

$ vi prg1
clear
echo "enter a number"
read x
y=`expr $x % 2`
if test $y -eq 0
then
echo "Number is even"
else
echo "Number is odd"
fi

Running the script:
$ sh prg1
enter a number
11
Number is odd
$ sh prg1
enter a number
12
Number is even

The best Hadoop architecture selected new questions

The Apache Hadoop web site defines Hadoop as "a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models." Quite simply, that's the philosophy: to provide a framework that's simple to use, can be scaled easily, and provides fault tolerance and high availability for production usage.

The idea is to use existing low-cost hardware to build a powerful system that can process petabytes of data very efficiently and quickly.

More :Top selected Hadoop Interview Questions

Hadoop achieves this by storing the data locally on its DataNodes and processing it locally as well. All this is managed efficiently by the NameNode, which is the brain of the Hadoop system. All client applications read/write data through the NameNode.

Hadoop has two main components: the Hadoop Distributed File System (HDFS) and a framework for processing large amounts of data in parallel using the MapReduce paradigm.
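The MapReduce paradigm itself can be sketched in a few lines of Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This shows the programming model only, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs -- in Hadoop this runs in parallel per split."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values -- one reducer call per key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map calls run on the DataNodes holding each block and only the shuffled pairs move over the network, which is the "process data locally" idea described above.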


HDFS is a dis…

Use of Solr in the age of Big data

Lucene is a search library, whereas Solr is the web application built on top of Lucene that simplifies the use of the underlying search features.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.


Solr is highly scalable, providing distributed search and index replication.

It powers the search and navigation features of many of the world's largest Internet sites.
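The speed of Lucene-based search comes from an inverted index: a map from each term to the documents containing it, so a query is a lookup plus a set intersection rather than a scan of every document. A toy version (an illustration of the concept, not Lucene's actual data structures):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: intersect the posting sets of every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a search library",
    3: "Solr provides faceted search",
}
index = build_index(docs)
hits = search(index, "solr search")
```

Features like faceting and highlighting are layered on top of the same posting lists; Solr's distributed search shards this index across machines.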

Integration of Big Data into Data warehousing:
Data Layer
Technology Layer
Analytics Layer

How to use Machine learning skills for Internet of Things IoT

Big data relates to extremely large and complex data sets which are difficult to process using traditional means.

Machine Learning comprises algorithms which learn from data, make predictions based on their learning, and have the ability to improve their outcomes with experience. Due to the enormity of data involved with Machine Learning, various technologies and frameworks have been developed to address the same. Hadoop is an open-source framework targeted for commodity hardware to address big data scale.

The distributed design of the Hadoop framework makes it an excellent fit to crunch data and draw insights from it by unleashing Machine Learning algorithms on it.
So, the true value of IoT comes from ubiquitous sensors’ relaying of data in real-time, getting that data over to Hadoop clusters in a central processing unit, absorbing the same, and performing Machine Learning on data to draw insights; all at petabyte scale or more.
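A tiny example of "drawing insights" from relayed sensor data: flag readings that deviate from a trailing moving average. The data and thresholds are invented, and a real pipeline would run a learned model at cluster scale; this is only the shape of the idea:

```python
from statistics import mean

def anomalies(readings, window=3, threshold=10.0):
    """Flag readings that differ from the trailing-window mean by more
    than `threshold`. A toy stand-in for a trained ML model."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = mean(readings[i - window:i])
        if abs(readings[i] - baseline) > threshold:
            flagged.append(i)
    return flagged

# Simulated temperature feed from one sensor; index 5 is a spike.
feed = [20.0, 20.5, 21.0, 20.8, 21.2, 45.0, 21.1, 20.9]
spikes = anomalies(feed, window=3, threshold=10.0)
```

At IoT scale the same per-sensor computation would be expressed as a distributed job over billions of readings, which is where the Hadoop framework described above comes in.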

In reviewing the use cases and challenges from precedin…

Greenplum Database basics in the age of Hadoop (1 of 2)

The Greenplum Database builds on the foundation of the open source database PostgreSQL. It primarily functions as a data warehouse and uses a shared-nothing, massively parallel processing (MPP) architecture.
How Greenplum works... In this architecture, data is partitioned across multiple segment servers, and each segment owns and manages a distinct portion of the overall data; there is no disk-level sharing nor data contention among segments.
Greenplum Database's parallel query optimizer transforms each query into a physical execution plan. Greenplum's optimizer uses a cost-based algorithm to evaluate potential execution plans, takes a global view of execution across the cluster, and factors in the costs of moving data between nodes.

The resulting query plans contain conventional relational database operations as well as parallel motion tran…
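Shared-nothing MPP execution can be sketched in three steps: hash-partition rows across segments, aggregate locally on each segment, then combine the partial results. The segment count and data below are invented; this is a single-process illustration, not Greenplum's engine:

```python
def partition(rows, n_segments):
    """Hash-distribute (key, value) rows so each segment owns a disjoint slice."""
    segments = [[] for _ in range(n_segments)]
    for key, value in rows:
        segments[hash(key) % n_segments].append((key, value))
    return segments

def parallel_sum(segments):
    """Each segment aggregates only its own data; a final step merges
    the per-segment partials -- no shared disk, no shared memory."""
    partials = [sum(v for _, v in seg) for seg in segments]
    return sum(partials)

rows = [("a", 10), ("b", 20), ("c", 30), ("d", 40)]
segments = partition(rows, n_segments=3)
total = parallel_sum(segments)
```

The cost-based optimizer's job is to decide when a plan like this needs a "motion" step that reshuffles rows between segments, since moving data across the interconnect is what dominates the cost.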

15 Awesome Features That Should Be Present in a Big Data System

This post gives useful points on the features of a big data system. Without the right features, you will miss the benefits that big data can deliver.

Traditional BI tools can quickly become overwhelmed by the large volume of big data. Latency—the time it takes to access the data—is as important a consideration as volume.

There is a subtle difference: suppose you need to run an ad hoc query against the large data set, or a predefined report.
A large data storage system is not a data warehouse, however, and it may not respond to queries in a few seconds. It is, rather, the organization-wide repository that stores all of its data and is the system that feeds into the data warehouses for management reporting. Big data needs to be considered in terms of how the data will be manipulated. The size of the data set will impact data capture, movement, storage, processing, presentation, analytics, reporting, and latency…

What is Data visualization- Basics

Nowadays we are flooded with data of diverse kinds due to increasing computational capability and accessibility. Specifically, in addition to public data available on the Internet (e.g., census, demographics, environmental data), data pertaining to personal daily activities are now more easily collected, for example, through mobile devices that can log people's running distances and times or their manual records of nutrition consumption.

Due to such expanded sources of data, new applications have appeared that involve data collection, visualization, exploration, and distribution in daily contexts. These applications not only display static information but also let users navigate the data in the form of interactive visualizations.

This emerging trend has brought both opportunities and challenges to interaction designers to develop new approaches to designing data-based applications.

Conveying information has been one of main functions of graphic and communication design since the ana…

The best visualization tool Tableau for Software Developers (1 of 2)

Why Tableau: 
Companies that have invested millions of dollars in BI systems are using spreadsheets for data analysis and reporting.

When BI system reports are received, traditional tools often employ inappropriate visualization methods. People want to make informed decisions with reliable information. They need timely reports that present the evidence to support their decisions. They want to connect with a variety of data sources, and they don't know the best ways to visualize data. Ideally, the tool used should automatically present the information using the best practices.

3 Kinds of Data

Known Data (type 1)
Encompassed in daily, weekly, and monthly reports that are used for monitoring activity, these reports provide the basic context used to inform discussion and frame questions. Type 1 reports aren't intended to answer questions. Their purpose is to provide visibility of operations.

Data YOU Know YOU need to Know (type 2)
Once patterns and outliers emerge in type 1 data the q…

The awesome basics about Hadoop Security

What is Hadoop Security

We live in a very insecure world. Starting with the key to your home's front door to those all-important virtual keys, your passwords, everything needs to be secured—and well. In the world of Big Data where humongous amounts of data are processed, transformed, and stored, it's all the more important to secure your data.

Imagine if your company spent a couple of million dollars installing a Hadoop cluster to gather and analyze your customers' spending habits for a product category using a Big Data solution.

Because that solution was not secure, your competitor got access to that data and your sales dropped 20% for that product category.

How did the system allow unauthorized access to data? 
Wasn't there any authentication mechanism in place? 
Why were there no alerts? This scenario should make you think about the importance of security, especially where sensitive data is involved.
Although Hadoop does have inherent security concerns due to its distri…

How to run HDInsight Hadoop on windows for data analysis

HDInsight is Microsoft's implementation of a Big Data solution with Apache Hadoop at its core. HDInsight is 100 percent compatible with Apache Hadoop and is built on open source components in conjunction with Hortonworks, a company focused toward getting Hadoop adopted on the Windows platform.

Basically, Microsoft has taken the open source Hadoop project, added the functionalities needed to make it compatible with Windows (because Hadoop is based on Linux), and submitted the project back to the community. All of the components are retested in typical scenarios to ensure that they work together correctly and that there are no versioning or compatibility issues.

Microsoft's Hadoop-based distribution brings the robustness, manageability, and simplicity of Windows to the Hadoop environment. The focus is on hardening security through integration with Active Directory, thus making it enterprise ready, simplifying manageability through integration with System Center 2012, and dramati…

Advanced Oozie for Software developers (Part 1 of 3)

Introduction to Oozie

Places or points of interest are specific locations that may be important to some people. Those locations are additionally associated with data that explains what is interesting or important about them.

Networking basics for IoT developers

These are typically locations where people come for entertainment, interaction, services, education, and other types of social activities. Examples of places include restaurants, museums, theaters, stadiums, hotels, landmarks, and so on. Many companies gather data about places and use this data in their applications.

In the telecommunications industry, probes are small packages of information sent from mobile devices. The majority of "smartphones" send probes regularly when the device is active and is running a geographical application (such as maps, navigation, traffic reports, and so on).

The probe frequency varies for different providers (from 5 seconds to 30 seconds). Probes are normally directed to phone carriers s…

The best helpful HDFS File System Commands (2 of 4)

copyFromLocal
Works similarly to the put command, except that the source is restricted to a local file reference.
hdfs dfs -copyFromLocal URI
hdfs dfs -copyFromLocal input/docs/data2.txt hdfs://localhost/user/rosemary/data2.txt

More: HDFS Commands (1 of 4)

copyToLocal
Works similarly to the get command, except that the destination is restricted to a local file reference.
hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI
hdfs dfs -copyToLocal data2.txt data2.copy.txt

count
Counts the number of directories, files, and bytes under the paths that match the specified file pattern.
hdfs dfs -count [-q]
hdfs dfs -count hdfs:// hdfs://

cp
Copies one or more files from a specified source to a specified destination. If you specify multiple sources, the specified destination must be a directory.
hdfs dfs -cp URI [URI …]
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

du
Displays the size of the specified file, or the sizes of files and direct…

The best helpful hdfs file system commands (1 of 4)

hadoop fs -cat FILE [ ... ]
Displays the file content. For reading compressed files, you should use the -text command instead.

hadoop fs -chgrp [-R] GROUP PATH [ PATH....]

Changes the group association for files and directories. The -R option applies the change recursively.

Big Data: Mobility Solutions for the Retail Industry

When it comes to retail solutions, mobility has multiple dimensions. Depending on the type of services and the beneficiaries, mobility in the retail industry can be grouped into the following two categories:
Mobility solutions for retailers (enterprise solutions)

Store inventory management solutions- Each retail chain or store is unique in its operations. Depending on the type of the products sold or services offered (apparel retailer, footwear retailer, grocery retailer, electronic goods retailer, pharmacy retailer, general merchandise retailer, etc.) and the general demand of a product, the retailers may stock a large volume or a limited quantity of products, and it is essential to keep track of the inventory details for each product for on-time reordering of the products to avoid any possible product out-of-stock situations. AIDC technologies such as barcode labels and RFID tags can be easily combined with mobility solutions to implement efficient and cost-effective store inventory m…