
How to Write a Spark SQL Query in Ten Steps

This post explains, in ten steps, how to write SQL queries in Spark. The example uses sqlContext.sql to create and load two tables and to select rows from the tables into two DataFrames.

The next steps use the DataFrame API to filter one of the tables for rows with a salary greater than 150,000 and show the resulting DataFrame. The two DataFrames are then joined to create a third DataFrame. Finally, the new DataFrame is saved to a Hive table.

1. At the command line, copy the Hue sample_07 and sample_08 CSV files to HDFS:
$ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
$ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_08.csv /user/hdfs

where HUE_HOME defaults to /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or /usr/lib/hue (package installation).

2. Start spark-shell:
$ spark-shell

3. Create Hive tables sample_07 and sample_08:

scala> sqlContext.sql("CREATE TABLE sample_07 (code string, description string, total_emp int, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
scala> sqlContext.sql("CREATE TABLE sample_08 (code string, description string, total_emp int, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")


4. In Beeline, show the Hive tables:
[0: jdbc:hive2://hostname.com:> show tables;
+------------+--+
| tab_name |
+------------+--+
| sample_07 |
| sample_08 |
+------------+--+


5. Load the data in the CSV files into the tables:
scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE sample_07")
scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_08.csv' OVERWRITE INTO TABLE sample_08")

6. Create DataFrames containing the contents of the sample_07 and sample_08 tables:
scala> val df_07 = sqlContext.sql("SELECT * from sample_07")
scala> val df_08 = sqlContext.sql("SELECT * from sample_08")
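
Before filtering, it can help to confirm that the DataFrames picked up the expected columns. The following is a minimal sketch, assuming it is pasted into the same spark-shell session (so sqlContext and df_07 are already defined) and that the tables were created as in step 3:

```scala
// Print the column names and types that Spark read from the Hive table definition
df_07.printSchema()

// List the column names as plain strings; with the CREATE TABLE statement
// from step 3 these should be code, description, total_emp and salary
df_07.columns.foreach(println)
```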

7. Show all rows in df_07 with salary greater than 150,000:
scala> df_07.filter(df_07("salary") > 150000).show()
The output should be:
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+
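
The same filter can also be written as a SQL query instead of a DataFrame API call. This is a sketch that assumes the same spark-shell session and the sample_07 table loaded in step 5; the name highPaid is a hypothetical one chosen for illustration:

```scala
// SQL equivalent of df_07.filter(df_07("salary") > 150000).
// highPaid is a hypothetical name used only in this sketch.
val highPaid = sqlContext.sql("SELECT * FROM sample_07 WHERE salary > 150000")

// Displays the matching rows (show() prints at most 20 rows by default)
highPaid.show()
```

Both forms return a DataFrame, so the choice between them is mostly a matter of which reads better in the surrounding code.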

8. Create the DataFrame df_09 by joining df_07 and df_08 on the code column, retaining only the code and description columns:
scala> val df_09 = df_07.join(df_08, df_07("code") ===
df_08("code")).select(df_07.col("code"),df_07.col("description"))
scala> df_09.show()

The new DataFrame looks like:
+-------+--------------------+
|   code|         description|
+-------+--------------------+
|00-0000|     All Occupations|
|11-0000|Management occupa...|
|11-1011|    Chief executives|
|11-1021|General and opera...|
|11-1031|         Legislators|
|11-2011|Advertising and p...|
|11-2021|  Marketing managers|
|11-2022|      Sales managers|
|11-2031|Public relations ...|
|11-3011|Administrative se...|
|11-3021|Computer and info...|
|11-3031|  Financial managers|
|11-3041|Compensation and ...|
|11-3042|Training and deve...|
|11-3049|Human resources m...|
|11-3051|Industrial produc...|
|11-3061| Purchasing managers|
|11-3071|Transportation, s...|
|11-9011|Farm, ranch, and ...|
+-------+--------------------+
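
The join can likewise be expressed directly in SQL. The sketch below assumes the same spark-shell session and the sample_07 and sample_08 tables from the earlier steps; the name joined and the aliases s07 and s08 are hypothetical, chosen for illustration:

```scala
// SQL equivalent of the DataFrame join in step 8: an inner join on code,
// keeping only code and description from sample_07.
// joined, s07 and s08 are hypothetical names used only in this sketch.
val joined = sqlContext.sql(
  """SELECT s07.code, s07.description
    |FROM sample_07 s07
    |JOIN sample_08 s08 ON s07.code = s08.code""".stripMargin)

joined.show()
```

Qualifying the columns with a table alias avoids the ambiguity that arises because both tables have code and description columns.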

9. Save DataFrame df_09 as the Hive table sample_09:
scala> df_09.write.saveAsTable("sample_09")
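
To check the result from within Spark, the saved table can be read back and compared with df_09. This is a sketch that assumes the same spark-shell session and that step 9 completed successfully; the name saved is a hypothetical one:

```scala
// Read the newly saved Hive table back as a DataFrame.
// saved is a hypothetical name used only in this sketch.
val saved = sqlContext.table("sample_09")

// The schema should contain only the code and description columns
saved.printSchema()

// Compare row counts between the saved table and the original DataFrame
println(saved.count() == df_09.count())

// Note: running saveAsTable again fails if sample_09 already exists,
// because the default save mode raises an error for an existing table.
// To replace the table instead, set the mode explicitly:
// df_09.write.mode("overwrite").saveAsTable("sample_09")
```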

10. In Beeline, show the Hive tables:
[0: jdbc:hive2://hostname.com:> show tables;
+------------+--+
| tab_name |
+------------+--+
| sample_07 |
| sample_08 |
| sample_09 |
+------------+--+
