Skip to main content

Spark SQL Query how to write it in Ten steps

Spark SQL example
Spark SQL example
The post tells how to write SQL query in Spark and explained in ten steps.This example demonstrates how to use sqlContext.sql to create and load two tables and select rows from the tables into two DataFrames.

The next steps use the DataFrame API to filter the rows for salaries greater than 150,000 from one of the tables and shows the resulting DataFrame. Then the two DataFrames are joined to create a third DataFrame. Finally the new DataFrame is saved to a Hive table.

1. At the command line, copy the Hue sample_07 and sample_08 CSV files to HDFS:
$ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
$ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_08.csv /user/hdfs

where HUE_HOME defaultsto /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or /usr/lib/hue
(package installation).

2. Start spark-shell:
$ spark-shell

3. Create Hive tables sample_07 and sample_08:

scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp
 int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
scala> sqlContext.sql("CREATE TABLE sample_08 (code string,description string,total_emp
 int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")

Also Read: Learn SparkSQL by your own with little money

4. In Beeline, show the Hive tables:
[0: jdbc:hive2://hostname.com:> show tables;
+------------+--+
| tab_name |
+------------+--+
16 | Spark Guide
Developing Spark Applications
| sample_07 |
| sample_08 |
+------------+--+

Also read: The role of Spark in Hadoop eco system

5. Load the data in the CSV files into the tables:
scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE
 sample_07")
scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_08.csv' OVERWRITE INTO TABLE
 sample_08")

6. Create DataFrames containing the contents of the sample_07 and sample_08 tables:
scala> val df_07 = sqlContext.sql("SELECT * from sample_07")
scala> val df_08 = sqlContext.sql("SELECT * from sample_08")

Apache Spark
7. Show all rows in df_07 with salary greater than 150,000:
scala> df_07.filter(df_07("salary") > 150000).show()
The output should be:
+-------+--------------------+---------+------+
| code| description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011| Chief executives| 299160|151370|
|29-1022|Oral and maxillof...| 5040|178440|
|29-1023| Orthodontists| 5350|185340|
|29-1024| Prosthodontists| 380|169360|
|29-1061| Anesthesiologists| 31030|192780|
|29-1062|Family and genera...| 113250|153640|
|29-1063| Internists, general| 46260|167270|
|29-1064|Obstetricians and...| 21340|183600|
|29-1067| Surgeons| 50260|191410|
|29-1069|Physicians and su...| 237400|155150|
+-------+--------------------+---------+------+

8.Create the DataFrame df_09 by joining df_07 and df_08, retaining only the code and description columns.
scala> val df_09 = df_07.join(df_08, df_07("code") ===
df_08("code")).select(df_07.col("code"),df_07.col("description"))
scala> df_09.show()

The new DataFrame looks like:
+-------+--------------------+
| code| description|
+-------+--------------------+
|00-0000| All Occupations|
|11-0000|Management occupa...|
|11-1011| Chief executives|
|11-1021|General and opera...|
|11-1031| Legislators|
|11-2011|Advertising and p...|
|11-2021| Marketing managers|
|11-2022| Sales managers|
|11-2031|Public relations ...|
|11-3011|Administrative se...|
|11-3021|Computer and info...|
|11-3031| Financial managers|
|11-3041|Compensation and ...|
|11-3042|Training and deve...|
|11-3049|Human resources m...|
|11-3051|Industrial produc...|
|11-3061| Purchasing managers|
|11-3071|Transportation, s...|
|11-9011|Farm, ranch, and ...|
+-------+--------------------+

9. Save DataFrame df_09 as the Hive table sample_09:
scala> df_09.write.saveAsTable("sample_09")

10. In Beeline, show the Hive tables:
[0: jdbc:hive2://hostname.com:> show tables;
+------------+--+
| tab_name |
+------------+--+
| sample_07 |
| sample_08 |
| sample_09 |
+------------+--+

Comments

Popular posts

Blue Prism complete tutorials download now

RPA blue prsim tutorial popular resources I have given in this post. You can download quickly.Learning Blue Prism is really good option if you are learner of Robotic process automation. The RPA is also called "Robotic Process Automation"- Real advantages are you can automate any business process and you can complete the customer requests in less time.

The Books Available on Blue Prism 
Blue Prism resourcesDavid chappal PDF bookBlue Prism BlogsVideo Training
RPA training The other Skills you need
Basic business skills and Domain skills are more than enough to be successful in this automation careerScripting languages like Perl/JSON/JavaScript/VBScript.  The interesting point is learning any RPA tool is not a problem. You can learn tool quickly. The real point is how quickly you apply your knowledge to implement automated tasks is important.


Also read
Robotic RPA Software developer skills you needBlue Prism tutorials download to learn quicklyPopular RPA tools functionality differen…

Three popular RPA tools functional differences

Robotic process automation is growing area and many IT developers across the board started up-skill in this popular area. I have written this post for the benefit of Software developers who are interested in RPA also called Robotic Process Automation.

In my previous post, I have described that total 12 tools are available in the market. Out of those 3 tools are most popular. Those are Automation anywhere, BluePrism and Uipath. Many programmers asked what are the differences between these tools. I have given differences of all these three RPA tools.

BluePrismBlue Prism has taken a simple concept, replicating user activity on the desktop, and made it enterprise strength. The technology is scalable, secure, resilient, and flexible and is supported by a comprehensive methodology, operational framework and provided as packaged software.The technology is developed and deployed within a “corridor of IT governance” and has sophisticated error handling and process modelling capabilities to ensu…

Robotic RPA Software developer skills you need

Robotic process automation is an upcoming and becoming most popular skill. As I said there are three popular tools. To become proficient in any one of the tool is really good to get a job in Developer role.
To get a job in this line, I found in my research that some programming skills and Hand-on training on any one of the tools is required. Also, try to to know differences in other popular rpa tools.

Most people are asking experience in tools like Automation anywhare, Blue Prism and Uipath. But, you cannot be proficient in all. So just know what are the differences. Ok...
You may ask a question like how to know. First join one good coaching institute and learn one tool perfectly. And start taking online training. Really good for you. Whatever you are lacking quickly you can learn online way.

To learn Uipath try here. Also, you can learn Automation anywhere tool online way.

The following are the list of IT skills commonly asking:
Automation anywhere/Blue Prism/Uipath.Net/C#/Java/SQL ski…