Running Hive Queries Using Spark SQL

Spark SQL can run queries against tables stored in Hive. This instructional blog post explores how it can be done.
Prerequisites:
  1. A Hadoop cluster configured on your system
  2. Hive installed and configured with Hadoop
  3. Spark installed on top of the Hadoop ecosystem
A few more configuration steps are needed after Hadoop, Hive, and Spark have been successfully set up.
Open the spark-env.sh file present in the $SPARK_HOME/conf directory and add HIVE_HOME as shown below.

export HIVE_HOME=/path_to_hive_installed_directory

Now copy the hive-site.xml file present in the $HIVE_HOME/conf directory to the $SPARK_HOME/conf directory.
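For example, the copy step can be sketched as follows (here the script falls back to scratch directories with a placeholder hive-site.xml when HIVE_HOME and SPARK_HOME are not set, so it can be tried anywhere; on a real cluster the environment variables point at the actual installations):

```shell
# Demonstration in scratch directories; a real install would already
# have HIVE_HOME and SPARK_HOME set and hive-site.xml in place.
HIVE_HOME="${HIVE_HOME:-$(mktemp -d)}"
SPARK_HOME="${SPARK_HOME:-$(mktemp -d)}"
mkdir -p "$HIVE_HOME/conf" "$SPARK_HOME/conf"
[ -f "$HIVE_HOME/conf/hive-site.xml" ] || printf '<configuration/>\n' > "$HIVE_HOME/conf/hive-site.xml"

# The actual step: copy Hive's site configuration into Spark's conf
# directory so that Spark SQL can find the Hive metastore settings.
cp "$HIVE_HOME/conf/hive-site.xml" "$SPARK_HOME/conf/"
```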
In the hive-site.xml now present in the $SPARK_HOME/conf directory, change hive.execution.engine from tez to mr:

<property>
    <name>hive.execution.engine</name>
    <value>mr</value>
</property>
Note:
spark, mr, and tez are all execution engines that run over YARN, so when you want to run your job through Spark, the engine should not be left as tez. A job can run on only a single engine, not on multiple engines.
Now, we will check for the tables inside Hive first. Let us start the Hive shell.
Note: Make sure that the Hive metastore service is running. If it is not, start it with the command hive --service metastore.
In the above screenshot, you can see the list of tables present in our Hive warehouse. Here we have loaded the Olympics data. We queried the first 5 rows of the olympics table using the following command:

select * from olympics limit 5;

Now let us check for the same by querying from the Spark Shell.
Now, start the Spark shell and import the Hive context. Create a variable for the HiveContext as follows:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
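Note that the code above targets Spark 1.x. On Spark 2.x and later, HiveContext is deprecated and the equivalent entry point is a SparkSession built with Hive support. A minimal sketch, assuming a Spark build with Hive support and the hive-site.xml from the previous step on its classpath (the application name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// In Spark 2.x+, SparkSession replaces HiveContext/SQLContext.
// enableHiveSupport() connects the session to the Hive metastore
// described by the hive-site.xml in Spark's conf directory.
val spark = SparkSession.builder()
  .appName("HiveQueries")   // hypothetical name for illustration
  .enableHiveSupport()
  .getOrCreate()

// The rest of the tutorial works the same way, via spark.sql(...).
spark.sql("show tables").show()
```

In spark-shell on Spark 2.x+, such a session is already created for you as the variable `spark`, so you can call `spark.sql(...)` directly.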

You can see the same in the following screenshot.
Now let us see the list of tables present in Hive from the Spark shell. Use the following command to view the list:

hiveContext.sql("show tables").collect().foreach(println)
 
Now let us query the olympics table from the Spark shell.

hiveContext.sql("select * from olympics limit 5").collect().foreach(println)

In the above screenshot, you can see the first 5 rows of the dataset. In this way, we can execute any kind of query on Hive data using the Spark SQL engine.
Let us now create a table from the Spark shell and see whether it gets reflected in Hive.
This is how you can create a table in Hive through the Spark shell:

val create_table = hiveContext.sql("create table spark_olympic(athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT) row format delimited fields terminated by '\t' stored as textfile")

We have created a table named spark_olympic, but we haven't loaded any data into it yet.
Let us see whether it is reflected in the Hive shell.
In the above screenshot, you can see that the table spark_olympic is visible in the Hive shell and that it contains no data.
Now, let us load data into this table from the Spark shell and then query it from Hive.
This is how we can load data into a Hive table from the Spark shell:

val load_data = hiveContext.sql("load data local inpath '/home/kiran/Desktop/olympix_data.csv' into table spark_olympic")

Now let us query this data from the Hive shell.
In the above screenshot, you can see that we have queried the recently loaded data.
We can run all Hive queries on Hive tables using the Spark SQL engine.

