Running Hive Queries Using Spark SQL

Spark SQL can run queries against tables stored in Hive. This instructional blog post explores how it can be done.
Prerequisites:
  1. A Hadoop cluster configured on your system
  2. Hive installed and configured with Hadoop
  3. Spark installed on top of the Hadoop ecosystem
A few more configuration steps are needed after Hadoop, Hive, and Spark have been successfully set up.
Open the spark-env.sh file present in the $SPARK_HOME/conf directory and add HIVE_HOME as shown below.

export HIVE_HOME=/path_to_hive_installed_directory

Now copy the hive-site.xml file present in the $HIVE_HOME/conf directory to the $SPARK_HOME/conf directory.
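For example, the copy step can be sketched as follows (here the script falls back to scratch directories with a placeholder hive-site.xml when HIVE_HOME and SPARK_HOME are not set, so it can be tried anywhere; on a real cluster the environment variables point at the actual installations):

```shell
# Demonstration in scratch directories; a real install would already
# have HIVE_HOME and SPARK_HOME set and hive-site.xml in place.
HIVE_HOME="${HIVE_HOME:-$(mktemp -d)}"
SPARK_HOME="${SPARK_HOME:-$(mktemp -d)}"
mkdir -p "$HIVE_HOME/conf" "$SPARK_HOME/conf"
[ -f "$HIVE_HOME/conf/hive-site.xml" ] || printf '<configuration/>\n' > "$HIVE_HOME/conf/hive-site.xml"

# The actual step: copy Hive's site configuration into Spark's conf
# directory so that Spark SQL can find the Hive metastore settings.
cp "$HIVE_HOME/conf/hive-site.xml" "$SPARK_HOME/conf/"
```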
In the hive-site.xml now present in the $SPARK_HOME/conf directory, change hive.execution.engine from tez to mr:

<property>
    <name>hive.execution.engine</name>
    <value>mr</value>
</property>
Note:
spark, mr, and tez are all execution engines that run over YARN, so when you want to run your job through Spark, the engine should not be left as tez. A job can run on only a single engine, not on multiple engines.
Now, we will check for the tables inside Hive first. Let us start the Hive shell.
Note: Make sure that the Hive metastore service is running. If it is not, start it with the command hive --service metastore.
In the above screenshot, you can see the list of tables present in our Hive warehouse. Here we have loaded the Olympics data. We queried the first 5 rows of the olympics table using the following command:

select * from olympics limit 5;

Now let us check for the same by querying from the Spark Shell.
Now, start the Spark shell and import the Hive context. Create a variable for the HiveContext as follows:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
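Note that the code above targets Spark 1.x. On Spark 2.x and later, HiveContext is deprecated and the equivalent entry point is a SparkSession built with Hive support. A minimal sketch, assuming a Spark build with Hive support and the hive-site.xml from the previous step on its classpath (the application name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// In Spark 2.x+, SparkSession replaces HiveContext/SQLContext.
// enableHiveSupport() connects the session to the Hive metastore
// described by the hive-site.xml in Spark's conf directory.
val spark = SparkSession.builder()
  .appName("HiveQueries")   // hypothetical name for illustration
  .enableHiveSupport()
  .getOrCreate()

// The rest of the tutorial works the same way, via spark.sql(...).
spark.sql("show tables").show()
```

In spark-shell on Spark 2.x+, such a session is already created for you as the variable `spark`, so you can call `spark.sql(...)` directly.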

You can see the same in the following screenshot.
Now let us see the list of tables present in Hive from the Spark shell. Use the following command to view the list:

hiveContext.sql("show tables").collect().foreach(println)
 
Now let us query the olympics table from the Spark shell.

hiveContext.sql("select * from olympics limit 5").collect().foreach(println)

In the above screenshot, you can see the first 5 rows of the dataset. In this way, we can execute any kind of query on Hive data using the Spark SQL engine.
Let us now create a table from the Spark shell and see whether it gets reflected in Hive.
This is how you can create a table in Hive through the Spark shell:

val create_table = hiveContext.sql("create table spark_olympic(athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT) row format delimited fields terminated by '\t' stored as textfile")

We have created a table named spark_olympic, but we haven't loaded any data into it yet.
Let us see whether it is reflected in the Hive shell.
In the above screenshot, you can see that the table spark_olympic is visible in the Hive shell and that it contains no data.
Now, let us load data into this table from the Spark shell and then query it from Hive.
This is how we can load data into a Hive table from the Spark shell:

val load_data = hiveContext.sql("load data local inpath '/home/kiran/Desktop/olympix_data.csv' into table spark_olympic")

Now let us query this data from the Hive shell.
In the above screenshot, you can see that we have queried the recently loaded data.
We can run all Hive queries on Hive tables using the Spark SQL engine.

