Read and Write Tables From Hive With PySpark

This article describes how to read and write tables from Hive with PySpark.

  1. Import the following packages:

    from pyspark.sql import SparkSession

    Add the following code snippet to make it work from a Jupyter Notebook app in Saagie:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
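If you launch the job from a command line rather than from a notebook cell, the same environment variables can be exported from the shell before starting Python (an equivalent sketch of the snippet above):

```shell
# Equivalent environment setup from a shell, run before launching
# the notebook or the Python script.
export HADOOP_USER_NAME="hdfs"
export PYTHON_VERSION="3.5.2"
```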
  2. Configure the Hive Metastore URI as follows:

    # The host below is a placeholder; replace it with the host of your
    # own Hive Metastore. Port 9083 is the default Metastore Thrift port.
    metastore_uri = "thrift://<metastore-host>:9083"
  3. Create your Spark session, with Hive support enabled, by running the following lines of code:

    sparkSession = (SparkSession.builder
                    .appName("example-pyspark-read-and-write")
                    .config("hive.metastore.uris", metastore_uri)
                    .enableHiveSupport()
                    .getOrCreate())
  4. You can now read and write tables from Hive by running the following lines of code:

    • Read Files

    # To read files from Hive.
    df_load = sparkSession.sql('SELECT * FROM example')
    df_load.show()

    • Write Files

    # To write files in Hive.
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)
    # Save the DataFrame as the Hive table read in the example above.
    df.write.mode('overwrite').saveAsTable('example')

    Spark 3.1 with Hive 1.1.0

    Starting with Spark 3.1, you have to update your command line to connect to a Hive Metastore V1.1.0. To help you out, the Hive .jar files for version 1.1.0 have already been uploaded to the Saagie Spark Docker images in the /opt/spark/hive_1.1.0_jars directory. To load these .jar files in your classpath, add the following parameters to the command line of your Spark job:

    spark-submit \
        --conf spark.sql.hive.metastore.jars=path \
        --conf spark.sql.hive.metastore.jars.path=file:///opt/spark/hive_1.1.0_jars/*.jar \
        --conf spark.sql.hive.metastore.version=1.1.0 \
        --conf spark.sql.catalogImplementation=hive
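These flags are fixed for a given Hive version, so if you launch Spark jobs from scripts it can be convenient to generate them. A minimal sketch in Python (the helper name is hypothetical, not part of Spark or Saagie):

```python
def hive_metastore_args(version, jars_dir):
    # Build the "--conf" arguments shown above for a given Hive
    # Metastore version and jar directory.
    confs = {
        # Tell Spark to load the Metastore client jars from a path...
        "spark.sql.hive.metastore.jars": "path",
        # ...namely every .jar file in the given directory.
        "spark.sql.hive.metastore.jars.path": f"file://{jars_dir}/*.jar",
        "spark.sql.hive.metastore.version": version,
        "spark.sql.catalogImplementation": "hive",
    }
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args

# Reproduce the spark-submit command line for Hive 1.1.0.
print(" \\\n    ".join(["spark-submit"] + hive_metastore_args("1.1.0", "/opt/spark/hive_1.1.0_jars")))
```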