Read and Write Tables From Hive With PySpark
- Import the following packages:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
You can add the following code snippet to make it work from a Jupyter Notebook app in Saagie:
import os

os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.2"
- Initialize your Spark session with Hive support enabled, and create an example DataFrame to write later:

# Initialize a Spark session with Hive support enabled.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

# Create a DataFrame from the provided data.
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
Where:

- example-pyspark-read-and-write-from-hive can be replaced with the name of your Spark app.
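Once the session is created, you can quickly check that Hive support is active. This is a minimal sketch; the databases returned depend on your own Hive Metastore:

# The catalog implementation should be 'hive' when Hive support is enabled.
print(sparkSession.conf.get("spark.sql.catalogImplementation"))

# List the databases visible through the Hive Metastore.
for db in sparkSession.catalog.listDatabases():
    print(db.name)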
- You can now read and write tables from Hive by running the following lines of code:
# Read a table from Hive.
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()

Where:

- example can be replaced with the name of your table.
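Any other HiveQL statement can be run the same way through sparkSession.sql. For example, assuming the example table was written from the df DataFrame above (whose columns are auto-named _1 and _2), you could list the available tables and read back a filtered subset:

# List the tables available in the default Hive database.
sparkSession.sql('SHOW TABLES IN default').show()

# Read only a subset of rows; the column names _1 and _2 assume the table
# was created from the df DataFrame defined above.
df_subset = sparkSession.sql('SELECT _1, _2 FROM example WHERE _2 > 2')
df_subset.show()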
# Write a table in Hive.
df.write.saveAsTable('example')

Where:

- example can be replaced with the name of your table.
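By default, saveAsTable fails if the table already exists. If you need to overwrite or append instead, you can pass an explicit save mode. The following lines are a sketch; the database name my_database is only a placeholder for one of your own Hive databases:

# Overwrite the table if it already exists.
df.write.mode('overwrite').saveAsTable('example')

# Append the rows of the DataFrame to an existing table.
df.write.mode('append').saveAsTable('example')

# Write into a table of a specific Hive database.
df.write.mode('overwrite').saveAsTable('my_database.example')

# Read it back through the DataFrame API instead of a SQL query.
df_load = sparkSession.table('my_database.example')
df_load.show()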
Spark 3.1 with Hive 1.1.0

Starting with Spark 3.1, you have to update your command line to connect to a Hive Metastore v1.1.0. To help you out, the Hive .jar files for version 1.1.0 have already been uploaded to the Saagie Spark Docker images, in the /opt/spark/hive_1.1.0_jars directory. To load these .jar files in your classpath, add the following parameters to the command line of your Spark job:

spark-submit \
... \
--conf spark.sql.hive.metastore.jars=path \
--conf spark.sql.hive.metastore.jars.path=file:///opt/spark/hive_1.1.0_jars/*.jar \
--conf spark.sql.hive.metastore.version=1.1.0 \
--conf spark.sql.catalogImplementation=hive \
...
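If you prefer to keep these settings in your application code rather than on the spark-submit command line, the same configuration keys can also be passed to the SparkSession builder before the session is created. This is a minimal sketch using the values shown above; adjust the path and the app name to your own setup:

from pyspark.sql import SparkSession

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                # Point Spark to the Hive 1.1.0 client .jar files shipped in the Saagie image.
                .config('spark.sql.hive.metastore.jars', 'path')
                .config('spark.sql.hive.metastore.jars.path', 'file:///opt/spark/hive_1.1.0_jars/*.jar')
                .config('spark.sql.hive.metastore.version', '1.1.0')
                # enableHiveSupport() sets spark.sql.catalogImplementation to 'hive'.
                .enableHiveSupport()
                .getOrCreate())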