Read and Write Tables From Hive With PySpark
- Import the following packages:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
Add the following code snippet to make it work from a Jupyter Notebook app in Saagie:
import os

os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.2"
- Create your Spark session with Hive support enabled, so that it can connect to the Hive metastore, then build a sample DataFrame to write:

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
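The snippet above relies on the Hive metastore URI already configured on the platform (typically through hive-site.xml). If you need to point the session at a metastore explicitly, you can pass the hive.metastore.uris property on the builder; the thrift address below is only a placeholder, so replace it with your own metastore host and port:

from pyspark.sql import SparkSession

# Placeholder metastore address for illustration; use your own host and port.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config('hive.metastore.uris', 'thrift://hive-metastore:9083')
                .enableHiveSupport()
                .getOrCreate())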
- Alternatively, for jobs that do not need Hive support, you can create a basic Spark session by running the following line of code:

sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
- You can now read and write tables from Hive by running the following lines of code:

# To read tables from Hive.
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()

# To write tables in Hive.
df.write.saveAsTable('example')
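Note that saveAsTable fails by default if the table already exists. As an optional variant (not part of the original steps), you can choose a save mode and qualify the table with a database name; the database name below is a placeholder:

# Placeholder database name; replace 'mydatabase' with an existing Hive database.
df.write.mode('overwrite').saveAsTable('mydatabase.example')

# Read the table back through the DataFrame API instead of a SQL query.
df_load = sparkSession.table('mydatabase.example')
df_load.show()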
Spark 3.1 with Hive 1.1.0

Starting with Spark 3.1, you have to update your command line to connect to a Hive metastore v1.1.0. To help you out, the Hive .jar files for version 1.1.0 have already been uploaded to the Saagie Spark Docker images in the /opt/spark/hive_1.1.0_jars directory. To load these .jar files in your classpath, add the following parameters to the command line of your Spark job:

spark-submit \
... --conf spark.sql.hive.metastore.jars=path \
--conf spark.sql.hive.metastore.jars.path=file:///opt/spark/hive_1.1.0_jars/*.jar \
--conf spark.sql.hive.metastore.version=1.1.0 \
--conf spark.sql.catalogImplementation=hive ...
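If you would rather keep this configuration in your PySpark code than on the spark-submit command line, the same properties can normally be set on the session builder as well. This is only a sketch under that assumption; whether it is sufficient depends on how your job is launched on the platform:

from pyspark.sql import SparkSession

# Sketch: the Hive 1.1.0 metastore settings from the spark-submit example,
# set programmatically. They are static configurations, so they must be set
# before the first SparkSession is created.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config('spark.sql.hive.metastore.version', '1.1.0')
                .config('spark.sql.hive.metastore.jars', 'path')
                .config('spark.sql.hive.metastore.jars.path',
                        'file:///opt/spark/hive_1.1.0_jars/*.jar')
                .enableHiveSupport()  # sets spark.sql.catalogImplementation=hive
                .getOrCreate())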