Read and Write Tables From Hive With PySpark

How to read and write tables from Hive with PySpark.

  1. Import the following packages:

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, HiveContext

    You can add the following code snippet to make it work from a Jupyter Notebook app in Saagie:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
  2. Initialize your Spark session with Hive support enabled by running the following lines of code:

    # Initialize a Spark session with Hive support enabled.
    sparkSession = (SparkSession
     .builder
     .appName('example-pyspark-read-and-write-from-hive')
     .enableHiveSupport()
     .getOrCreate())
    
    # Create a DataFrame from the provided data.
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)

    Where:

    • "example-pyspark-read-and-write" can be replaced with the name of your Spark app.

  3. You can now read and write tables from Hive by running the following lines of code:

    • Read Tables

    # Read a table from Hive.
    df_load = sparkSession.sql('SELECT * FROM example')
    df_load.show()

    Where:

    • example can be replaced with the name of your table.

    • Write Tables

    # Write a table to Hive.
    df.write.saveAsTable('example')

    Where:

    • example can be replaced with a name for your table.
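
    As a minimal variant, assuming the same sparkSession and df as above, tables can also be read through the DataFrame API and written with an explicit save mode, so that re-running the job does not fail when the table already exists:

    # Read the same table through the DataFrame API instead of SQL.
    df_load = sparkSession.read.table('example')

    # Overwrite the table if it already exists; use 'append' to add rows instead.
    df.write.mode('overwrite').saveAsTable('example')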

    Spark 3.1 with Hive 1.1.0

    Starting with Spark 3.1, you must update your command line to connect to a Hive Metastore v1.1.0. To help you out, the Hive .jar files for version 1.1.0 have already been uploaded to the Saagie Spark Docker images, in the /opt/spark/hive_1.1.0_jars directory. To load these .jar files in your classpath, add the following parameters to the command line of your Spark job:

    spark-submit \
    ...
    --conf spark.sql.hive.metastore.jars=path \
    --conf spark.sql.hive.metastore.jars.path=file:///opt/spark/hive_1.1.0_jars/*.jar \
    --conf spark.sql.hive.metastore.version=1.1.0 \
    --conf spark.sql.catalogImplementation=hive \
    ...
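
    If you build the session from application code instead (for example in a notebook) and no Spark session exists yet, the same settings can in principle be applied on the builder. This is a sketch under that assumption; on some platforms these static options are only honored when passed on the spark-submit command line:

    from pyspark.sql import SparkSession

    # Same Hive 1.1.0 metastore settings, applied on the builder instead of spark-submit.
    # Assumes the jars shipped in the Saagie image at /opt/spark/hive_1.1.0_jars.
    sparkSession = (SparkSession
     .builder
     .appName('example-pyspark-read-and-write-from-hive')
     .config('spark.sql.hive.metastore.jars', 'path')
     .config('spark.sql.hive.metastore.jars.path', 'file:///opt/spark/hive_1.1.0_jars/*.jar')
     .config('spark.sql.hive.metastore.version', '1.1.0')
     .enableHiveSupport()  # sets spark.sql.catalogImplementation=hive
     .getOrCreate())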