Read and Write Files From Hive With PySpark

How to read and write tables from Hive with PySpark.

  1. Install the following packages:

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, HiveContext

    Add the following code snippet to make it work from a Jupyter Notebook app in Saagie:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
  2. Configure the Hive Metastore URI as follows:

    sparkSession = (SparkSession
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)
  3. Create your Spark session by running the following lines of code:

    sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
  4. You can now read and write files from HDFS by running the following lines of code:

    • Read Files

    • Write Files

    # To read files from Hive.
    df_load = sparkSession.sql('SELECT * FROM example')
    # To write files in Hive.
    Spark 3.1 with Hive 1.1.0

    Starting with Spark 3.1, you have to update your command line to connect to a Hive Metastore V1.1.0. To help you out, the Hive .jar files for version 1.1.0 have already been uploaded to the Saagie Spark Docker images in the /opt/spark/hive_1.1.0_jars directory. To load these .jar files in your classpath, add the following parameters to the command line of your Spark job:

    spark-submit \
    --conf spark.sql.hive.metastore.jars=path \
    --conf spark.sql.hive.metastore.jars.path=file:///opt/spark/hive_1.1.0_jars/*.jar \
    --conf spark.sql.hive.metastore.version=1.1.0 \
    --conf spark.sql.catalogImplementation=hive