Read and Write Files From HDFS With PySpark

How to read and write files from HDFS with PySpark.

  1. Import the SparkSession class:

    from pyspark.sql import SparkSession

    To run it from a Jupyter Notebook app in Saagie, also add the following snippet, which sets the HDFS user and the Python version through environment variables:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
  2. Create your Spark session by running the following line of code:

    sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
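
    This session connects to whatever file system your Hadoop configuration defines as the default. If you need to point Spark at a specific HDFS NameNode instead, you can set it on the builder. The following is a minimal sketch; hdfs://cluster is an assumed NameNode address matching the examples below, so replace it with your own:

    # Minimal sketch: "hdfs://cluster" is an assumed NameNode address,
    # not a value from your cluster. Adjust it to your environment.
    sparkSession = (
        SparkSession.builder
        .appName("example-pyspark-read-and-write")
        .config("spark.hadoop.fs.defaultFS", "hdfs://cluster")
        .getOrCreate()
    )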
  3. You can now read and write files from HDFS by running the following lines of code:

    • Read Files

    # Read a CSV file from HDFS.
    df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
    df_load.show()

    • Write Files

    # Create sample data.
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)

    # Write the DataFrame to HDFS as CSV.
    df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
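
    In practice, you will often want to pass read options and choose a write mode explicitly, because the default write mode fails if the target path already exists. The following is a sketch of common variations, reusing the hypothetical hdfs://cluster/user/hdfs/test path from the examples above:

    # Read a CSV file that has a header row, letting Spark infer column types.
    df_load = (
        sparkSession.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("hdfs://cluster/user/hdfs/test/example.csv")
    )

    # Overwrite the target path if it already exists instead of raising an error.
    df.write.mode("overwrite").csv("hdfs://cluster/user/hdfs/test/example.csv")

    # Parquet preserves column types and is usually a better fit than CSV
    # for data that stays in HDFS.
    df.write.mode("overwrite").parquet("hdfs://cluster/user/hdfs/test/example.parquet")

    # Stop the session to release cluster resources when you are done.
    sparkSession.stop()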