Read and Write Files From HDFS With PySpark

How to read and write files from HDFS with PySpark.

  1. Import the SparkSession class from the pyspark.sql package:

    from pyspark.sql import SparkSession

    To make it work from a Jupyter Notebook app in Saagie, you can add the following code snippet:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
  2. Create your Spark session by running the following line of code:

    sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()

    Where:

    • "example-pyspark-read-and-write" can be replaced with the name of your Spark app.

  3. You can now read and write files from HDFS by running the following lines of code:

    • Read Files

    # Read a CSV file from HDFS.
    df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
    df_load.show()

    Where:

    • 'hdfs://cluster/user/hdfs/test/example.csv' must be replaced with the path to the CSV file in HDFS.
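
    By default, read.csv treats the first row as data and loads every column as a string. If your file has a header row, you can pass options when reading; a minimal sketch, reusing the example path above:

    # Read the CSV with a header row and let Spark infer column types.
    df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv',
                                    header=True, inferSchema=True)
    df_load.printSchema()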

    • Write Files

    # Create a DataFrame from the provided data.
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)
    
    # Write a CSV file to HDFS.
    df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")

    Where:

    • 'hdfs://cluster/user/hdfs/test/example.csv' must be replaced with the path where the file will be written in HDFS.
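
    Note that by default, write.csv fails if the target path already exists. You can pass a save mode to change this behavior; a minimal sketch, reusing the example path above:

    # Overwrite any existing data at the target path and write a header row.
    df.write.csv("hdfs://cluster/user/hdfs/test/example.csv",
                 mode="overwrite", header=True)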