Read and Write Files From HDFS With PySpark

How to read and write files from HDFS with PySpark.

  1. Import the SparkSession class:

    from pyspark.sql import SparkSession

    To run it from a Jupyter Notebook app in Saagie, also add the following snippet, which sets the HDFS user and the Python version through environment variables:

    import os
    os.environ["HADOOP_USER_NAME"] = "hdfs"
    os.environ["PYTHON_VERSION"] = "3.5.2"
  2. Create your Spark session by running the following line of code:

    sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
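
    This session connects to whatever file system your Hadoop configuration defines as the default. If you need to point Spark at a specific HDFS NameNode instead, you can set it on the builder. The following is a minimal sketch; hdfs://cluster is an assumed NameNode address matching the examples below, so replace it with your own:

    # Minimal sketch: "hdfs://cluster" is an assumed NameNode address,
    # not a value from your cluster. Adjust it to your environment.
    sparkSession = (
        SparkSession.builder
        .appName("example-pyspark-read-and-write")
        .config("spark.hadoop.fs.defaultFS", "hdfs://cluster")
        .getOrCreate()
    )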
  3. You can now read and write files from HDFS by running the following lines of code:

    • Read Files

    # Read a CSV file from HDFS.
    df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
    df_load.show()

    • Write Files

    # Create sample data.
    data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
    df = sparkSession.createDataFrame(data)

    # Write the DataFrame to HDFS as CSV.
    df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
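
    In practice, you will often want to pass read options and choose a write mode explicitly, because the default write mode fails if the target path already exists. The following is a sketch of common variations, reusing the hypothetical hdfs://cluster/user/hdfs/test path from the examples above:

    # Read a CSV file that has a header row, letting Spark infer column types.
    df_load = (
        sparkSession.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("hdfs://cluster/user/hdfs/test/example.csv")
    )

    # Overwrite the target path if it already exists instead of raising an error.
    df.write.mode("overwrite").csv("hdfs://cluster/user/hdfs/test/example.csv")

    # Parquet preserves column types and is usually a better fit than CSV
    # for data that stays in HDFS.
    df.write.mode("overwrite").parquet("hdfs://cluster/user/hdfs/test/example.parquet")

    # Stop the session to release cluster resources when you are done.
    sparkSession.stop()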