Read and Write Files From HDFS With PySpark
- Import the following package:

from pyspark.sql import SparkSession
You can add the following code snippet to make it work from a Jupyter Notebook app in Saagie:
import os

os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.2"
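Here, HADOOP_USER_NAME sets the user that HDFS operations run as when the cluster uses simple (non-Kerberos) authentication; adjust both values to match your cluster user and your environment's Python version.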
- Create your Spark session by running the following line of code:
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
Where:

- "example-pyspark-read-and-write" can be replaced with the name of your Spark app.
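If the notebook does not pick up your cluster's Hadoop configuration automatically, you can also pass the HDFS NameNode address when building the session. The following is a minimal sketch, where the fs.defaultFS value hdfs://cluster is an assumption to be replaced with your own NameNode URI:

from pyspark.sql import SparkSession

sparkSession = (
    SparkSession.builder
    .appName("example-pyspark-read-and-write")
    # Assumed NameNode URI; replace with your own cluster's address.
    .config("spark.hadoop.fs.defaultFS", "hdfs://cluster")
    .getOrCreate()
)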
You can now read and write files from HDFS by running the following lines of code:
# Read a CSV file from HDFS.
df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
df_load.show()
Where:

- 'hdfs://cluster/user/hdfs/test/example.csv' must be replaced with the path to the CSV file in HDFS.
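If the CSV file has a header row, you can ask Spark to use it for column names and to infer column types instead of reading every column as a string. A minimal sketch, using the same placeholder path:

# Read the CSV, using the first line as column names and inferring column types.
df_load = sparkSession.read.csv(
    'hdfs://cluster/user/hdfs/test/example.csv',
    header=True,
    inferSchema=True,
)
df_load.printSchema()
df_load.show()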
# Create a DataFrame from the provided data.
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write a CSV file in HDFS.
df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
Where:

- 'hdfs://cluster/user/hdfs/test/example.csv' must be replaced with the path where the file will be written in HDFS. Note that Spark writes this path as a directory containing one or more part files, not as a single CSV file.
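By default, Spark raises an error if the target path already exists. A minimal sketch of a more explicit write, which overwrites the target directory and includes a header row (the path is the same placeholder as above):

# Overwrite the target directory if it exists and write a header row.
df.write.mode("overwrite").option("header", True).csv(
    "hdfs://cluster/user/hdfs/test/example.csv"
)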