Read and Write Files From HDFS With PySpark
- Import the following package:
from pyspark.sql import SparkSession
To make this work from a Jupyter notebook app in Saagie, also add the following code snippet:
import os

# Identify to HDFS as the "hdfs" user; with simple authentication, this
# determines the permissions the job runs with.
os.environ["HADOOP_USER_NAME"] = "hdfs"
# Select the Python version to use (here, 3.5.2).
os.environ["PYTHON_VERSION"] = "3.5.2"
- Create your Spark session by running the following lines of code:
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
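The read and write examples below address HDFS as hdfs://cluster/..., which assumes the NameNode is reachable under the host name cluster. If your NameNode lives at a different address, one option is to set the default file system when building the session. A minimal sketch, where namenode and port 8020 are placeholders for your cluster's actual NameNode address:

# Hypothetical alternative: point Spark's Hadoop configuration at the
# NameNode explicitly. "namenode:8020" is a placeholder, not a real address.
sparkSession = SparkSession.builder \
    .appName("example-pyspark-read-and-write") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020") \
    .getOrCreate()

With fs.defaultFS set this way, paths such as /user/hdfs/test/example.csv can also be used without the hdfs://cluster prefix.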
- You can now read and write files from HDFS by running the following lines of code:
# To read files from HDFS.
df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
df_load.show()
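By default, Spark reads every CSV column as a string and treats the first row as data. If your file has a header row, the standard CSV reader options apply; a short sketch using the same path:

# header=True uses the first row as column names;
# inferSchema=True samples the data to guess each column's type.
df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv',
                                header=True, inferSchema=True)
df_load.printSchema()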
# To create data.
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# To write files in HDFS. Spark raises an error when writing to a path that
# already exists (as this one does, having just been read), so overwrite
# mode is set explicitly.
df.write.mode("overwrite").csv("hdfs://cluster/user/hdfs/test/example.csv")
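Note that Spark writes example.csv as a directory of part files (one per partition) rather than a single file. To verify the round trip, you can read the result back; the same pattern works for other formats, for instance Parquet, sketched here with a hypothetical sibling path:

# Read back what was just written to verify the round trip.
df_check = sparkSession.read.csv("hdfs://cluster/user/hdfs/test/example.csv")
df_check.show()

# The same pattern with Parquet, which also stores the schema.
# "example_parquet" is a placeholder path.
df.write.mode("overwrite").parquet("hdfs://cluster/user/hdfs/test/example_parquet")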