Read and Write Files From HDFS With Spark Scala

This tutorial shows how to read and write Parquet and CSV files in HDFS with Spark in Scala.

  1. Add the following SBT dependencies to your build.sbt (the provided scope assumes Spark is already available on the cluster at runtime):

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
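
     For reference, a minimal build.sbt wiring in these dependencies could look as follows; the project name and Scala version are assumptions (Spark 2.4.0 is published for both Scala 2.11 and 2.12):

    name := "example-spark-scala-read-and-write-from-hdfs"
    version := "1.0"
    scalaVersion := "2.11.12"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"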
  2. Create your Spark session by running the following lines of code:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder()
      .appName("example-spark-scala-read-and-write-from-hdfs")
      .getOrCreate()
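
     When the code is launched outside of spark-submit (for example from a local test run), no master is supplied to it, so the builder has to set one explicitly. A minimal sketch, assuming local execution on all available cores:

    // Local-mode session for quick tests; when submitting to a cluster,
    // drop .master(...) and let spark-submit provide it.
    val sparkSession = SparkSession.builder()
      .appName("example-spark-scala-read-and-write-from-hdfs")
      .master("local[*]")
      .getOrCreate()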
  3. You can now read and write files from HDFS with the following lines of code. In the snippet below, hdfs_master is a placeholder for the URI of your HDFS NameNode, and the read examples assume the target files already exist:

    // ====== To import the required classes and implicits.
    import org.apache.spark.sql.SaveMode
    import sparkSession.implicits._

    // ====== To define the HDFS URI (placeholder, adjust to your NameNode).
    val hdfs_master = "hdfs://namenode:9000/"

    // ====== To read files.
    // To read Parquet files into a Spark Dataframe.
    val df_parquet = sparkSession.read.parquet(hdfs_master + "user/hdfs/wiki/testwiki")
    // To read CSV files into a Spark Dataframe.
    val df_csv = sparkSession.read.option("inferSchema", "true").csv(hdfs_master + "user/hdfs/wiki/testwiki.csv")

    // ====== To define a HelloWorld case class.
    case class HelloWorld(message: String)

    // ====== To create a Dataframe with 1 partition.
    val df = Seq(HelloWorld("helloworld")).toDF().coalesce(1)

    // ====== To write files.
    // To write the Dataframe as a Parquet file.
    df.write.mode(SaveMode.Overwrite).parquet(hdfs_master + "user/hdfs/wiki/testwiki")
    // To write the Dataframe as a CSV file.
    df.write.mode(SaveMode.Overwrite).csv(hdfs_master + "user/hdfs/wiki/testwiki.csv")
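
     To verify that the writes succeeded, you can read the Parquet output back and display it; a short sketch reusing the session and path from above:

    // Read back the freshly written Parquet file and print its content.
    val df_check = sparkSession.read.parquet(hdfs_master + "user/hdfs/wiki/testwiki")
    // Expected: a single row containing "helloworld".
    df_check.show()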