Read and Write Files From HDFS With Spark Scala

This page shows how to read files from and write files to HDFS with Spark Scala.

  1. Add the following sbt (Scala Build Tool) dependencies to your build.sbt file:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"

    Because the Spark Scala dependencies are already available on Saagie, they are marked as "provided" so they are excluded from the assembled JAR file, keeping it light.
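
    For reference, a minimal build.sbt might look like the following; the project name and Scala version are assumptions (Spark 2.4.0 is built against Scala 2.11 by default):

    name := "example-spark-scala-read-and-write-from-hdfs"
    version := "1.0"
    scalaVersion := "2.11.12"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"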

  2. Create your Spark session by running the following lines of code:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder().appName("example-spark-scala-read-and-write-from-hdfs").getOrCreate()

    Where:

    • "example-pyspark-read-and-write" can be replaced with the name of your Spark app.

  3. You can now read and write files from HDFS by running the following lines of code:

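    The snippets below reference an hdfs_master value holding the base URI of your HDFS NameNode. This value is an assumption and must be defined first; a minimal sketch, where the host and port are placeholders to replace with your own:

    // Base HDFS URI; replace namenode and 8020 with your NameNode's host and port.
    val hdfs_master = "hdfs://namenode:8020/"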

    import org.apache.spark.sql.SaveMode
    import sparkSession.implicits._

    // Read files
    // Read a Parquet file into a Spark DataFrame.
    val df_parquet = sparkSession.read.parquet(hdfs_master + "user/hdfs/wiki/testwiki")
    // Read a CSV file into a Spark DataFrame, inferring the schema from the data.
    val df_csv = sparkSession.read.option("inferSchema", "true").csv(hdfs_master + "user/hdfs/wiki/testwiki.csv")

    // Write files
    // Define a case class named HelloWorld with a single attribute, message, of type String.
    case class HelloWorld(message: String)

    // Create a DataFrame with a single row, coalesced to 1 partition.
    val df = Seq(HelloWorld("helloworld")).toDF().coalesce(1)

    // Write the DataFrame to a Parquet file at the specified path, overwriting any existing data.
    df.write.mode(SaveMode.Overwrite).parquet(hdfs_master + "user/hdfs/wiki/testwiki")
    // Write the DataFrame to a CSV file at the specified path, overwriting any existing data.
    df.write.mode(SaveMode.Overwrite).csv(hdfs_master + "user/hdfs/wiki/testwiki.csv")
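
    To verify the round trip, you can read the written files back and inspect them; a minimal sketch reusing the session and paths above:

    // Read back the Parquet output and display the first rows.
    sparkSession.read.parquet(hdfs_master + "user/hdfs/wiki/testwiki").show()

    // Stop the Spark session once the job is finished.
    sparkSession.stop()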