Package Your Spark Scala Code With the Assembly Plugin

To run Spark Scala jobs on Saagie, you must package your code with the Assembly plugin, which gathers all of your project’s dependencies into a single fat JAR file.

  1. Install the plugin in your project by adding the following line to the project/assembly.sbt file:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
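    For reference, a project using the plugin typically has the following layout (the application file name is illustrative; only build.sbt and project/assembly.sbt are required by the steps below):

    my-spark-application/
    ├── build.sbt
    ├── project/
    │   └── assembly.sbt
    └── src/
        └── main/
            └── scala/
                └── MySparkApplication.scala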
  2. Configure the plugin by adding the following lines to the build.sbt file:

    import sbt.Keys._
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

    This configuration ensures that the resulting fat JAR includes only your project’s dependencies, not the Scala standard library.

    Example of the build.sbt file
    import sbt.Keys._
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
    
    name := "my-spark-application"
    version := "0.1"
    scalaVersion := "2.11.12"
    val SPARK_VERSION = "2.4.0"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided", (1)
      "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided" (1)
    )
    
    assemblyMergeStrategy in assembly := {
      // Discard everything under META-INF (manifests, index lists, and
      // signature files) to avoid duplicate-file conflicts between
      // dependency JARs.
      case PathList("META-INF", _*) => MergeStrategy.discard
      // Concatenate configuration files so that entries from all JARs are kept.
      case "conf/application.conf" => MergeStrategy.concat
      // For any other duplicate file, keep the first occurrence found.
      case _ => MergeStrategy.first
    }
    
    // Do not run tests when building the assembly JAR.
    test in assembly := {}
    // Run tests sequentially rather than in parallel.
    parallelExecution in Test := false

    Where:

    1 Because the Spark Scala dependencies already exist on the Saagie platform, declare them with the "provided" scope in your build.sbt file to avoid producing an unnecessarily heavy JAR file.
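
Once packaged this way, your entry point is an ordinary Spark Scala program. The following minimal sketch (the object name MySparkApplication and the sample DataFrame are illustrative, not required by Saagie) compiles against the "provided" dependencies declared above:

    import org.apache.spark.sql.SparkSession

    object MySparkApplication {
      def main(args: Array[String]): Unit = {
        // spark-sql is needed at compile time only; at runtime the Spark
        // distribution already present on Saagie supplies it, hence the
        // "provided" scope in build.sbt.
        val spark = SparkSession.builder()
          .appName("my-spark-application")
          .getOrCreate()

        import spark.implicits._

        // Minimal example job: build a small DataFrame and display it.
        val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
        df.show()

        spark.stop()
      }
    }

Build the fat JAR by running the assembly task from the project root. By default, sbt-assembly names the JAR after your project and version and writes it to the target directory for your Scala version:

    sbt assembly
    # target/scala-2.11/my-spark-application-assembly-0.1.jar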