Configure Spark Resources

Using Spark on Saagie means running Spark jobs on Kubernetes. Here is an example of a job submission with Kubernetes:

spark-submit \
 --driver-memory 2G \
 --class <ClassName of the Spark Application to launch> \
 --conf spark.executor.memory=3G \ (1)
 --conf spark.executor.cores=4 \ (2)
 --conf spark.kubernetes.executor.limit.cores=4 \ (3)
 --conf spark.executor.instances=3 \ (4)
 {file}

Where:

1 spark.executor.memory is the amount of memory for each executor (request and limit).
2 spark.executor.cores is the number of CPU cores requested for each executor.
3 spark.kubernetes.executor.limit.cores is the CPU core limit for each executor.
4 spark.executor.instances is the number of executors for the application.

In the example above, the application would be provisioned with 3 executors, each with 4 cores and 3G of memory, for a total of 12 CPU cores and 9G of memory.

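
The totals for the example can be checked with a few lines of shell arithmetic. This is only a sketch: the variable names are illustrative, and the figures cover the executors alone, not the driver or any per-pod memory overhead that Spark may add on top of spark.executor.memory.

```shell
#!/bin/sh
# Values copied from the example submission above.
executor_instances=3    # spark.executor.instances
executor_cores=4        # spark.executor.cores
executor_memory_gb=3    # spark.executor.memory, in gigabytes

# Executor totals = per-executor value multiplied by the number of executors.
total_cores=$(( executor_instances * executor_cores ))
total_memory_gb=$(( executor_instances * executor_memory_gb ))

echo "Total executor CPU cores: ${total_cores}"
echo "Total executor memory: ${total_memory_gb}G"
```

Changing any of the three values at the top shows how the overall footprint scales before you submit the job.
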
  • CPU: A known good practice is to provision between 2 and 4 cores per executor, depending on your cluster topology. If your cluster only has nodes with 4 CPUs, Kubernetes will struggle to find a completely unoccupied node on which to schedule a 4-core executor, so you may want to limit executors to 2 cores in that case.

  • Memory: Ideally, provision at least 4 GB of memory per executor.

  • Driver Memory: Unless you are retrieving a large amount of data from the executors to the driver, you do not need to change the default configuration, as the driver’s role is simply to orchestrate the various jobs in your Spark application.
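
As an illustration of these recommendations, the sketch below assembles a submission for a cluster made up of 4-CPU nodes: 2 cores and 4G of memory per executor, with the driver memory left at its default. The class name and file name are hypothetical placeholders, the executor count of 3 is simply carried over from the earlier example, and setting the core limit equal to the core request is an assumption that mirrors that example. The script prints the command for review rather than running it.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own main class and application file.
main_class="com.example.MyApp"
app_file="my-app.jar"

# Collect the arguments as positional parameters so each one stays a separate word.
# No --driver-memory flag is passed, so the driver keeps its default memory.
set -- \
  --class "${main_class}" \
  --conf spark.executor.memory=4G \
  --conf spark.executor.cores=2 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  --conf spark.executor.instances=3 \
  "${app_file}"

# Print the command instead of executing it, so it can be reviewed first.
echo spark-submit "$@"
```

With these values the executors would request 6 CPU cores and 12G of memory in total.
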

See also

For more information on performance tuning in Spark, how to detect performance issues, and best practices for avoiding slowdowns or bottlenecks in your workflow, read the following articles: