Read and Write Files From Amazon S3 Buckets With PySpark

How to read and write files from Amazon S3 buckets with PySpark.

To interact with Amazon S3 buckets from Spark in Saagie, you must use one of the compatible Spark 3.1 AWS technology contexts available in the Saagie repository. These contexts already include the .jar files needed to connect to S3-compatible object storage.
  1. When creating your Spark job in Saagie, select one of the Spark 3.1 AWS Python contexts.

  2. Create your Spark session with the following configuration to access data stored in your Amazon S3:

    # Import your SparkSession
    from pyspark.sql import SparkSession
    
    # Create your SparkSession.
    # The endpoint only needs to be set for S3-compatible storage hosted outside AWS (see below).
    # s3_access_key and s3_secret_key hold your bucket credentials; define them
    # before building the session, for example from environment variables (see below).
    spark = SparkSession.builder \
        .appName("My Application") \
        .config("spark.hadoop.fs.s3a.endpoint", "my-s3.endpoint") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
        .config("spark.hadoop.fs.s3a.access.key", s3_access_key) \
        .config("spark.hadoop.fs.s3a.secret.key", s3_secret_key) \
        .getOrCreate()

    Where:

    • "my-s3.endpoint" does not need to be specified if your S3 bucket is hosted on AWS. This parameter is only needed when your S3 bucket is hosted by another provider, such as OVH. In that case, specify the full endpoint hostname, for example https://s3.gra.perf.cloud.ovh.net.

    • We recommend that you store your S3 bucket credentials, the access key and secret key, in environment variables rather than hard-coding them in your job, as shown in the sketch after this list.

    • Whenever you interact with your S3 bucket, be sure to use the S3A protocol as configured in the Spark session above.
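
    Here is a minimal sketch of that recommendation. The environment variable names S3_ACCESS_KEY and S3_SECRET_KEY are only examples; use whatever names you defined for your job:

    import os

    # Read the bucket credentials from environment variables instead of
    # hard-coding them in the job. The variable names below are examples only.
    s3_access_key = os.environ["S3_ACCESS_KEY"]
    s3_secret_key = os.environ["S3_SECRET_KEY"]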

  3. You can now read and write files from your Amazon S3 bucket by running the following lines of code:

    • Read Files

      spark.read.parquet("s3a://path/to/my/file.parquet")

    • Write Files

      df.write.parquet("s3a://path/to/my/file.parquet")
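
    The same DataFrameReader and DataFrameWriter options work on s3a:// paths as on any other filesystem. The snippet below is only a sketch; the CSV path and the "country" partition column are placeholders:

    # Read a CSV file with a header row, letting Spark infer the schema
    df = spark.read.option("header", True).option("inferSchema", True) \
        .csv("s3a://path/to/my/file.csv")

    # Overwrite any existing output and partition it by a column, here "country"
    df.write.mode("overwrite").partitionBy("country") \
        .parquet("s3a://path/to/my/output/")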

    Performance tuning

    Cloud object stores are not real filesystems, which affects performance; the Spark documentation on integration with cloud infrastructures explains the consequences in detail. Also read its configuration recommendations for your cloud provider.
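
    As a rough illustration, the Hadoop S3A connector exposes options that control things like connection pooling, transfer threads, and multipart upload size. The options below exist in recent Hadoop versions, but the values are arbitrary examples, not recommendations; check the Spark cloud-integration guide and the Hadoop S3A documentation for the settings that apply to your versions and workload:

    from pyspark.sql import SparkSession

    # Illustrative S3A tuning options added to the session builder from step 2.
    # The values are examples only and should be adjusted to your workload.
    spark = SparkSession.builder \
        .appName("My Application") \
        .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
        .config("spark.hadoop.fs.s3a.threads.max", "64") \
        .config("spark.hadoop.fs.s3a.multipart.size", "104857600") \
        .getOrCreate()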