Read and Write Files From Amazon S3 Bucket With PySpark

How to read and write files from Amazon S3 Bucket with PySpark.

To interact with an Amazon S3 Bucket from Spark, you must use a compatible version of Spark, such as Spark 3.1 AWS. This version already ships with the .jar files required to connect to an S3-compatible object storage.
  1. Use Spark 3.1 AWS.

  2. Create your Spark session with the following lines of code:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .appName("My Application") \
        .config("spark.hadoop.fs.s3a.endpoint", "my-s3.endpoint") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
        .config("spark.hadoop.fs.s3a.access.key", s3_access_key) \
        .config("spark.hadoop.fs.s3a.secret.key", s3_secret_key) \
        .getOrCreate()
    • Do not specify the endpoint configuration if your S3 Bucket is hosted on AWS. This parameter is useful when your S3 Bucket is hosted by another provider, such as OVH. In that case, you must specify the full hostname of your provider's S3 endpoint.

    • We recommend storing your S3 Bucket credentials, namely access_key and secret_key, in environment variables.

    • Whenever you interact (read or write) with Amazon S3 Bucket, you must use the S3A protocol, as configured in the Spark session above.
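Following the recommendation above, here is a minimal sketch of reading the credentials from environment variables instead of hard-coding them. The variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumptions; adapt them to your deployment:

```python
import os

def s3_credentials(env=os.environ):
    """Return (access_key, secret_key) read from the environment.

    The environment variable names below are assumptions, not a fixed
    convention of this guide; rename them to match your setup.
    """
    return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
```

The returned values can then be passed to the `.config("spark.hadoop.fs.s3a.access.key", ...)` and `.config("spark.hadoop.fs.s3a.secret.key", ...)` calls of the Spark session builder shown above.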

  3. You can now read and write files from Amazon S3 Bucket by running the following lines of code:

    • Read Files

    df = spark.read.parquet("s3a://path/to/my/file.parquet")

    • Write Files

    df.write.parquet("s3a://path/to/my/file.parquet")
Performance tuning

    Cloud Object Stores are not real filesystems, which has consequences for performance; the Spark documentation is explicit about this.
    Also, make sure to read the recommendations on the best configuration for your cloud provider.
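As a starting point, here is a sketch of a couple of common S3A tuning options, expressed as a Python dict that could be fed into the session builder's `.config()` calls. The values are illustrative assumptions, not recommendations; always check the Hadoop S3A documentation and your provider's guidance:

```python
# Illustrative S3A tuning knobs (values are example assumptions, not advice):
s3a_tuning = {
    # maximum number of simultaneous HTTP connections to the object store
    "spark.hadoop.fs.s3a.connection.maximum": "96",
    # size of the thread pool used for parallel uploads and copies
    "spark.hadoop.fs.s3a.threads.max": "64",
}

# These could be applied at session build time, for example:
# builder = SparkSession.builder
# for key, value in s3a_tuning.items():
#     builder = builder.config(key, value)
```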