Read and Write Files From Amazon S3 Buckets With PySpark
To interact with Amazon S3 buckets from Spark in Saagie, you must use one of the compatible Spark 3.1 AWS technology contexts available in the Saagie repository. These contexts already include the .jar files needed to connect to S3-compatible object storage.
- When creating your Spark job in Saagie, use one of its 3.1 AWS Python contexts.
- Create your Spark session with the following configuration to access data stored in your Amazon S3 bucket:
# Import your SparkSession
from pyspark.sql import SparkSession

# Create your SparkSession
spark = SparkSession.builder \
    .appName("My Application") \
    .config("spark.hadoop.fs.s3a.endpoint", "my-s3.endpoint") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .config("spark.hadoop.fs.s3a.access.key", s3_access_key) \
    .config("spark.hadoop.fs.s3a.secret.key", s3_secret_key) \
    .getOrCreate()
Where:
- "my-s3.endpoint" does not need to be specified if your S3 bucket is hosted on AWS. This parameter is useful when your S3 bucket is hosted by another provider, such as OVH. In that case, you must specify the full hostname, that is, https://s3.gra.perf.cloud.ovh.net.
- We recommend that you store your S3 bucket credentials, access_key and secret_key, in environment variables and read them at runtime, as shown in the sketch after this list.
- Whenever you interact with your S3 bucket, be sure to use the S3A protocol (s3a:// URIs), as configured in the Spark session above.
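Here is a minimal sketch of reading those credentials from environment variables before building the session. The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are only illustrative; use whatever names you defined for your Saagie environment variables.

import os

# Hypothetical environment variable names; adapt them to the
# environment variables defined on your Saagie project or platform.
s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]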
You can now read and write files from your Amazon S3 bucket by running the following lines of code:
df = spark.read.parquet("s3a://path/to/my/file.parquet")
df.write.parquet("s3a://path/to/my/file.parquet")
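For instance, a minimal round trip might look like the following sketch; the bucket name and paths are placeholders for illustration only.

# Read a CSV file from the bucket (bucket and paths are illustrative).
df = spark.read.option("header", "true").csv("s3a://my-bucket/input/data.csv")

# Write it back to the bucket as Parquet, overwriting any previous output.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/data.parquet")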
Performance tuning
Cloud object stores are not real filesystems, which has consequences for performance; the Spark documentation is clear on this point.
Also, make sure to read the recommendations on the best configuration for your cloud provider.
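As an illustration, committer settings like the following are described in the Spark cloud-integration documentation for S3A. Whether they apply to your job depends on the jars shipped with your chosen context (they rely on the spark-hadoop-cloud module), so treat this as an assumption to verify rather than a required configuration.

# Illustrative S3A committer settings from the Spark cloud-integration docs.
# They require the spark-hadoop-cloud classes; verify they are present in your context.
spark = SparkSession.builder \
    .appName("My Application") \
    .config("spark.hadoop.fs.s3a.committer.name", "directory") \
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol") \
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter") \
    .getOrCreate()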