Configuring Spark to Access S3 Cloud Storage
You can use Spark to access CSV and other file formats directly in Amazon S3 cloud storage. To set up access to S3 using the s3a:// protocol, follow these steps:
1. Download the JAR files required for S3 cloud storage (one way to obtain them is sketched below):
• hadoop-aws-2.7.1.jar
• aws-java-sdk-1.7.4.jar
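Both JARs are published on Maven Central, so one way to fetch them is with wget. The target directory /opt/spark-jars is only an example; use any path readable by the Vector instance owner:
wget -P /opt/spark-jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.1/hadoop-aws-2.7.1.jar
wget -P /opt/spark-jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar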
2. Set the II_SPARKJARS environment variable to the path where the S3A JARs reside.
Either issue the command ingsetenv II_SPARKJARS "/path/to/jars" or add the line export II_SPARKJARS=/path/to/jars to the .ingXXsh environment file (see the sketch after step 3).
3. Source .ingXXsh (where XX is the Vector instance ID).
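For example, assuming the JARs reside in /opt/spark-jars (a hypothetical path) and the instance ID is VW, steps 2 and 3 might look like this:
# Add this line to ~/.ingVWsh:
export II_SPARKJARS=/opt/spark-jars
# Then source the file so the variable takes effect in the current shell:
. ~/.ingVWsh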
4. Check that II_SPARKJARS has been updated:
echo $II_SPARKJARS
5. Run the following command to configure access to S3 cloud storage:
iisuspark -s3a
Lines like the following are added to $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf:
spark.jars=hadoop-aws-2.7.1.jar,aws-java-sdk-1.7.4.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=obtain the "ACCESS KEY" for your AWS account and place here
spark.hadoop.fs.s3a.secret.key=obtain the "SECRET KEY" for your AWS account and place here
6. Obtain the “ACCESS KEY” and “SECRET KEY” for your AWS account and use them to set the following parameters in spark_provider.conf:
spark.hadoop.fs.s3a.access.key=MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=MY_SECRET_KEY
7. Restart the Spark-Vector Provider (one way is sketched after this list) and verify that the new JAR files have been found. If the JARs are not found, errors like the following appear:
ERROR SparkContext: Jar not found at file:/some/path/hadoop-aws-2.7.1.jar
ERROR SparkContext: Jar not found at file:/some/path/aws-java-sdk-1.7.4.jar
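A minimal sketch of one way to restart, assuming the Spark-Vector Provider is stopped and started along with the Vector instance (your installation may provide a more targeted mechanism):
# Stop and restart the whole instance, which also restarts the provider:
ingstop
ingstart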
After Spark is configured, you can connect to a database and create an external table using an s3a:// URL:
CREATE EXTERNAL TABLE my3tab (col1 INT, col2 VARCHAR(20))
USING SPARK WITH REFERENCE ('s3a://mys3bucket/path/to/data.csv');
The connection is made when the table is referenced in a query.
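For example, the first query against the table triggers the connection and reads the file from S3 through Spark (the filter here is only illustrative):
SELECT col1, col2 FROM my3tab WHERE col1 > 100;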