User Guide : B. Setting Up Spark for Use with Vector : How to Access S3 Cloud Storage
 
How to Access S3 Cloud Storage
You can use Spark to access CSV and other file formats directly in Amazon’s S3 cloud storage. The setup process for accessing S3 using the s3a:// protocol is as follows:
1. Download aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.1.jar and add them to the CLASSPATH.
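Step 1 can be sketched as follows. The jar directory is a hypothetical choice (any directory visible to the Spark-Vector Provider works), and the Maven Central URLs in the comments assume the standard repository layout:

```shell
# Hypothetical layout: place the jars under $II_SYSTEM/ingres/lib.
# Both jars are published on Maven Central, e.g.:
#   wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.1/hadoop-aws-2.7.1.jar
#   wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
JAR_DIR=${II_SYSTEM:-/opt/Actian/VectorVH}/ingres/lib   # fallback path is illustrative
# Prepend both jars to CLASSPATH, preserving any existing value.
export CLASSPATH=$JAR_DIR/hadoop-aws-2.7.1.jar:$JAR_DIR/aws-java-sdk-1.7.4.jar${CLASSPATH:+:$CLASSPATH}
echo "$CLASSPATH"
```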
2. Edit $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf and add the following:
spark.jars=hadoop-aws-2.7.1.jar,aws-java-sdk-1.7.4.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
3. Obtain the access key ID and secret access key for your AWS account and use them to set the following parameters in spark_provider.conf:
spark.hadoop.fs.s3a.access.key=MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=MY_SECRET_KEY
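Taken together, steps 2 and 3 leave spark_provider.conf with entries like the following (MY_ACCESS_KEY and MY_SECRET_KEY are placeholders for your real credentials; the jar names assume the versions downloaded in step 1):

```
spark.jars=hadoop-aws-2.7.1.jar,aws-java-sdk-1.7.4.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=MY_SECRET_KEY
```

Because the file now contains AWS credentials, keep it readable only by the installation owner.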
4. Restart the Spark-Vector Provider and verify that the new jars are found. If they are not, errors like the following appear:
ERROR SparkContext: Jar not found at file:/some/path/hadoop-aws-2.7.1.jar
ERROR SparkContext: Jar not found at file:/some/path/aws-java-sdk-1.7.4.jar
5. Connect to a database and create the external table using an s3a:// URL:
CREATE EXTERNAL TABLE my3tab (col1 INT, col2 VARCHAR(20))
USING SPARK WITH REFERENCE ('s3a://mys3bucket/path/to/data.csv');
Note: The connection to S3 is made when the external table is first referenced in a query.
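As a quick check, the first query against the table defined above forces the S3 connection and reads the file (table and column names are the illustrative ones from the example; adjust them to your data):

```sql
-- The first reference to my3tab opens the s3a:// connection
-- and reads s3a://mys3bucket/path/to/data.csv.
SELECT col1, col2
FROM my3tab;
```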