How to Securely Manage S3 Credentials
AWS credentials grant access to data on S3, so it is important to keep them safe. Referencing the credentials in the target URI from the command line (when using vwload, for example) or in environment variables can leave them easily accessible in logs, command histories, and configuration files.
The Hadoop credential provider framework keeps AWS credentials out of Hadoop configuration files by storing them in encrypted “credential providers” on local or Hadoop file systems. The following steps describe how to use the framework to protect AWS credentials when accessing S3 through Spark or Hadoop.
1. Create a credential file to store the credentials. For example:
hadoop credential create fs.s3a.access.key -value MY_ACCESS_KEY -provider jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks
hadoop credential create fs.s3a.secret.key -value MY_SECRET_KEY -provider jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks
Notes:
The credential file can exist on local or HDFS storage. To store it on HDFS, replace ‘file’ in the provider URI with ‘hdfs’ (or with the scheme of the storage class being used). If the credential file is on local storage, it must be present in the same location on all nodes.
The credential store is password protected. The default password will be used unless another is specified by setting HADOOP_CREDSTORE_PASSWORD in the environment prior to running the above commands. The password can also be stored in a file pointed to by hadoop.security.credstore.java-keystore-provider.password-file in the Hadoop configuration. For details, see the Credential Provider API Guide.
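For example, a custom store password can be set in the environment before the credentials are created (MY_STORE_PASSWORD below is a placeholder, not a suggested value):
export HADOOP_CREDSTORE_PASSWORD=MY_STORE_PASSWORD
hadoop credential create fs.s3a.access.key -value MY_ACCESS_KEY -provider jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks
The same password must then be available, through the environment variable or the password file, to every process that later reads the store.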
2. Verify the contents of the credential store (the listing shows key names only; values are never displayed):
hadoop credential list -provider jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks
fs.s3a.secret.key
fs.s3a.access.key
3. Update the Hadoop configuration (core-site.xml) to define where S3 credentials are stored.
<property>
  <name>fs.s3a.security.credential.provider.path</name>
  <value>jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks</value>
  <description>
    Optional comma-separated list of credential providers, a list
    which is prepended to that set in hadoop.security.credential.provider.path
  </description>
</property>
Note: Configuration files should be updated on all nodes. Using a cluster manager such as Ambari to make these changes is highly recommended.
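As an alternative to editing core-site.xml, the provider path can be passed to a single command with the Hadoop generic -D option. A minimal sketch, assuming the credential file location used above:
hdfs dfs -D fs.s3a.security.credential.provider.path=jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks -ls s3a://actian-ontime/
This avoids a cluster-wide configuration change, but the option must be repeated on every invocation.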
If using a publicly accessible bucket that does not require access keys, set the following in core-site.xml instead:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
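If the Hadoop version in use supports per-bucket S3A settings (Hadoop 2.8 and later), anonymous access can also be limited to a single bucket so that the credential provider path stays in force for all other buckets. A sketch, where my-public-bucket is a placeholder bucket name:
<property>
  <name>fs.s3a.bucket.my-public-bucket.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>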
4. Verify that the S3 bucket can be accessed using the HDFS client:
hdfs dfs -ls s3a://actian-ontime/
Found 8 items
-rw-rw-rw-   1 actian actian   181080083 2018-02-14 16:50 s3a://actian-ontime/On_Time_On_Time_Performance_1988_1.csv
-rw-rw-rw-   1 actian actian 18452125824 2017-11-28 12:04 s3a://actian-ontime/On_Time_On_Time_Performance_Part1
-rw-rw-rw-   1 actian actian 19494962213 2017-11-29 23:54 s3a://actian-ontime/On_Time_On_Time_Performance_Part2
-rw-rw-rw-   1 actian actian 19725641334 2017-11-29 19:53 s3a://actian-ontime/On_Time_On_Time_Performance_Part3
-rw-rw-rw-   1 actian actian 19773821142 2017-11-28 13:51 s3a://actian-ontime/On_Time_On_Time_Performance_Part4
-rw-rw-rw-   1 actian actian 18452127455 2018-02-15 12:58 s3a://actian-ontime/On_Time_Performance_Part1_WithHeader.csv
-rw-rw-rw-   1 actian actian          28 2018-02-14 15:25 s3a://actian-ontime/WithHeader.csv
-rw-rw-rw-   1 actian actian          19 2018-02-14 15:25 s3a://actian-ontime/WithNoHeader.csv
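Because Spark picks up the Hadoop configuration, Spark jobs use the same credential store automatically once core-site.xml is updated. The provider path can also be passed explicitly when launching Spark; a minimal sketch, assuming the credential file location used above:
spark-shell --conf spark.hadoop.fs.s3a.security.credential.provider.path=jceks://file/opt/Actian/VectorVH/ingres/files/hdfs/s3.jceks
Reading any of the s3a:// paths listed above from within the shell then confirms that Spark can access the bucket without credentials appearing on the command line or in logs.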