Adding an Amazon S3 Connection
To establish a connection with Amazon S3, a user with sufficient permissions is required, and network traffic from the Zeenea scanner toward Amazon S3 must be open.
There are two ways to authenticate the Zeenea Agent on S3:
  • Either the Agent's host server has sufficient credentials to connect to S3 (recommended).
    In this case, available authentication methods are:
    • Instance Role
    • Environment Variable
    • Configuration File
  • Or the API keys are managed by Zeenea. In this case, the keys need to be entered in the connection configuration file.
Required network flow:
  • Target: AWS S3
  • Protocol: HTTPS
  • Usual Port: 443
Note: You can find a link to the configuration template in Zeenea Connector Downloads.
Supported Versions
The S3 connector was successfully tested against the Amazon S3 online service.
Installing the Plugin
From version 54 of the scanner, the S3 connector is delivered as a plugin and can be downloaded here: Zeenea Connector Downloads.
For more information about how to install a plugin, see Installing and Configuring Connectors as a Plugin.
Declaring the Connection
Connections are declared and configured through a dedicated configuration file located in the /connections folder of the relevant scanner.
To establish a connection with Amazon S3, specify the following parameters in this file (a sketch of such a file is shown after the list):
  • name: The name that will be displayed to catalog users for this connection.
  • code: Unique identifier of the connection on the Zeenea platform. Once registered on the platform, this code must not be modified, otherwise the connection will be considered new and the old one removed from the scanner.
  • connector_id: The type of connector to be used for the connection. Here, the value must be AmazonS3 and must not be modified.
  • enabled: A boolean value to enable or disable the connection.
  • catalog_code: The code of the catalog associated with the connection (the default catalog is used when empty).
  • alias: The list of aliases used by other connectors to generate lineage links.
  • secret_manager.enabled: Enables the use of a secret manager for this connection. This works only with Scanner 73 or later and requires a functional secret manager configured in the scanner configuration file.
  • secret_manager.key: The name of the secret.
  • connection.aws.access_key_id: AWS S3 access key identifier.
  • connection.aws.secret_access_key: AWS S3 secret access key.
  • connection.aws.region: AWS region.
  • s3.bucket_list: The list of buckets to be explored. The separator is a space (" "). If this setting is left empty, the connector explores all buckets accessible by the user.
  • connection.url: Include this setting if you wish to use an S3 instance other than Amazon's.
  • proxy.endpoint: Proxy endpoint.
  • proxy.username: Proxy username.
  • proxy.password: Proxy account password.
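For illustration, a minimal connection file could look like the sketch below, assuming the usual key = value syntax of scanner connection files. All values (name, code, region, access keys, bucket names) are placeholders; check the configuration template available in Zeenea Connector Downloads for the exact layout expected by your scanner version.

    name = "Amazon S3 - Production"
    code = "amazon-s3-prod"
    connector_id = "AmazonS3"
    enabled = true

    # Only needed when the API keys are managed by Zeenea. Omit these three lines
    # if the Agent's host server already has credentials (instance role,
    # environment variables, or configuration file).
    connection.aws.access_key_id = "<ACCESS_KEY_ID>"
    connection.aws.secret_access_key = "<SECRET_ACCESS_KEY>"
    connection.aws.region = "eu-west-1"

    # Space-separated list of buckets; leave empty to explore all buckets accessible by the user.
    s3.bucket_list = "zeenea-test-data zeenea-archive"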
Optional parameters for customizing dataset detection (an illustrative combination is sketched after this list):
  • inventory.strategy: Determines the algorithm used to discover datasets. It has two possible values: standard or legacy (default).
    • legacy: the algorithm that was used before version 59.
    • standard: can manage cases where multiple datasets are in the same folder.
  • inventory.file_partition_pattern: Regex used within the standard inventory strategy to define the variable part of a file name.
  • inventory.partition: Regex to identify partition folders.
  • inventory.skippedDirectory: Regex on the names of the folders to skip while still taking their content into account. The content is scanned as if it were at the root of the parent folder. No folder is skipped by default.
  • inventory.ignoredDirectory: Regex on the names of the folders to ignore; their content is also ignored. No folder is ignored by default.
  • inventory.ignoredFile: Regex on the names of the files to ignore. Default value: \..* | _.* | .*\.crc
  • inventory.extension.csv: For CSV file detection. Default value: csv, tsv, csv.gz, tsv.gz, csv.zip, tsv.zip
  • inventory.extension.parquet: For Parquet file detection. Default value: parquet
  • inventory.extension.avro: For Avro file detection. Default value: avro
  • inventory.extension.orc: For ORC file detection. Default value: orc
  • inventory.extension.xml: For XML file detection. Default value: xml, xml.gz, xml.zip
  • inventory.extension.json: For JSON file detection. Default value: json, json.gz, json.zip
  • inventory.csv.header: Configures header detection for CSV files. Select always to force the schema to be recognized on the first line of CSV files. Possible values are: never, default, always, and only string. The default value is default.
  • cache.path: When inventoried buckets are very large and contain a lot of objects, the connector can consume a large amount of memory. It is possible to cache the object list on disk to reduce memory consumption. To enable the disk cache, set this parameter to a file path. The path can be absolute or relative; relative paths are resolved against the current directory at scanner launch time.
    Examples:
    "/opt/zeenea-scanner/cache/s3.cache"
    "/var/lib/zeenea-scanner/s3.cache"
    """C:\zeenea-scanner\cache\s3.cache"""
    NOTE: When dealing with multiple S3 connections, make sure to configure a different cache file for each connection in order to avoid conflicts.
  • xml.namespace_identification: Configures how XML fields are identified. Use the value uri, except to keep compatibility with a scanner prior to version 43, in which case the value legacy (default value) must be used.
  • xml.fields_ordering: Starting from version 67. Allows ordering the list of retrieved fields. Possible values are:
    • alphabetical: fields are ordered alphabetically.
    • "", legacy, or unordered: fields are ordered as they are read.
  • filter: Filters datasets during the inventory. See Rich Filters.
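As a further sketch with placeholder values only, the optional settings can be combined in the same connection file; the regular expression and the CSV header choice below are illustrative, not recommended defaults.

    # Discover datasets with the standard strategy (handles several datasets in one folder)
    inventory.strategy = "standard"
    # Placeholder regex: ignore any folder whose name starts with "tmp", content included
    inventory.ignoredDirectory = "tmp.*"
    # Force the first line of CSV files to be read as the schema
    inventory.csv.header = "always"
    # Cache the object list on disk for very large buckets
    cache.path = "/opt/zeenea-scanner/cache/s3.cache"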
User Permissions
In order to collect metadata, the user must be able to list and read all datasets that need to be cataloged.
Roles
The user must be able to run the following actions on the target bucket and the objects it contains:
  • s3:GetObject
  • s3:ListBucket
  • s3:GetBucketLocation
The user must also be able to run the s3:ListAllMyBuckets action on all resources.
In the following example, the user is cataloging the zeenea-test-data bucket. The policy can be set directly from the Amazon S3 interface, or declared in JSON:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:GetBucketLocation"
          ],
          "Resource": [
            "arn:aws:s3:::zeenea-test-data",
            "arn:aws:s3:::zeenea-test-data/*"
          ]
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": [
            "s3:ListAllMyBuckets"
          ],
          "Resource": "*"
        }
      ]
    }
Rich Filters
Since version 54 of the scanner, the S3 connector supports rich filters in its configuration. The available keys to filter objects during the inventory are:
  • bucket: name of the bucket
  • key: name of the object
Read more: Filters
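As a sketch only, a filter combining both keys might look like the line below. The bucket name and key prefix are placeholders, and the exact expression syntax (operators, quoting) is defined in the Filters documentation, so check it there before use.

    filter = "bucket = 'zeenea-test-data' and key starts with 'sales/'"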
Data Extraction
A dataset from a file system may actually be stored in multiple files. Those files can themselves be stored in different folders (partitions).
When running the inventory, the Amazon S3 connector detects datasets according to the usual dataset naming conventions in Amazon S3.
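For example, assuming a common partitioned layout such as the one below (illustrative bucket and paths), the connector would typically expose a single dataset whose data is split across several files and partition folders:

    s3://zeenea-test-data/sales/orders/
        date=2024-01-01/part-00000.parquet
        date=2024-01-01/part-00001.parquet
        date=2024-01-02/part-00000.parquet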
Collected Metadata
Inventory
The inventory collects all datasets accessible by the user through the method described above.
Datasets
A dataset is defined by the Dataset Definition Algorithm.
  • Name
  • Source Description: Not supported, NULL by default
  • Technical Data:
    • Bucket: Data Source bucket
    • File Format: Source File Format
    • File Location: Source File Location
Field
Dataset field.
  • Name
  • Source Description: Not supported, NULL by default.
  • Type
  • Can be null: Depending on the field settings (ORC and Parquet). Default value: No.
  • Multivalued: Depending on the field settings (ORC and Parquet). Default value: No.
  • Primary Key: Not supported. Default value: No.
  • Technical Data:
    • Technical Name: field technical name
    • Native type: field native type
Object Identification Keys
An identification key is associated with each object in the catalog. When an object is created by a connector, the key is built by the connector.
More information about how it works can be found here: Identification Keys.
  • Dataset: code/path/dataset name
    • code: Unique identifier of the connection noted in the configuration file
    • path: Full path, including the bucket
    • dataset name
  • Field: code/path/dataset name/field name
    • code: Unique identifier of the connection noted in the configuration file
    • path: Full path, including the bucket
    • dataset name
    • field name
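As an illustration with made-up names, for a connection whose code is amazon-s3-prod and a dataset named orders stored under the sales prefix of the zeenea-test-data bucket, the keys might look like:

    amazon-s3-prod/zeenea-test-data/sales/orders
    amazon-s3-prod/zeenea-test-data/sales/orders/order_id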
Last modified date: 11/28/2025