CREATE EXTERNAL TABLE
Valid in: SQL, ESQL, OpenAPI, ODBC, JDBC, .NET
The CREATE EXTERNAL TABLE statement maps the structure of a data file created outside of Vector to the structure of a Vector table. The data can then be queried from its original location.
This statement has the following format:
CREATE EXTERNAL TABLE [IF NOT EXISTS] [schema.]table_name
(column_name data_type {,column_name data_type})
USING provider_name
WITH REFERENCE='reference'
[,FORMAT='format']
[,EXTERNAL_ROLE='external_role']
[,OPTIONS=('key'='value' {,'key'='value'})]
IF NOT EXISTS
Creates the table if it does not exist and returns without error if the table already exists.
Note: Use this option with caution. The table definition of a pre-existing table may differ from that of the CREATE EXTERNAL TABLE IF NOT EXISTS statement.
table_name
Defines the name of the external table to be created.
column_name data_type
Specifies the name and data type of each column. Column specifications must be separated by commas.
USING provider_name
Specifies the name of the provider. The only valid provider is SPARK.
Provider Name Application
SPARK Spark
WITH REFERENCE='reference'
(Required) Specifies the reference to the external datasource. This corresponds to the parameter passed to the load method of DataFrameReader or the save method of DataFrameWriter. For example, in the case of a database system, the reference string matches the source table name.
Depending on the FORMAT, the REFERENCE can now contain multiple SparkSQL statements (separated by ';') or Scala script code. Only temporary schema elements, such as temporary views, tables, or functions, can be created within the SparkSQL or Scala snippets; persistent Spark catalog items are not allowed. The schema of the output of the last SparkSQL or Scala statement must match the schema of the external table definition. The user is not allowed to create a new Spark session.
A pre-allocated Spark session exists and can be used through the "spark" parameter.
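For example, the following sketch maps a CSV file stored in HDFS to a two-column external table; the path, table name, and column names are hypothetical:
CREATE EXTERNAL TABLE ext_employees (
    emp_id INTEGER,
    emp_name VARCHAR(40)
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/employees.csv',
     FORMAT='csv';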
FORMAT='format'
(Optional) Is a WITH clause option that specifies the format of the external data.
Example formats include csv, avro, parquet, hive, orc, json, jdbc, vector. In some cases, format corresponds to the class name that defines that external datasource.
Note: The appropriate driver package must be installed.
This corresponds to the parameter passed to the format method of DataFrameReader or DataFrameWriter.
When FORMAT is not specified, the Spark-Vector Provider tries to recognize the format of files by looking at the file extension.
Enabling "Advanced Staging Tables" requires setting the format to either 'Scala' or 'SparkSQL', depending on the type of code snippet provided as the REFERENCE string. An example of the SparkSQL variant is sketched below.
EXTERNAL_ROLE='external_role'
Note: This option is not implemented.
OPTIONS
(Optional) Is a WITH clause option that specifies user-defined options for the datasource read from or written to. This corresponds to the options method of the DataFrameReader/Writer. For example, for CSV files you can pass any options supported by spark-csv. For additional options supported by various external datasources, see the DataFrameReader documentation at
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html.
Useful options are:
'HEADER'='TRUE' | 'FALSE'
Indicates whether the data file contains a header row.
Note: When referencing a CSV file, a header is included by default.
'DELIMITER'='character'
Indicates the character used in the data file as the field delimiter. Example: 'delimiter'='|'
'NULLVALUE'='nullvalue'
Defines the string that identifies NULL values.
Default: ' '
'SCHEMA'='schema'
Specifies the table column definitions of the source using SparkSQL types. We recommend specifying this if the source file being loaded does not contain a header row. If not specified, Spark tries to infer the schema, which can degrade performance.
Note: The inferSchema=true option is ignored to avoid implicit casting of data types while parsing (CSV files, for example), since it can lead to unexpected results. Hence, if you do not specify a schema under OPTIONS for the source file, the schema of the external table will be used for file parsing. If you require casting of data types from the source file to the external table schema, use one of the following options:
1. Specify the source schema under OPTIONS (be aware that this implies implicit type casting to the external table schema by Spark).
2. For Spark Provider 3, use the staging table to explicitly cast data types.
3. Cast the data types in Vector.
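For instance, the following sketch reads a headerless, pipe-delimited file with an explicit source schema and a custom NULL marker; the path, table name, and columns are hypothetical:
CREATE EXTERNAL TABLE ext_sales (
    region VARCHAR(20),
    amount FLOAT8
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/sales.dat',
     FORMAT='csv',
     OPTIONS=('HEADER'='FALSE',
              'DELIMITER'='|',
              'NULLVALUE'='\N',
              'SCHEMA'='region STRING, amount DOUBLE');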
'STAGING'='SparkSQL query string'
Specifies a SparkSQL query (https://spark.apache.org/docs/latest/sql-ref.html) that creates an intermediate table in Spark for cleaning data or mapping schema on top of the source data before the data is transferred into Vector. It has the form of SELECT...FROM THIS_TABLE, where THIS_TABLE is a designated keyword. The schema of the staging table must exactly match the schema of the external table. For details, see the staging table examples in CREATE EXTERNAL TABLE Examples.
Note: This feature is limited to Spark-Vector Provider version 3, which supports Spark 3.1.1.
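A sketch of the STAGING option, again with hypothetical names; the staging query selects from the designated THIS_TABLE keyword and must produce the same schema as the external table:
CREATE EXTERNAL TABLE ext_visits (
    visit_date ANSIDATE,
    visitors INTEGER
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/visits.parquet',
     FORMAT='parquet',
     OPTIONS=('STAGING'='SELECT CAST(visit_date AS DATE) AS visit_date, CAST(visitors AS INT) AS visitors FROM THIS_TABLE');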
'FILTER'='SparkSQL filter predicates string'
Allows for specifying filter predicates on top of the data source in SparkSQL syntax (https://spark.apache.org/docs/latest/sql-ref.html). These filters are evaluated by Spark. Spark itself determines which filters can be pushed down to the data source, depending on its implementation. For details, see how filter predicates are used in CREATE EXTERNAL TABLE Examples.
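As an illustration, the following sketch (with hypothetical names) restricts the source to one year of data; the single quotes around the date literals are doubled inside the OPTIONS string:
CREATE EXTERNAL TABLE ext_orders_2023 (
    order_id INTEGER,
    order_date ANSIDATE,
    amount FLOAT8
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/orders.parquet',
     FORMAT='parquet',
     OPTIONS=('FILTER'='order_date >= ''2023-01-01'' AND order_date < ''2024-01-01''');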