CREATE EXTERNAL TABLE
Valid in: SQL, ESQL, OpenAPI, ODBC, JDBC, .NET
The CREATE EXTERNAL TABLE statement maps the structure of a data file created outside of Vector to the structure of a Vector table. The data can then be queried in its original location.
This statement has the following format:
CREATE EXTERNAL TABLE [IF NOT EXISTS] [schema.]table_name
(column_name data_type {,column_name data_type})
USING provider_name
WITH REFERENCE='reference'
[,FORMAT='format']
[,EXTERNAL_ROLE='external_role']
[,OPTIONS=('key'='value' {,'key'='value'})]
 
IF NOT EXISTS
Creates the table if it does not exist and returns without error if the table already exists.
Note:  Use this option with caution. The table definition of a preexisting table may differ from that of the CREATE EXTERNAL TABLE IF NOT EXISTS statement.
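For example, the following statement creates the external table only if it does not already exist (the table name, path, and format here are illustrative):
CREATE EXTERNAL TABLE IF NOT EXISTS ext_sales (id INT4 NOT NULL, amount FLOAT8)
USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/sales.parquet', FORMAT='parquet';
If a table named ext_sales already exists, the statement returns without error, but the existing definition is not checked against the one given here.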
table_name
Defines the name of the external table to be created.
column_name data_type
Specifies the column name and data type of each column. Each column specification must be separated with a comma.
USING provider_name
Specifies the name of the provider. The only valid provider is SPARK.
Provider Name   Application
SPARK               Spark
WITH REFERENCE='reference'
(Required) Specifies the reference to the external datasource. This corresponds to the parameter passed to the load method of DataFrameReader or the save method of DataFrameWriter. It is typically a fully qualified path or identifier defined by the target. For example:
For an HDFS file: hdfs://namenode:port/path/to/file
For database systems: the table name
FORMAT='format'
(Optional) Is a WITH clause option that specifies the format of the external data.
Example formats include: csv, avro, parquet, hive, orc, json, jdbc. For other datasources, format corresponds to the class name that defines that external datasource. For example, for Redshift it would be com.databricks.spark.redshift. This corresponds to the parameter passed to the format method of DataFrameReader/Writer.
When FORMAT is not specified, the Spark-Vector Provider tries to recognize the format for HDFS files by looking at the file extension. For example, hdfs://namenode:port/tmp/my_file.orc is treated as an ORC file.
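For example, if a Parquet file is stored without a recognizable extension, the format can be stated explicitly (the table name and path are illustrative):
CREATE EXTERNAL TABLE ext_events (event_id INT8 NOT NULL, payload VARCHAR(200))
USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/events_2023', FORMAT='parquet';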
EXTERNAL_ROLE='external_role'
Note:  This option is not implemented.
OPTIONS
(Optional) Is a WITH clause option that specifies user-defined options for the datasource being read from or written to. The options are not Vector-specific; they are defined by Spark and correspond to the options method of the DataFrameReader/Writer. For example, for CSV files you can pass any options supported by spark-csv. For additional options supported by various external datasources, see the DataFrameReader documentation at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html.
Useful options are listed below; a combined example follows the list:
'HEADER'='TRUE' | 'FALSE'
Indicates whether the data file contains a header row.
Note:  When referencing a CSV file, a header is included by default.
'DELIMITER'='character'
Indicates the character used in the data file as the field delimiter. Example: 'DELIMITER'='|'
'SCHEMA'='schema'
Specifies the table column definitions, which are required if the data file being loaded does not contain a header row.
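For example, the following sketch combines these options for a headerless CSV file (the table name, path, and column names are illustrative); because there is no header row, the SCHEMA option supplies the column definitions:
CREATE EXTERNAL TABLE ext_people (id INT4 NOT NULL, name VARCHAR(40) NOT NULL)
USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/people.csv', FORMAT='csv',
OPTIONS=('HEADER'='FALSE', 'DELIMITER'='|', 'SCHEMA'='id INT NOT NULL, name STRING NOT NULL');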
Predicate Pushdown
If a SQL query is issued in Vector against an EXTERNAL TABLE that uses Spark, predicates from the WHERE clause may be evaluated by Spark before the data is transferred, reducing the number of tuples sent to Vector. Currently, however, this works only under the following circumstances:
It is implemented for all Vector column data types except “time with time zone” and “timestamp with time zone.”
Supported predicates are: <, <=, =, >=, and >. Using “like” is currently unsupported. Only predicates that can be translated into a column value range are supported.
For logical connections of simple predicates, only AND is supported. If the whole complex predicate contains a single OR, nothing is pushed down to Spark. In this case, Spark transfers all tuples to Vector and the filtering is done solely on the Vector side.
Predicate pushdown is considered only from Vector to Spark. The predicate evaluation on the Spark side is determined by the Spark framework. How far Spark and its internal optimizer push down predicates (how close to the source) depends on the specific file format and the Spark version. For example, in Spark 2, Parquet files may already be filtered during parsing, but this is not possible for CSV files. Filtering during CSV file parsing is supported in Spark 3.
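For example, using the ext_csv table from Example 1 below, the range predicates in the following query can be pushed down to Spark:
SELECT col1, col2
FROM ext_csv
WHERE col1 >= 10 AND col1 < 100;
A predicate such as WHERE col1 < 10 OR col2 = 'abc' is not pushed down; Spark transfers all tuples and Vector applies the filter.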
CREATE EXTERNAL TABLE Examples
Note:  The column names in the Vector external table must match the column names in the reference source.
1. Define an external table for a CSV data source:
CREATE EXTERNAL TABLE ext_csv (col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL)
USING SPARK
WITH REFERENCE='hdfs://namenode:8020/user/mark/file.csv',
 OPTIONS=('DELIMITER'='|', 'HEADER'='TRUE', 'SCHEMA'='col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL');
Notes:
If the CSV file does not have a header, the column names in the external table must match those generated by spark-csv, that is, C0, C1, .... If the option 'HEADER' = 'TRUE' is not specified and the Vector external table definition uses different names, an error like the following is returned:
E_VW1213 External table provider reported an error
'org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns C0, C1; line 1 pos 38'.
If external table columns are defined as NOT NULL, the table schema must be specified with NOT NULL in the SCHEMA option, even if HEADER is set to TRUE. This is because when Spark SQL implicitly determines the schema from, for example, a CSV file, the nullability of columns typically defaults to true.
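Once defined, the external table can be queried like any other table, and its data can be loaded into a native Vector table with INSERT ... SELECT. In the following sketch, my_vector_table is assumed to already exist with a compatible column layout:
SELECT col1, col2 FROM ext_csv WHERE col1 > 100;
INSERT INTO my_vector_table SELECT col1, col2 FROM ext_csv;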
2. Define an external table for a CSV data source residing in Amazon S3. The CSV data does not have a header row:
CREATE EXTERNAL TABLE nation_s3 (
   n_nationkey INTEGER NOT NULL,
   n_name CHAR(25) NOT NULL,
   n_regionkey INTEGER NOT NULL
) USING SPARK WITH REFERENCE='s3a://<bucket>/nation.csv',FORMAT='csv',
OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_nationkey INT NOT NULL, n_name STRING NOT NULL, n_regionkey INT NOT NULL');
Note:  The column names in the schema must match the column names of the external table. The Spark data type specified in the schema must be compatible with the Vector data type for the matching column. See “Vector to Spark Data Type Mapping” in the User Guide.
3. Define an external table for an ORC data source:
CREATE EXTERNAL TABLE my_table_orc(a INT8 NOT NULL)
USING SPARK WITH REFERENCE='hdfs://hornet:8020/user/mark/my_table.orc';
4. Define an external table for an AVRO data source:
CREATE EXTERNAL TABLE tweets
(username VARCHAR(20),
tweet VARCHAR(100),
timestamp VARCHAR(50))
USING SPARK
WITH REFERENCE='hdfs://blue/tmp/twitter.avro',
FORMAT='com.databricks.spark.avro';
5. Define an external table for a JSON data source:
Example JSON file:
[{
     "symbol": "MSFT",
     "company": "Microsoft Corporation",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 170.777,
     "volume": 36142690,
     "sector": "TECH"
},
{
     "symbol": "AAPL",
     "company": "Apple Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 309.51,
     "volume": 49897096,
     "sector": "TECH"
},
{
     "symbol": "GOOG",
     "company": "Alphabet Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 1434.23,
     "volume": 2417214,
     "sector": "TECH"
},
{
     "symbol": "AMZN",
     "company": "Amazon.com, Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 2008.72,
     "volume": 15567283,
     "sector": "TECH"
}]
Example SQL to create an external table that references this JSON file stored in Azure Data Lake Storage:
CREATE EXTERNAL TABLE techstocks (
     symbol VARCHAR(4),
     company VARCHAR(20),
     stock_tstamp TIMESTAMP,
     price FLOAT,
     volume INTEGER,
     sector CHAR(5)
) USING SPARK WITH REFERENCE='abfs://stockquotes@eastusstockdata.dfs.core.windows.net/tech1.json',
     FORMAT='json',
     OPTIONS=(
     'multiline'='true',
     'SCHEMA'= 'symbol string, company string, stock_tstamp string, price double, volume integer, sector string');
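The external table can then be queried directly, for example:
SELECT symbol, price, volume
FROM techstocks
WHERE price > 300;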
For more examples, see the User Guide.