Was this helpful?
CREATE EXTERNAL TABLE Examples
Note:  The column names in the Vector external table must match the column names in the reference source.
1. Define an external table for a CSV data source:
CREATE EXTERNAL TABLE ext_csv (col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL)
USING SPARK
WITH REFERENCE='user/mark/file.csv',
  OPTIONS=('DELIMITER' = '|', 'HEADER' = 'TRUE','SCHEMA'='col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL');
Notes:
If the CSV file does not have a header, the name of the columns in the external table must match those in spark-csv, that is, C0, C1, .... If the option 'HEADER' = 'TRUE' is not specified and the Vector external table definition has different names, you will get an error like:
E_VW1213 External table provider reported an error
'org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns C0, C1; line 1 pos 38'.
If external table columns are defined as NOT NULL, the table schema must be specified with NOT NULL in the SCHEMA option, even if header is set to true. This is because if Spark SQL implicitly determines the schema information from, for example, a CSV file, the nullability of columns typically defaults to true.
2. Define an external table for a CSV data source residing in Amazon S3. The CSV data does not have a header row:
CREATE EXTERNAL TABLE nation_s3 (
   n_nationkey INTEGER NOT NULL,
   n_name CHAR(25) NOT NULL,
   n_regionkey INTEGER NOT NULL
) USING SPARK WITH REFERENCE='s3a://<bucket>/nation.csv',FORMAT='csv',
OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_nationkey INT NOT NULL, n_name STRING NOT NULL, n_regionkey INT NOT NULL');
Note:  The column names in the schema must match the column names of the external table. The Spark data type specified in the schema must be compatible with the Vector data type for the matching column. See “Vector to Spark Data Type Mapping” in the User Guide.
3. Define an external table for a JSON data source:
Example JSON file:
[{
     "symbol": "MSFT",
     "company": "Microsoft Corporation",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 170.777,
     "volume": 36142690,
     "sector": "TECH"
},
{
     "symbol": "AAPL",
     "company": "Apple Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 309.51,
     "volume": 49897096,
     "sector": "TECH"
},
{
     "symbol": "GOOG",
     "company": "Alphabet Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 1434.23,
     "volume": 2417214,
     "sector": "TECH"
},
{
     "symbol": "AMZN",
     "company": "Amazon.com, Inc.",
     "stock_tstamp": "2020-01-31T21:00:00+00:00",
     "price": 2008.72,
     "volume": 15567283,
     "sector": "TECH"
}]
Example SQL to create an external table to reference that JSON file stored on an Azure ADL:
CREATE EXTERNAL TABLE techstocks (
     symbol VARCHAR(4),
     company VARCHAR(20),
     stock_tstamp TIMESTAMP,
     price FLOAT,
     volume INTEGER,
     sector CHAR(5)
) USING SPARK WITH REFERENCE='json file path need to mention',
     FORMAT='json',
     OPTIONS=(
     'multiline'='true',
     'SCHEMA'= 'symbol string, company string, stock_tstamp string, price FLOAT, volume integer, sector string');\g
4. Remove all rows where text column contains NULL character. The filtering is done completely on Spark side with possible predicate pushdown to the data source:
CREATE EXTERNAL TABLE filter_test(id INT, text VARCHAR(20))
USING SPARK WITH REFERENCE='test.csv',
FORMAT='csv',
OPTIONS=(
'header'='false',
'schema'='id Integer, text String',
'filter'='text NOT LIKE "%\\u0000%"'
);
5. Create an external table test_ext referencing a Vector table test in another Vector instance:
CREATE EXTERNAL TABLE test_ext(id Int, text VARCHAR(20))
USING SPARK WITH REFERENCE='dummy',
FORMAT='vector',
OPTIONS=(
'host'='localhost',
'port'='VW7',
'database'='testdb',
'table'='test',
'user'='actian',
'password'='actian',
'staging'='select id, replace(text, "e", "i") as text from THIS_TABLE’,
);
Staging SparkSQL creates an intermediate table that does string replacement. THIS_TABLE is a designated keyword.
6. Map schema by simply unnesting a nested parquet file:
CREATE EXTERNAL TABLE unnested(id INT, name VARCHAR(20), surname VARCHAR(20))
USING SPARK WITH REFERENCE='nested.parquet',
FORMAT='parquet',
OPTIONS=(
'SCHEMA'='id integer, fullname struct(name string, surname string)',
'STAGING'='select id, fullname.name as name, fullname.surname as surname from THIS_TABLE'
);
SCHEMA describes the nested schema of the referenced parquet file. STAGING maps the source schema onto the schema of the external table.
Note:  The schema of the staging table must match the schema of the external table.
Note:  In the staging table expression, you can use all functions provided by SparkSQL (https://spark.apache.org/docs/3.1.1/api/sql/index.html). However, using NOT, AND, OR in a WHERE clause could raise an exception if Spark decides for predicate pushdown, because the Spark-Vector Provider doe not currently support pushdown of those predicates.
Note:  This feature is available for Spark Provider 3 only.
For more examples, see the chapter “Using External Tables” in the User Guide.
Last modified date: 06/28/2024