Using External Tables
Introduction to External Tables
The External Tables feature lets you read from and write to data sources stored outside of Vector. The data source must be one that Apache Spark can read from and write to, such as files in formats like Parquet, ORC, or JSON, or tables in external database systems.
The CREATE EXTERNAL TABLE statement creates a Vector table that points at existing data files in locations outside the Vector data directories. This feature eliminates the need to import the data into a new table when the data files are already in a known location and in the desired file format.
After the data file structure is mapped to Vector format using the CREATE EXTERNAL TABLE statement, you can:
Select, join, or sort external table data
Create views for external tables
Insert data into external tables
Import and store the data in a Vector database
The data is queried in its original location, and Vector leaves the data files in place when you drop the table.
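For example, assuming an external table named test_table_csv has already been defined (the definition syntax is shown below), a minimal sketch of these operations might look like the following; all table, view, and column names here are hypothetical:
SELECT col1, col2 FROM test_table_csv ORDER BY col1;
CREATE VIEW csv_view AS
    SELECT col1, col2 FROM test_table_csv;
/* Import the external data into a native Vector table */
CREATE TABLE local_copy AS
    SELECT * FROM test_table_csv;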
The following statements cannot be performed against external tables:
MODIFY
CREATE INDEX
How External Tables Work
For the External Tables functionality, Vector leverages Apache Spark's extensive connectivity through the Spark-Vector Connector. The External Tables architecture is composed of two main components:
Spark-Vector Provider
Vector
Vector receives user queries that operate on external tables, rewrites them into JSON requests for the external data, and sends those requests to the Spark-Vector Provider. The Spark-Vector Provider is a Spark application that behaves as a multi-threaded Spark server: it receives requests from Vector, translates them into Spark jobs, and launches them. These jobs typically issue SparkSQL queries such as "INSERT INTO vector_table SELECT * FROM external_resource" to read external data, or "INSERT INTO external_resource SELECT * FROM vector_table" to write to external systems. Finally, these jobs use the Spark-Vector Connector to push and pull data into and out of Vector.
The DBMS configuration parameter insert_external in config.dat controls whether inserts into X100 external tables are allowed. The default is ON, which allows inserts. Setting this parameter to OFF blocks inserts.
The DBMS configuration parameter ext_table_no_local in config.dat determines whether the CREATE EXTERNAL TABLE and INSERT INTO EXTERNAL CSV operations can access the local file system. A value of ON prevents access to the local file system. A value of OFF (the default) allows access.
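These parameters reside in config.dat and can be changed with the iisetres utility. The following is a sketch only; the exact resource paths are assumptions, so verify the parameter names in your config.dat before applying them:
# Assumed resource paths -- confirm against your config.dat
iisetres ii.$(iipmhost).x100.insert_external OFF
iisetres ii.$(iipmhost).x100.ext_table_no_local ON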
External Table Requirements
The following requirements must be met to use external tables:
Spark must be installed and its binaries available in PATH. For more information, see Setting Up Spark for Use with Vector.
To verify the Spark installation, issue:
which spark-submit
Vector 6.0 or later must be installed. To verify that the Spark-Vector Provider is installed, issue:
ls $II_SYSTEM/ingres/lib/spark_vector_provider1.jar
or
ls $II_SYSTEM/ingres/lib/spark_vector_provider2.jar
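To additionally confirm which Spark version is picked up from PATH, you can run:
spark-submit --version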
Syntax for Defining an External Table
The syntax for defining an external table is as follows:
CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type})
USING SPARK
WITH REFERENCE='reference'
[,FORMAT='format']
[,OPTIONS=('key'='value' {,'key'='value'})]
For details, see CREATE EXTERNAL TABLE in the SQL Language Guide.
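For illustration, the following is a hedged example that maps a CSV file in HDFS to an external table. The table name, column names, and path are hypothetical; 'header' and 'delimiter' are standard spark-csv options:
CREATE EXTERNAL TABLE ext_customers (
    id INT NOT NULL,
    name VARCHAR(64) NOT NULL,
    city VARCHAR(32)
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/customers.csv',
     FORMAT='csv',
     OPTIONS=('header'='true', 'delimiter'=',');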
Reading and Writing to an External Table
After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular Vector tables. You can issue queries like the following:
SELECT * FROM test_table_csv
INSERT INTO my_table_orc SELECT some_column FROM other_table
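External tables can also participate in joins with native Vector tables. A sketch with hypothetical table and column names:
-- Join external ORC data with a native Vector table and sort the result
SELECT o.order_id, o.amount, c.name
FROM my_table_orc o
JOIN customers c ON c.id = o.customer_id
ORDER BY o.amount DESC;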
Adding Extra Data Sources
By default, the Spark-Vector Provider supports only the Spark-integrated data sources (such as JDBC, JSON, and Parquet) and the CSV data source (the Spark-Vector Provider is bundled with spark-csv 2.10).
For information on how to add extra data sources, see How to Add Extra Data Sources.
External Table Limitations
The external table feature has the following limitations:
WARNING! The options retained in the CREATE EXTERNAL TABLE definition are not secure; they can appear in clear text in tracing and logging output. We recommend that no sensitive information be used in the OPTIONS portion of the definition. This is especially relevant for JDBC data sources, whose connection options may include credentials.
The Spark-Vector Provider is a Spark application that runs under the user who owns the Vector installation, typically actian. This means that only data sources that this user has permission to access (read or write) can be used in Vector.
Operations not supported on external tables are usually reported with explicit errors. Unsupported operations include the following:
Creating keys (of any type) and creating indexes (of any type) are not supported because they cannot be enforced (that is, they can be violated by the actions of an external entity)
Adding and dropping columns
Updates and deletes
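For example, statements such as the following would be rejected when issued against an external table (test_table here is hypothetical):
CREATE INDEX test_idx ON test_table(col);
ALTER TABLE test_table ADD COLUMN extra INT;
UPDATE test_table SET col = 2;
DELETE FROM test_table WHERE col = 1;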
External Table Usage Notes
Note the following when using external tables:
When writing to external tables, Vector uses Spark's SaveMode.Append. Some data sources do not support this mode; for example, you cannot write to existing CSV files because spark-csv does not support appending.
CREATE EXTERNAL TABLE does not validate the supplied values for REFERENCE, FORMAT, or OPTIONS until the external table is used. So, although it may be confusing at first, the following sequence results in an error if the referenced target does not yet exist:
CREATE EXTERNAL TABLE test_table(col INT NOT NULL) USING SPARK
WITH REFERENCE='hdfs://cluster06:8020/user/mark/test_table.json';
SELECT * FROM test_table; \g
Executing . . .
E_VW1213 External table provider reported an error 'java.io.IOException:
No input paths specified in job'.
However, as soon as the Vector user inserts some data, the external resource is created at the referenced location (for example, files are written to HDFS or a new table is created in Hive), and a subsequent SELECT statement succeeds:
INSERT INTO test_table VALUES (1);
SELECT * FROM test_table; \g
Executing . . .
 
┌─────────────┐
│col          │
├─────────────┤
│ 1           │
└─────────────┘
(1 row)