Using External Tables
Introduction to External Tables
The External Tables feature lets you read from and write to data sources stored outside of Vector. The data source must be one that Apache Spark can read from and write to, such as files stored in HDFS in formats like Parquet, ORC, or JSON, or tables in external database systems.
The CREATE EXTERNAL TABLE statement creates a Vector table that points at existing data files in locations outside the Vector data directories. This feature eliminates the need to import the data into a new table when the data files are already in a known location and in the desired file format.
After the data file structure is mapped to Vector format using the CREATE EXTERNAL TABLE statement, you can:
Select, join, or sort external table data
Create views for external tables
Insert data into external tables
Import and store the data into a Vector database
The data is queried in its original location, and Vector leaves the data files in place when you drop the table.
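For illustration, the following is a minimal sketch of mapping a Parquet file to an external table and querying it in place. The table name, column list, and HDFS path are hypothetical, and the sketch assumes the USING SPARK WITH REFERENCE form of the statement:
-- Map an existing Parquet file in HDFS to a Vector external table
-- (hypothetical table name, columns, and HDFS path)
CREATE EXTERNAL TABLE ext_orders (
    order_id INT,
    amount   FLOAT8
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/orders/*.parquet',
     FORMAT='parquet';

-- Query the external data in place, as if it were a regular table
SELECT order_id, amount FROM ext_orders ORDER BY amount DESC;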
The following statements cannot be performed against external tables:
MODIFY
CREATE INDEX
How External Tables Work
Vector leverages Apache Spark's extensive connectivity through the Spark-Vector Connector for the External Tables functionality.
The External Tables architecture is composed of two main components:
Spark-Vector Provider
Vector
Vector receives user queries that operate on external tables and rewrites them into JSON requests for external data, which are sent to the Spark-Vector Provider. The Spark-Vector Provider is a Spark application that behaves as a multi-threaded Spark server. It receives requests from Vector, translates them into Spark jobs, and launches them. These jobs typically issue queries (to SparkSQL) like “INSERT INTO vector_table SELECT * FROM external_resource” for reading external data or “INSERT INTO external_resource SELECT * FROM vector_table” for writing to external systems. Finally, these jobs use the Spark-Vector Connector to push and pull data into and out of Vector.
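To make the two directions concrete, the following sketch reuses the hypothetical tables from the earlier example: the first statement imports external data into a native Vector table, and the second exports Vector data back out through the external table. Each statement is rewritten by Vector into a request that the provider runs as a Spark job:
-- Import: materialize the external data in a native Vector table
CREATE TABLE orders AS SELECT * FROM ext_orders;

-- Export: write rows from Vector out through the external table
INSERT INTO ext_orders SELECT order_id, amount FROM orders;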
The DBMS configuration parameter insert_external in config.dat controls whether inserts into X100 external tables are allowed. The default is ON, which allows inserts. Inserts can be blocked by setting this parameter to OFF.
The DBMS configuration parameter ext_table_no_local in config.dat determines whether the CREATE EXTERNAL TABLE and INSERT INTO EXTERNAL CSV operations can access the local file system. A value of ON prevents access to the local file system. A value of OFF (the default) allows access.
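As a hedged sketch, both parameters could be changed with iisetres, the same utility this chapter uses for the ingstart settings below. The dbms resource component shown here is an assumption; confirm the exact parameter path with iigetres before applying it:
# Block inserts into X100 external tables (assumed resource path; verify with iigetres)
iisetres "ii.$(iipmhost).dbms.*.insert_external" OFF
# Prevent external table operations from accessing the local file system
iisetres "ii.$(iipmhost).dbms.*.ext_table_no_local" ON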
External Table Requirements
The following requirements must be met to use external tables:
Spark must be installed and its binaries available in PATH. For the required Spark version, see the VectorH readme.
To verify the Spark installation, issue:
which spark-submit
Vector 5.0 or above must be installed. To verify that the Spark-Vector Provider is installed, issue:
ls $II_SYSTEM/ingres/lib/spark_vector_provider.jar
Starting and Stopping the Spark-Vector Provider
Make sure the Spark-Vector Provider is enabled by setting the configuration parameter ingstart.*.spark_provider=1, as follows:
iisetres "ii.$(iipmhost).ingstart.*.spark_provider" 1
The Spark-Vector Provider is then started and stopped automatically with the Vector instance.
To start or stop only the provider, issue one of the following:
ingstart -spark_provider
ingstop -spark_provider