Using External Tables
Introduction to External Tables
The External Tables feature lets you read from and write to data sources stored outside of Vector. The data source must be one that Apache Spark can read from and write to, such as files in formats like Parquet, ORC, or JSON, or tables in external database systems.
The CREATE EXTERNAL TABLE statement creates a Vector table that points at existing data files in locations outside the Vector data directories. This feature eliminates the need to import the data into a new table when the data files are already in a known location and in the desired file format.
After the data file structure is mapped to Vector format using the CREATE EXTERNAL TABLE statement, you can:
Select, join, or sort external table data
Create views for external tables
Insert data into external tables
Import and store the data in a Vector database
The data is queried in its original location, and Vector leaves the data files in place when you drop the table.
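For example, assuming an external table named test_table_csv has already been defined (the definition syntax is shown below), a minimal sketch of these operations might look like the following; all table, view, and column names here are hypothetical:
SELECT col1, col2 FROM test_table_csv ORDER BY col1;
CREATE VIEW csv_view AS
    SELECT col1, col2 FROM test_table_csv;
/* Import the external data into a native Vector table */
CREATE TABLE local_copy AS
    SELECT * FROM test_table_csv;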
The following statements cannot be performed against external tables:
MODIFY
CREATE INDEX
How External Tables Work
For the External Tables functionality, Vector leverages Apache Spark's extensive connectivity through the Spark-Vector Connector. The External Tables architecture is composed of two main components:
Spark-Vector Provider
Vector
Vector receives user queries that operate on external tables, rewrites them into JSON requests for the external data, and sends those requests to the Spark-Vector Provider. The Spark-Vector Provider is a Spark application that behaves as a multi-threaded Spark server: it receives requests from Vector, translates them into Spark jobs, and launches them. These jobs typically issue SparkSQL queries such as "INSERT INTO vector_table SELECT * FROM external_resource" to read external data, or "INSERT INTO external_resource SELECT * FROM vector_table" to write to external systems. Finally, these jobs use the Spark-Vector Connector to push and pull data into and out of Vector.
The DBMS configuration parameter insert_external in config.dat controls whether inserts into X100 external tables are allowed. The default is ON, which allows inserts. Setting this parameter to OFF blocks inserts.
The DBMS configuration parameter ext_table_no_local in config.dat determines whether the CREATE EXTERNAL TABLE and INSERT INTO EXTERNAL CSV operations can access the local file system. A value of ON prevents access to the local file system. A value of OFF (the default) allows access.
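These parameters reside in config.dat and can be changed with the iisetres utility. The following is a sketch only; the exact resource paths are assumptions, so verify the parameter names in your config.dat before applying them:
# Assumed resource paths -- confirm against your config.dat
iisetres ii.$(iipmhost).x100.insert_external OFF
iisetres ii.$(iipmhost).x100.ext_table_no_local ON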
External Table Requirements
The following requirements must be met to use external tables:
Spark must be installed and its binaries available in PATH. For more information, see Setting Up Spark for Use with Vector.
To verify the Spark installation, issue:
which spark-submit
Vector 6.0 or later must be installed. To verify that the Spark-Vector Provider is installed, issue:
ls $II_SYSTEM/ingres/lib/spark_vector_provider1.jar
or
ls $II_SYSTEM/ingres/lib/spark_vector_provider2.jar
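To additionally confirm which Spark version is picked up from PATH, you can run:
spark-submit --version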
Syntax for Defining an External Table
The syntax for defining an external table is as follows:
CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type})
USING SPARK
WITH REFERENCE='reference'
[,FORMAT='format']
[,OPTIONS=('key'='value' {,'key'='value'})]
For details, see CREATE EXTERNAL TABLE in the SQL Language Guide.
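For illustration, the following is a hedged example that maps a CSV file in HDFS to an external table. The table name, column names, and path are hypothetical; 'header' and 'delimiter' are standard spark-csv options:
CREATE EXTERNAL TABLE ext_customers (
    id INT NOT NULL,
    name VARCHAR(64) NOT NULL,
    city VARCHAR(32)
) USING SPARK
WITH REFERENCE='hdfs://namenode:8020/data/customers.csv',
     FORMAT='csv',
     OPTIONS=('header'='true', 'delimiter'=',');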
Reading and Writing to an External Table
After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular Vector tables. You can issue queries like the following:
SELECT * FROM test_table_csv
INSERT INTO my_table_orc SELECT some_column FROM other_table
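External tables can also participate in joins with native Vector tables. A sketch with hypothetical table and column names:
-- Join external ORC data with a native Vector table and sort the result
SELECT o.order_id, o.amount, c.name
FROM my_table_orc o
JOIN customers c ON c.id = o.customer_id
ORDER BY o.amount DESC;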
Adding Extra Data Sources
By default, the Spark-Vector Provider supports only the Spark-integrated data sources (such as JDBC, JSON, and Parquet) and the CSV data source (the Spark-Vector Provider is bundled with spark-csv 2.10).
For information on how to add extra data sources, see How to Add Extra Data Sources.
External Table Limitations
The external table feature has the following limitations:
WARNING! The options retained in the CREATE EXTERNAL TABLE definition are not secure; they can appear in clear text in tracing and logging output. We recommend that no sensitive information be used in the OPTIONS portion of the definition. This is especially relevant for JDBC data sources, whose connection options may include credentials.
The Spark-Vector Provider is a Spark application that runs under the user who owns the Vector installation, typically actian. This means that only data sources that this user has permission to access (read or write) can be used in Vector.
Operations not supported on external tables are usually reported with explicit errors. Unsupported operations include the following:
Creating keys (of any type) and creating indexes (of any type) are not supported because they cannot be enforced (that is, they can be violated by the actions of an external entity)
Adding and dropping columns
Updates and deletes
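For example, statements such as the following would be rejected when issued against an external table (test_table here is hypothetical):
CREATE INDEX test_idx ON test_table(col);
ALTER TABLE test_table ADD COLUMN extra INT;
UPDATE test_table SET col = 2;
DELETE FROM test_table WHERE col = 1;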
External Table Usage Notes
Note the following when using external tables:
When writing to external tables, Vector uses Spark's SaveMode.Append. Some data sources do not support this mode; for example, you cannot write to existing CSV files because spark-csv does not support appending.
CREATE EXTERNAL TABLE does not validate the supplied values for REFERENCE, FORMAT, or OPTIONS until the external table is used. So, although it may be confusing at first, the following sequence results in an error if the referenced target does not yet exist:
CREATE EXTERNAL TABLE test_table(col INT NOT NULL) USING SPARK
WITH REFERENCE='hdfs://cluster06:8020/user/mark/test_table.json';
SELECT * FROM test_table; \g
Executing . . .
E_VW1213 External table provider reported an error 'java.io.IOException:
No input paths specified in job'.
However, as soon as the Vector user inserts some data, the external resource is created at the referenced location (for example, files are written to HDFS or a new table is created in Hive), and a subsequent SELECT statement succeeds:
INSERT INTO test_table VALUES (1);
SELECT * FROM test_table; \g
Executing . . .
 
┌─────────────┐
│col          │
├─────────────┤
│ 1           │
└─────────────┘
(1 row)