Spark-Vector Provider Log
To troubleshoot issues with the Spark-Vector Provider, check the log file: $II_SYSTEM/ingres/files/spark_provider.log.
Configuring Logging
The log4j.properties file in $II_SYSTEM/ingres/files/spark-provider controls the logging levels for messages printed to spark_provider.log by the Spark-Vector Provider application. The default logging level for the provider is INFO; the default logging level for all other components is ERROR. You can modify this behavior.
For example, to see all messages written by Spark-Vector components at DEBUG level or higher, uncomment the following line:
log4j.logger.com.actian.spark_vector=DEBUG
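For reference, the relevant portion of log4j.properties might look like the following sketch (the provider logger name shown here is an assumption; verify the names in the shipped file before editing):
# Provider messages are logged at INFO by default (logger name assumed)
log4j.logger.com.actian.spark_vector.provider=INFO
# Uncomment to log all Spark-Vector components at DEBUG level or higher
#log4j.logger.com.actian.spark_vector=DEBUG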
Spark-Vector Provider Configuration
The configuration for the Spark-Vector Provider is read from $SPARK_HOME/conf/spark-defaults.conf. To override these settings for the provider, you can specify Spark configuration options in $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf.
Note:  For the configuration options in spark-defaults.conf to take effect permanently, they must be included in spark_provider.conf.
Any additional features that cannot be configured through the configuration file can be added directly to the spark-submit command that starts the Spark-Vector Provider in $II_SYSTEM/ingres/bin/spark_provider.
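As an illustration, spark_provider.conf uses the standard Spark properties-file syntax; the property values below are examples only, not shipped defaults:
spark.master              yarn
spark.executor.memory     4g
spark.executor.cores      2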
Syntax for Defining an External Table
The syntax for defining an external table is as follows:
CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type})
USING SPARK
WITH REFERENCE='reference'
[,FORMAT='format']
[,OPTIONS=('key'='value' {,'key'='value'})]
For details, see CREATE EXTERNAL TABLE in the SQL Language Guide.
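For example, the following statement defines a CSV-backed external table; the HDFS path, columns, and options are illustrative, and depending on your Spark version the format name may be 'csv' or 'com.databricks.spark.csv':
CREATE EXTERNAL TABLE test_table_csv
(id INT NOT NULL,
name VARCHAR(50))
USING SPARK
WITH REFERENCE='hdfs://cluster/tmp/test_table.csv',
FORMAT='csv',
OPTIONS=('header'='true','delimiter'=',')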
Reading and Writing to an External Table
After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular Vector tables. You can issue queries like the following:
SELECT * FROM test_table_csv
INSERT INTO my_table_orc SELECT some_column FROM other_table
How to Add Extra Packages
By default, the Spark-Vector Provider supports only the Spark-integrated data sources (such as JDBC, JSON, and Parquet) and CSV data sources (the Spark-Vector Provider is bundled with spark-csv 1.4.0).
Follow this process to add extra data sources (packages):
1. Modify $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf (as shown in the following examples).
2. Stop and start the Spark-Vector Provider to put the changes into effect, as follows:
ingstop -spark_provider
ingstart -spark_provider
Here are examples of modifying $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf to add extra data sources:
To add support for reading from and writing to ORC files or Hive tables, add the line:
spark.vector.provider.hive true
To add extra jars, add the line:
spark.jars comma-separated-list-of-jars
To add extra packages, add the line:
spark.jars.packages comma-separated-list-of-packages
For example, to enable support for Cassandra (spark-cassandra) and Redshift (spark-redshift), add the line:
spark.jars.packages datastax:spark-cassandra-connector:1.4.4-s_2.10,com.databricks:spark-redshift_2.10:0.6.0
Note:  For Spark 1.5, to preserve a default Spark configuration (for example, /etc/spark/conf/spark-defaults.conf), its settings must be included in $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf.
To add support for reading and writing AVRO files with Spark 1, add the line:
spark.jars.packages=com.databricks:spark-avro_2.10:2.0.1
If using Spark 2, add the line:
spark.jars.packages=com.databricks:spark-avro_2.11:3.1.0
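Putting these settings together, a spark_provider.conf that enables Hive/ORC support and adds the Avro package (Spark 2 coordinates shown, as an illustration) would contain lines such as:
spark.vector.provider.hive true
spark.jars.packages com.databricks:spark-avro_2.11:3.1.0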
Example external table definition for an AVRO data source:
CREATE EXTERNAL TABLE tweets
(username VARCHAR(20),
tweet VARCHAR(100),
timestamp VARCHAR(50))
USING SPARK
WITH REFERENCE='hdfs://blue/tmp/twitter.avro',
FORMAT='com.databricks.spark.avro'
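Once defined, the table can be queried like any other Vector table, for example (the filter value is illustrative):
SELECT username, tweet FROM tweets WHERE username = 'actian';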
External Table Limitations
The external table feature has the following limitations:
WARNING! The options retained in the CREATE EXTERNAL TABLE definition are not secure; they can appear in clear text in tracing and logging. We recommend that no sensitive information be used in the OPTIONS portion of the definition. This is particularly relevant for JDBC data sources, whose OPTIONS typically include connection credentials.
The Spark-Vector Provider is a Spark application running as the user who owns the Vector installation, typically actian. This means that only data sources this user has permission to access (read or write) can be used in Vector.
Operations not supported on external tables are usually reported with explicit errors. Unsupported operations include the following:
Creating keys (of any type) and creating indexes (of any type) are not supported because they cannot be enforced (that is, they can be violated by the actions of an external entity)
Adding and dropping columns
Updates and deletes
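For example, statements such as the following against an external table (using the illustrative test_table_csv definition shown earlier) are rejected with an error:
CREATE INDEX test_idx ON test_table_csv (id);
ALTER TABLE test_table_csv ADD COLUMN extra INT;
UPDATE test_table_csv SET name = 'x' WHERE id = 1;
DELETE FROM test_table_csv WHERE id = 1;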
External Table Usage Notes
Note the following when using external tables:
For writing to external tables, the mode is SaveMode.Append. Some data sources do not support this mode. For example, you cannot write to existing CSV files because spark-csv does not support this mode.
CREATE EXTERNAL TABLE does not validate the supplied values for REFERENCE, FORMAT, or OPTIONS until the external table is used. Although this can be confusing at first, the following sequence therefore results in an error if the target does not yet exist:
CREATE EXTERNAL TABLE test_table(col INT NOT NULL) USING SPARK
WITH REFERENCE='hdfs://cluster06:8020/user/mark/test_table.json';
SELECT * FROM test_table; \g
Executing . . .
E_VW1213 External table provider reported an error 'java.io.IOException:
No input paths specified in job'.
However, as soon as the VectorH user inserts some data, the external table is created at the referenced location (for example, files are written to HDFS or a new table is created in Hive), and a subsequent SELECT statement succeeds:
INSERT INTO test_table VALUES (1);
SELECT * FROM test_table; \g
Executing . . .
(1 row)
 
┌─────────────┐
│col          │
├─────────────┤
│ 1           │
└─────────────┘
(1 row)