How to Add Extra Packages


By default, the Spark-Vector Provider supports only the data sources built into Spark (such as JDBC, JSON, and Parquet) and CSV data sources (the provider is bundled with spark-csv 1.4.0).

Follow this process to add extra data sources (packages):

1. Modify $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf (as shown in the following examples).

2. Stop and start the Spark-Vector Provider to put the changes into effect, as follows:

ingstop -spark_provider

ingstart -spark_provider
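The two steps above can be sketched as a small shell helper. This is only an illustration, not part of the product: the function name add_provider_property is hypothetical, and the sketch assumes the configuration file holds one "key value" (or "key=value") entry per line, as in the examples on this page.

```shell
#!/bin/sh
# Sketch only: append a "key value" line to a Spark-Vector Provider
# configuration file if the key is not already set.
# The function name is hypothetical and not part of the product.
add_provider_property() {
    conf_file=$1
    key=$2
    value=$3

    # Keep a backup copy before touching the file.
    cp "$conf_file" "$conf_file.bak" || return 1

    # Escape dots so the key is matched literally, then look for an
    # existing entry (the key may be followed by whitespace or '=').
    escaped=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${escaped}([[:space:]]|=)" "$conf_file"; then
        echo "$key is already set in $conf_file; edit it by hand" >&2
        return 2
    fi

    # Make sure the new entry starts on its own line.
    if [ -n "$(tail -c 1 "$conf_file")" ]; then
        echo >> "$conf_file"
    fi
    printf '%s %s\n' "$key" "$value" >> "$conf_file"
}

# Example use (commented out so that sourcing this file changes nothing):
# add_provider_property \
#     "$II_SYSTEM/ingres/files/spark-provider/spark_provider.conf" \
#     spark.vector.provider.hive true &&
#     ingstop -spark_provider && ingstart -spark_provider
```

The helper refuses to overwrite an existing entry, so a key that is already present must be changed by hand before the provider is restarted.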

Here are examples of modifying $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf to add extra data sources:

• To add support for reading from and writing to ORC files or Hive tables, add the line:

spark.vector.provider.hive true

• To add extra jars, add the line:

spark.jars comma-separated-list-of-jars

• To add extra packages, add the line:

spark.jars.packages comma-separated-list-of-packages

For example, to enable support for Cassandra (spark-cassandra) and Redshift (spark-redshift), add the line:

spark.jars.packages datastax:spark-cassandra-connector:1.4.4-s_2.10,com.databricks:spark-redshift_2.10:0.6.0

Note: For Spark 1.5, to preserve the settings of a default Spark configuration file (for example, /etc/spark/conf/spark-defaults.conf), they must be included in $II_SYSTEM/ingres/files/spark-provider/spark_provider.conf.

• To add support for reading and writing Avro files with Spark 1, add the line:

spark.jars.packages=com.databricks:spark-avro_2.10:2.0.1

If using Spark 2, add the line:

spark.jars.packages=com.databricks:spark-avro_2.11:3.1.0

Example external table definition for an Avro data source:

CREATE EXTERNAL TABLE tweets
(username VARCHAR(20),
 tweet VARCHAR(100),
 timestamp VARCHAR(50))
USING SPARK
WITH REFERENCE='hdfs://blue/tmp/twitter.avro',
     FORMAT='com.databricks.spark.avro'
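
The examples above each show a separate spark.jars.packages line. If spark_provider.conf is read like a standard Spark properties file (an assumption, not confirmed on this page), a second line with the same key would likely replace the first rather than add to it, so several packages are best kept in a single comma-separated list. The following sketch merges a package coordinate into an existing entry; the function name merge_spark_package is hypothetical.

```shell
#!/bin/sh
# Sketch only: add one package coordinate to the spark.jars.packages entry
# of a configuration file, creating the entry if it does not exist.
# The function name is hypothetical and not part of the product.
merge_spark_package() {
    conf_file=$1
    coordinate=$2
    tmp_file=$(mktemp) || return 1

    awk -v pkg="$coordinate" '
        # Match the key followed by a space, a tab, or "=".
        /^[ \t]*spark\.jars\.packages[ \t=]/ {
            found = 1
            sub(/[ \t]+$/, "")            # drop trailing whitespace
            # Simple substring check: append only if not already listed.
            if (index($0, pkg) == 0) $0 = $0 "," pkg
        }
        { print }
        END { if (!found) print "spark.jars.packages " pkg }
    ' "$conf_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }

    # Copy the content back so the original file keeps its permissions.
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}

# Example use (commented out so that sourcing this file changes nothing):
# merge_spark_package \
#     "$II_SYSTEM/ingres/files/spark-provider/spark_provider.conf" \
#     com.databricks:spark-avro_2.10:2.0.1
```

As with any change to the configuration file, the Spark-Vector Provider must be stopped and started again before the merged list takes effect.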