Vector Operators
The Vector operators provide the ability to transfer data to and from a Vector database quickly and efficiently. They manage the data in bulk using APIs or utilities specific to the Vector implementation. This allows them to provide better performance than solely using JDBC access.
For more information, refer to the following topics:
LoadActianVector Operator
The LoadActianVector operator is used to bulk-load data into an Actian Vector database. The operator provides a single input port containing the data to load into Vector. There are several methods for loading the data that are supported:
• Direct: Data is streamed directly into the Vector engine. This method is generally faster than the other methods. It is also more efficient because data is streamed directly into the Vector engine without having to stage it first. This method also supports execution within a cluster environment. As such, data can be efficiently moved from HDFS into Vector. Currently, direct loading is only supported on Linux and Windows 64-bit platforms.
• vwload: The vwload utility provides the generally fastest method to bulk load data into Vector. The utility can only be run on the Vector server machine. The user running vwload must have DBA privileges.
• SQL COPY: This command can be executed on machines remote to the Vector server. The Vector client software must be installed and configured to use SQL COPY. The user does not have to have DBA privileges to run SQL COPY. SQL COPY does require copy permission on the target table.
• COPY VWLOAD: This command can be executed on machines remote to the Vector server. The command allows the vwload utility to be executed using a SQL command. This command is useful if the vwload utility is not available on the path of the target instance. COPY VWLOAD does require copy permission on the target table.
Direct Load
When using the direct load capability, the data is streamed from the input port directly into the Vector engine. Direct loading can run in parallel and supports execution within a cluster environment such as Hadoop. Direct loading can be used to copy data from HDFS into a Vector instance. When run within a Hadoop cluster, the reading of the data, formatting and sending to Vector operations are run distributed, taking full advantage of the Hadoop resources.
When loading to Vector database up to version 4.2, direct loader sorts the data on clustered index before loading. Direct load method leverages parallel execution capabilities in DataFlow to sort the data before loading to Vector database up to version 4.2. When unique keys are defined and they are the same as sort keys, direct loader eliminates the duplicates before loading. However, this duplicate key elimination is limited to data being loaded, not to the data existing in the table already. As a result, incremental loads are not supported when sort keys or unique keys are defined on a table.
This loader has certain limitations:
• When both Sort Keys and Unique Keys are defined, both should be same, else the loading will fail.
• When Sort Key or Unique Key are specified, incremental loads are not supported. This means the target table should be empty, else the loading will fail.
• A DataFlow workflow can load to only one version of Vector at a time. When executing workflows that load to different versions of Vector, the KNIME UI should be restarted.
• To load data using the direct method, the user should have MONITOR privilege for Vector version 3.0 and DB Admin privilege for Vector version 3.5.
• There may be loss of precision when converting double to decimal value. Converting from String to Decimal preserves the precision.
• When loading Time values with time zones the timestamp type should be used to guarantee the data is loaded correctly. When the time zone is unspecified such as when using Time types, the local time zone will be used.
vwload, COPY VWLOAD, and SQL COPY
When using either the vwload, COPY VWLOAD, or SQL COPY methods for loading, the LoadActianVector operator reads its input data and stages the data into temporary files in the correct format for loading. This data staging is done in parallel according to the current parallelism set for the application, or specifically for the operator. After the data is staged into the load format, the load procedure is invoked (vwload, COPY VWLOAD, or SQL COPY). The staged files are deleted after the load completes, whether successfully or not.
The source fields can be mapped to the target table columns using a provided field map (see the renameMapping property). If a mapping is not provided, the source data fields will be mapped by position to the target table columns. The mapping happens in schema definition order of the target table. The source field names do not have to match the target columns names.
For example, given a table with three columns:
• orderkey (varchar)
• quantity (integer)
• price (double)
Loading data into this table requires three input fields that match the given types (string, integer, and double). If a field mapping is not provided, the input data will be mapped by position to the target columns. So the first input field will be mapped to column orderkey, the second input field to column quantity, and the third input field to column price. If the input data is not ordered correctly, then the load will fail. Providing a field map specifies which input fields map to which database columns. See the code examples for the details of providing a field map.
Enable rollback to abort a data load after the specified number of errors occur. The errors in question are ones caused by failing to parse records from the staged load files. When a data load is aborted and rollback is enabled, no data will be written to the target table. The maximum errors tolerated defaults to zero. Other critical system errors may abort a load directly without consideration of the max errors property.
Enable the jdbcOnly option to limit the data load to using JDBC exclusively instead of opening additional socket connections to the instance. Currently, this option only applies to COPY VWLOAD. This option can be used if additional ports other than the one used for the JDBC connection cannot be created due to security or firewall restrictions.
To capture the input records that are rejected by vwload, COPY VWLOAD, or SQL COPY load process, set the rejectsPath property. The rejected records will be written to a file at this path. If no rejects are generated, the rejects file will not be created.
The output of vwload, COPY VWLOAD, or the SQL command will be captured and written to the application log at the INFO log level. View this information for details about the load process.
Null Values
When inserting null values with either direct load or vwload, the null settings of the table will be enforced by the loader. This means that an error will be thrown when trying to load a column that is not nullable with a null.
Time and Timestamp Precision
When loading Time and Timestamp columns using vwload, COPY VWLOAD, or SQL COPY the maximum supported precision is milliseconds. If greater precision is required, direct loading should be used, which supports nanosecond precision.
Code Examples
The following code example demonstrates using the LoadActianVector operator using the Java API. The direct method is used. The user and password provided are required to be the DBA account.
Using the LoadActianVector operator in Java (directload)
// Build a map of source field names to target table column names.
// Only these fields will loaded. Other source fields are dropped.
// Target columns not mapped will be null filled.
Map<String, String> fieldMap = new HashMap<String, String>();
fieldMap.put("orderkey", "l_orderkey");
fieldMap.put("partkey", "l_partkey");
fieldMap.put("suppkey", "l_suppkey");
fieldMap.put("linenumber", "l_linenumber");
fieldMap.put("quantity", "l_quantity");
fieldMap.put("extendedprice", "l_extendedprice");
// Create the operator within an application and set the properties.
// Use direct for this load. No need to set maxErrors, rollback is disabled.
LoadActianVector loader = app.add(new LoadActianVector());
loader.setHost("database-server");
loader.setInstance("VW");
loader.setDatabaseName("tpch");
loader.setTableName("lineitem");
loader.setMethod(LoadMethod.DIRECT);
loader.setUser("user");
loader.setPassword("password");
loader.setRenameMapping(fieldMap);
loader.setRollback(false);
The following example demonstrates using the operator in RushScript. Note that because the vwload utility is used, the application must be run on the Vector server.
Using the LoadActianVector Operator in RushScript (vwload)
var fieldMap = {
'orderkey':'l_orderkey',
'partkey':'l_partkey',
'suppkey':'l_suppkey',
'linenumber':'l_linenumber',
'quantity':'l_quantity',
'extendedprice':'l_extendedprice'
};
// Load data into Vector using vwload.
dr.loadActianVector(
data,
{
host:'<IP address or FQDN of the cluster>',
instance:'VW',
databaseName:'tpch',
tableName:'lineitem',
rejectsPath:'rejects.txt',
renameMapping:fieldMap,
rollback:true,
maxErrors:10,
method:'VWLOAD',
user:'user',
password:'password'
- });
Properties
The LoadActianVector operator supports the following properties.
Name | Type | Description |
|---|
charset | String | Determines the character set that is used during staging with VWLOAD method. For the vwload supported character sets, see the Actian Vector documentation. |
cleanData | boolean | If this is enabled, additional operations and checks may be performed on the data before loading it into Vector to ensure it meets any table constraints or other requirements. Additionally, invalid values will be loaded as nulls if the table allows instead of producing errors, such as when stringTruncationError or decimalTruncationError are enabled. Default: false. |
database | String | The name of the target database. |
decimalTruncationError | boolean | Whether an error will be thrown if decimal truncation would occur during database loading. Default: false. |
extraProperties | Map(String, String) | A map of key value properties that will be used when creating the connection to the Vector database. |
finalizeTableSQL | String | The SQL statement to execute after processing all the records. For example, this could be used to execute a SQL CREATE INDEX statement. |
host | String | Name of the Vector host. |
initializeTableSQL | String | The SQL statement to execute before processing any records. |
insertMode | String | The insert mode (ROW, BULK, or DEFAULT) that applies for Insert and Merge operations when Direct loading. Default: DEFAULT. |
instance | String | Name of the Vector instance. Default: "VW" |
jdbcOnly | boolean | When jdbcOnly is enabled, the load attempts to use the JDBC connection exclusively instead of remotely executing the command. Default: false. |
maxErrors | int | The maximum errors tolerated before aborting and rolling back a data load. Only applicable when rollback is enabled. Default: 0 |
method | LoadMethod | The load method to use for loading the data. There are currently three choices with the following constraints: • DIRECT: Streams the data directly into the Vector engine. Only a privileged user can use the direct method. The user must also have access to write the target table. • VWLOAD: This utility is used to load the data. The vwload utility must be executed on the Vector server. It is generally a faster method to load data. Only the DBA user can execute vwload. • SQLCOPY: This command can be used remotely and is used to load the data. The Vector client software must be installed and configured for SQL COPY to work. The user does not require DBA permissions but must have copy access to write to the target table. • COPYVWLOAD: This command is used to load the data. It can also be used remotely to initiate execution of vwload through SQL. |
nullIndicator | String | The null indicator used if the loading method stages the files. Default is single space. |
password | String | The password for the user account that has write/copy permissions for the target table. This property is always needed for the SQL copy method. It is optionally required for use with vwload. When provided with vwload, it allows the DBA to impersonate the database user for the purpose of loading data into the target table. |
port | int | The port used by the Vector instance. Default: 7. |
rejectsPath | String | Path to the target file containing rejected records from the data load. If no records are rejected, the file will not be created. |
renameMapping | Map<String, String> | A map of source field names to target table column names. If the input data schema matches the target table schema in number of fields, order, and type of the fields, then this property is not needed. If a map is provided source fields that are not mapped are dropped. Likewise, target columns that are not mapped will be null-filled. Non-nullable fields that are not mapped will result in a SQL error. |
rollback | boolean | When rollback is enabled, the data load will be aborted and all data rolled back if the max error limit is crossed. If disabled, the data load will continue if errors occur. The rejected records will be written to the rejectsPath file. |
sshPassword | String | Operating system password used to connect to NameNode of the Hadoop installation. This operator uses this password to establish an SSH connection to the machine running the master node of the Vector in Hadoop installation. |
sshUser | String | Operating system user ID used to connect to NameNode of the Hadoop installation. This operator uses this user ID to establish an SSH connection to the machine running the master node of the Vector in Hadoop installation. |
stringTruncationError | boolean | Whether an error will be thrown if string truncation would occur during database loading. Default: false. |
table | String | The name of the table that is the target of the data load. |
tmpDirectory | String | A temporary directory to be used for storing the intermediate loader files. If not set, it will attempt to use the local default temporary directory. A directory must be specified that exists in the HDFS file system on the target Hadoop cluster when loading Vector on Hadoop with vwload or the load will not use the distributed loader. |
user | String | The database user account that has copy permissions to the target database table. This property is always needed for the SQL copy method. It is optionally required for use with vwload. When provided with vwload, it allows the DBA to impersonate the database user for the purpose of loading data into the target table. |
vectorSize | int | The size in units of rows to use for buffering input data before constructing Vector blocks. The available native memory impacts the value of this parameter. For tables with wide columns (more than 1000), this parameter should be adjusted down accordingly to avoid OutofMemoryExceptions. Default: 1024. This property is only valid when the load method is set to DIRECT. |
Ports
The LoadActianVector operator provides a single input port.
ReadActianVector Operator
The ReadActianVector operator is used to read data in bulk from an Actian Vector database. The operator provides a single output port containing the data read from Vector. This operator allows reading in parallel from Vector since by default when reading from Vector with JDBC parallelism is not supported. The operator utilizes the direct read capability to stream the data from the Vector engine directly to the output port of the operator.
This reader has certain limitations:
• The order of the data received from the table cannot be guaranteed even if the original table is sorted.
• Reading UTF-16 data from string columns is currently not supported
• A DataFlow workflow can read from one version of Vector at a time. When executing workflows that read from different versions of Vector, the KNIME UI should be restarted.
Code Examples
The following code example demonstrates using the ReadActianVector operator using the Java API. The user and password provided are required to be the DBA account.
Using the ReadActianVector operator in Java
// Create the operator within an application and set the properties.
ReadActianVector reader = app.add(new ReadActianVector());
reader.setUser("user");
reader.setPassword("password");
reader.setHost("database-server");
reader.setInstance("VW");
reader.setDatabase("tpch");
reader.setTable("lineitem");
reader.setVectorSize();
reader.setMaxErrors();
reader.setIncludedColumns();
Using the ReadActianVector Operator in RushScript
// Read data from Vector
dr.readActianVector(
data,
{
user:'user',
password:'password',
host:'database-server',
instance:'VW',
database:'tpch',
table:'lineitem'
});
Properties
The ReadActianVector operator supports the following properties.
Ports
The ReadActianVector operator provides a single input port.