I/O Nodes
The following Read and Write nodes are available:
Read Nodes
ARFF Reader
Avro Reader
Database Reader
Delimited Text Reader
HBase Reader
Log Reader
MDF Reader
ORC File Reader
Parquet File Reader
PMML Reader
Staging Importer
Write Nodes
Actian Vector Writer
ARFF Writer
Avro Writer
Database Delete
Database Update
Database Writer
Delimited Text Writer
HBase Writer
Load Actian Matrix
Matrix Sink
ORC File Writer
PMML Writer
Staging Exporter
Force Staging
Log Rows
Parse Text Fields
Read Nodes
ARFF Reader
ARFF Reader reads files in the Attribute-Relation File Format (ARFF).
Avro Reader
Avro Reader reads files in Apache Avro format.
Database Reader
Database Reader reads from a relational database table, storing the results in a staging dataset.
Delimited Text Reader
Delimited Text Reader parses a delimited text file.
HBase Reader
HBase Reader reads from an Apache HBase table.
Log Reader
Log Reader reads a log file.
MDF Reader
MDF Reader reads an MDF file.
ORC File Reader
ORC File Reader reads records from ORC (Optimized Row Columnar) format files.
PMML Reader
PMML Reader reads a PMML model from a file.
Staging Importer
Staging Importer imports a DataFlow staging dataset.
ARFF Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadARFF Operator to Read Sparse Data.
The ARFF Reader node supports the Attribute-Relation File Format (ARFF). See the ARFF Weka page for more information. ARFF supports both sparse and dense data formats. The format is detected automatically. ARFF contains a metadata section that defines the schema of the ARFF file. Hence there are very few options needed to configure the reader.
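For reference, a minimal ARFF file is sketched below (an illustrative example, not taken from any product sample). The two data rows are equivalent; the sparse row lists only non-zero values by attribute index, and omitted attributes are treated as zero.
@relation example
@attribute age numeric
@attribute income numeric
@attribute purchased {yes,no}
@data
% dense row
32,0,no
% equivalent sparse row (attribute index followed by value; omitted attributes are zero)
{0 32, 2 no}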
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Wildcards may also be specified, depending on the file system, also indicating a multifile read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
sourcePath – a string naming the original source file from which the record originated.
splitOffset – a long providing the starting byte offset of the parse split in the file.
recordOffset – a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note:  These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
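A minimal sketch of how the ordering can be reconstructed downstream, assuming the three fields above are carried into a Java program (the SourceInfo record and the class here are hypothetical): sort by source path, then split offset, then record offset.
import java.util.Comparator;
import java.util.List;

class SourceOrder {
    // Hypothetical holder for the three provenance fields named above.
    record SourceInfo(String sourcePath, long splitOffset, long recordOffset) {}

    // Sorting by path, then split offset, then record offset restores the
    // original order of records within each source file.
    static final Comparator<SourceInfo> ORIGINAL_ORDER =
            Comparator.comparing(SourceInfo::sourcePath)
                      .thenComparingLong(SourceInfo::splitOffset)
                      .thenComparingLong(SourceInfo::recordOffset);

    static void restoreOrder(List<SourceInfo> records) {
        records.sort(ORIGINAL_ORDER);
    }
}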
Settings
Relation name
(Read-only) Displays the @relation attribute of the ARFF metadata.
Value delimiter
Specifies the character used to delimit field values when they contain characters with special meaning to ARFF. ARFF defines only two options for delimiting values: single quotes (') and double quotes (").
Data format
Specifies how records should be represented: densely or sparsely.
Comments
(Read-only) Shows comments found in the ARFF metadata.
Edit Schema
Selecting the edit schema button will bring up the configure schema dialog window. This dialog can also be accessed through the preview's context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Buffers (Advanced)
Read buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Decode buffer
Specifies the size (in KB) of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Ports
Output Ports
0 - DataFlow dataset
Avro Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadAvro Operator to Read Apache Avro Data.
The Avro reader supports the Apache Avro format. Avro files contain metadata defining the schema of the file. Therefore, very few options are needed to configure the reader.
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Wildcards may also be specified, depending on the file system, also indicating a multifile read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
sourcePath - a string naming the original source file from which the record originated.
splitOffset - a long providing the starting byte offset of the parse split in the file.
recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note:  These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Compression algorithm
(Read-only) Displays the compression algorithm used when writing the file.
Edit Schema
Selecting the edit schema button will bring up the configure schema dialog window. This dialog can also be accessed through the preview context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Splits
Select Do not split source files to indicate that parallelism should only be applied on a per-file basis. Typically, reading may be parallelized by using subsections (or splits) of a file. Select Use default split sizes to use the default split size (1024 KB). If default split sizes are not used, then specify the split size in the Target split size field.
Ports
Output Ports
0 - DataFlow dataset
Database Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadFromJDBC Operator to Read Databases.
Reads from a relational database table, storing the results in a staging dataset.
Click Fetch Metadata to get the information about all the visible tables. This will query the database to get all the schema information that is visible. It is useful while constructing the query.
Note:  This operation may take longer on very large databases.
Configuration
Configuration consists of the following steps:
1. Select a preconfigured JDBC connection to use. To configure the connection, see Managing JDBC Connections.
2. (Optional) Select or enter a fetch size.
Note:  If you are using MySQL, then this setting is ignored, and the proper settings for efficient streaming are used implicitly.
3. Enter an appropriate query. Typical queries are of the form select * from tableName. More powerful queries are also supported to the extent that support is provided by the JDBC driver and the database to which you are connecting.
4. Click Refresh. This serves to validate your query and connection settings and also populates the schema metadata. You should refresh if the query or underlying table schema ever change.
5. Configure the types of any string columns that should be treated as enumerated types.
Dialog Options
connection
Select a preconfigured JDBC connection to use. To configure the connection, see Managing JDBC Connections.
query
Specifies the SQL query to execute. The results of the query become the contents of the dataset.
parameter query
Specifies a query to execute that provides parameters for the parametrized data query.
fetch size
Specifies the fetch (or buffer) size to use, that is, the number of rows the database driver fetches from the server at a time. A plain JDBC sketch of this setting appears after the dialog options below.
Note:  For MySQL, this setting has no effect and is overridden by the proper setting for efficient streaming (Integer.MIN_VALUE).
schema
Configures the schema. This is automatically discovered when you click the Refresh button. In addition, you may choose to treat any of the string fields as enumerated fields by altering their type. In some cases when utilizing stored procedures in the query, the refresh button may not be able to discover the complete schema. In this situation the Edit Schema button will be enabled and can be used to manually modify the schema.
configure schema during execute
Selecting this option causes the node to determine the output type at the time of execution. It should be used only when the correct schema cannot be determined by refreshing, generally when calling stored procedures that dynamically generate the output type during execution. In that case the correct output type cannot be discovered without executing the full query during composition, which can cause a loss of performance, so use this option with caution.
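The following is a hedged, plain-JDBC sketch of what the fetch size controls; it is not the node's implementation, and the connection details and query are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://dbhost:3306/sales";   // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // Number of rows the driver pulls from the server per round trip.
            stmt.setFetchSize(1000);
            // For MySQL, row-by-row streaming is requested with Integer.MIN_VALUE,
            // which is the setting the node applies implicitly.
            // stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("select * from tableName")) {
                while (rs.next()) {
                    // process each row
                }
            }
        }
    }
}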
Ports
Input Ports
0 - Query parameters (optional)
Output Ports
0 - DataFlow dataset
Delimited Text Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadDelimitedText Operator to Read Delimited Text.
This node reads a text file of delimited records. Records are separated by new lines, represented either Windows-style (CRLF) or UNIX-style (LF). By default, the style is automatically detected.
Delimited text supports the following separators used to identify field boundaries:
A field separator, found between individual fields
Field delimiters, marking the beginning or end of a field value
The field separator cannot be empty. The field delimiter can be empty. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one.
Fields containing significant characters, such as white space and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the field delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
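A minimal sketch of this quoting rule in Java, as an illustration of the convention described above rather than the node's own code:
class DelimitSketch {
    // Wrap a value in the field delimiter and double any embedded delimiters.
    static String delimit(String value, char delimiter) {
        String d = String.valueOf(delimiter);
        return d + value.replace(d, d + d) + d;
    }

    public static void main(String[] args) {
        // Prints "ab""c" for the value ab"c, matching the example above.
        System.out.println(delimit("ab\"c", '"'));
    }
}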
Delimited files may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the reader will skip the header row; otherwise the first row is treated as a record and will appear in the output. Additionally, when discovering the schema of a delimited file with a header row, the field names will be taken from the header.
Data from the input file may be browsed within the configuration dialog. Once the field separator and other options are set, the fields within the input data should be parsed correctly and displayed in the data table within the configuration dialog.
An attempt will be made to identify the name and data type of each field. However, these may be modified by right-clicking the field's column header and using the context menu. Select the configure schema option to modify the names and data types of fields.
The reader can tolerate some degree of error in the input file. How it reacts to errors is configurable. Typically, one of five actions can be taken when an error is discovered.
IGNORE - Error is ignored; the field in question is assigned a value of NULL.
DISCARD - Entire record is discarded.
WARN - Error is ignored and a warning logged; the field in question is assigned a value of NULL.
WARN AND DISCARD - Entire record is discarded and a warning logged.
ERROR - Read fails.
Additionally, a limit can be placed on warnings, causing the read to fail if a certain threshold is reached.
When setting a field’s data type to enumerated, you have the option to discover the values of the enumeration from the input data. Click Start Discovery to begin the discovery process. The input data will be read, and each unique value of the selected field will become a category within the enumeration. You can either wait for discovery to finish reading the whole file (or files) or click Stop Discovery to end the operation early. You should only stop the discovery process if you know that all of the categorical values have been discovered.
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Wildcards may also be specified, depending on the file system, also indicating a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
sourcePath - a string naming the original source file from which the record originated.
splitOffset - a long providing the starting byte offset of the parse split in the file.
recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering. Note that these fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Record separator
Specifies the record separator, marking the end of a record. Common separators are selectable, including:
automatically using the system default
UNIX-style (LF)
Windows-style (CRLF)
Other non-empty sequences may also be entered as the record separator.
Field separator
Specifies the field separator that separates the individual fields. Common separators are selectable from the list, although any non-empty sequence may also be entered as the field separator.
Field delimiters
Specifies the field delimiters, marking the beginning or end of a field value. These can either be the same (as is usually the case) or different. An empty sequence can only be used if both delimiters are the same, indicating no field delimiters.
Line comment
Specifies the line comment marker, indicating the line is a comment, not a record. Comments are only detected at the start of a record, after a record separator or at the beginning of the file.
Header row
Indicates whether the input contains a header row. If multiple files are being read, it is expected that either all files or no files have a header row.
Edit Schema
After you have successfully connected to a data file, clicking Edit Schema opens the Configure Schema dialog. This dialog can also be accessed through the preview’s context menu. This window lets you view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to reuse elsewhere and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Errors (Advanced)
On missing field
Indicates the behavior of the read when an expected field is not found.
On field error
Indicates the behavior of the read when a field value cannot be parsed to the field's data type.
On extra field
Indicates the behavior of the read when an unexpected extra field is found.
Maximum field length
Specifies the maximum length (in characters) permitted for a field before it is considered a field error.
Fail if record warnings exceed threshold
Specifies a hard limit on the number of records that generate warnings before terminating the read. Only errors that are configured to raise a warning are counted; IGNORE and DISCARD actions do not count against this limit. Multiple warnings for a record are only counted as one record warning.
Buffers (Advanced)
Read buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Decode buffer
Specifies the size (in KB) of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Schema Discovery (Advanced)
Number of characters to analyze
The size of the initial portion of the file to analyze for schema discovery purposes.
Null indicator
A string used to represent NULL valued fields in the discovered schema.
Date format
The format to use for discovered date valued fields. This may either be the default ISO-8601 date (yyyy-MM-dd format) used by DataFlow or a user-specified format.
Timestamp format
The format to use for discovered timestamp-valued fields. This may either be the default ISO-8601 time stamp (yyyy-MM-dd'T'hh:mm:ss.SSSZ format) used by DataFlow or a user-specified format.
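The default patterns quoted above use Java SimpleDateFormat syntax. Assuming a user-specified format follows the same convention (an assumption, not stated on this page), a custom pattern could be exercised like this:
import java.text.SimpleDateFormat;
import java.util.Date;

class DateFormatSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical user-specified pattern for dates such as 12/31/2099.
        SimpleDateFormat custom = new SimpleDateFormat("MM/dd/yyyy");
        Date parsed = custom.parse("12/31/2099");
        System.out.println(parsed);
    }
}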
Metadata (Advanced)
Use available metadata
Specifies whether discovered metadata about the ordering and distribution of the data should be used.
Ports
Output Ports
0 - DataFlow dataset
HBase Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadHBase Operator to Read from HBase.
This node allows reading from Apache HBase.
Cluster
The first step in configuring the operator is to specify an HBase Cluster and mapping method. You may select any cluster that you have configured in the HBase Clusters preferences page.
After you have specified a cluster and mapping method, click Connect to connect to the HBase cluster. Proceed to either the field mapping or the HCatalog table selection.
Field Mapping
After a connection to the cluster is established, a list of available tables will be loaded. Select the table you want to read and click Get Schema. This will load available schema information, so you can map HBase qualifiers to output fields.
If the table was previously written by the DataFlow HBase Writer, the reader will be able to retrieve schema information about that table, including qualifiers and their types. This information will be automatically loaded into the schema mapping list, where you can assign an output field to each qualifier. If there are any additional qualifiers you have added to the table outside of DataFlow, you may manually add them using the Add Qualifier button.
If the table has not been written by the DataFlow HBase Writer, the only information available will be the table’s families. Select a family and click Add Qualifier to add a new qualifier to read. After a qualifier is added, you can specify an output field name. These fields will have a binary type, which may be converted downstream using another operator.
HCatalog
After a connection to the cluster is established, a list of available HCatalog databases will be loaded. Select the database, and a list of tables in that database is loaded. Select the HCatalog table from which to read.
Time Range
You may limit the records returned to a certain timestamp range by selecting the Limit by Time Range option. When this option is selected, you may enter a start time and an end time in the corresponding fields.
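At the HBase client level, limiting by time range corresponds roughly to a time-range scan. The sketch below is an illustration of the concept with placeholder timestamps, not the node's implementation.
import org.apache.hadoop.hbase.client.Scan;

class TimeRangeSketch {
    static Scan timeRangeScan(long startMillis, long endMillis) throws java.io.IOException {
        // Only cells with timestamps in [startMillis, endMillis) are returned.
        Scan scan = new Scan();
        scan.setTimeRange(startMillis, endMillis);
        return scan;
    }
}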
Dialog Options
HBase Master
HBase Cluster
Specifies the HBase cluster to which to connect. These are configured in the HBase Clusters preferences page.
Field Mapping
Select this option if you wish to specify a field mapping.
HCatalog
Select this option if you wish to read from a table in HCatalog.
Connect
Connects to the specified HBase master. This connection must be established before further configuration options can be set.
Cancel
Cancels the connection attempt in progress.
Disconnect
Disconnects from the HBase cluster in order to select a new one.
Table
Table
Specifies the table to read.
Get Schema
Queries HBase for schema information about the selected table. This schema information must be retrieved before further configuration options can be set.
If the table was previously written by DataFlow, a full schema, with families, qualifiers, subfields, and types will be available.
If the table was not written by DataFlow, only the families will be available. You will need to specify the qualifiers to be read manually, which will be read as binary types.
Cancel
Cancels the schema retrieval attempt in progress.
Mapping Table
Source HBase Cell
Specifies the name of the family, qualifier, or subfield in HBase.
Type
Specifies the type of the cell.
Include
Indicates whether the corresponding cell should be read by the operator.
Target Field Name
Specifies the name of the output field, which will contain data from the corresponding cell.
(Row ID)
Specifies the output field which will contain the HBase row key.
(Timestamp)
If Include is selected for this field, the specified output field will contain the HBase timestamp.
Add Qualifier
Adds a qualifier to read to the currently selected family. A dialog will prompt you for the name of the qualifier.
Rename
Renames the selected family or qualifier. A dialog will prompt you for the new name.
Remove
Removes the selected family or qualifier.
HCatalog
Database
Specifies the HCatalog database from which to read the table schema.
Table
Specifies the HCatalog table from which to read the table schema.
Limit by Time Range
Limit by Time Range
If selected, only rows with timestamps between the start and end time will be read.
Start Time
Specifies the beginning of the time range to read, such as Jan 1, 1970 12:00:00 AM.
End Time
Specifies the end of the time range to read, such as Dec 31, 2099 11:59:59 PM.
Ports
Output Ports
0 - DataFlow dataset
Log Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadLog Operator to Read Log Data.
The Log Reader node reads log data files. Unless otherwise specified, the various log types accept a format specifier string in the same format as would be accepted by the particular log producer. The output is determined by the format and the logs being read.
The minimum configuration required is setting the log type property and the source, although some formats may additionally require the pattern property be set to produce meaningful output. The node will by default attempt to determine the newline character used by the log files, however this can be specified if known in advance.
Log entries that cannot be parsed will not be present in the output, while fields that cannot be parsed will be null valued.
Dialog Options
Source
Location
Specifies the source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Wildcards may also be specified, depending on the filesystem, also indicating a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
sourcePath - a string naming the original source file from which the record originated.
splitOffset - a long providing the starting byte offset of the parse split in the file.
recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering. Note that these fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Log Type
Determines the log type of the source. For more information about the available log types, see Reading Log Data.
Log Pattern
Specifies the formatting pattern that will be used when parsing the log. Valid patterns are determined by the type of the log. See documentation on specific log types for more information.
Newline
Defines the newline character used in the source. If set to automatic the reader will attempt to automatically infer the newline characters used.
Discovery
If this is enabled, format discovery will be attempted if the log type supports it. It may allow discovery of the schema or other metadata. See documentation on specific log types for more information.
Edit Schema
Clicking Edit Schema will bring up the configure schema dialog window. This dialog can also be accessed through the preview's context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Errors (Advanced)
On missing field
Indicates the behavior of the read when an expected field is not found.
On field error
Indicates the behavior of the read when a field value cannot be parsed to the field's data type.
On extra field
Indicates the behavior of the read when an unexpected extra field is found.
Maximum field length
Specifies the maximum length (in characters) permitted for a field before it is considered a field error.
Fail if record warnings exceed threshold
Specifies a hard limit on the number of records which generate warnings before terminating the read. Only errors that are configured to raise a warning are counted; IGNORE and DISCARD actions do not count against this limit. Multiple warnings for a record are only counted as one record warning.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Decode buffer
Specifies the size in kilobytes of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Ports
Output Ports
0 - DataFlow dataset
MDF Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Read MDF Operator to Read MDF Files.
The MDF File Reader node enables reading records from MDF format files. It can read data previously written using MDF format. Currently only MDF version 4.0 or greater is supported. For more information, see https://www.asam.net/standards/detail/mdf/wiki. This node will read the selected records with the specified ID from a single channel within the file.
Note:  The MDF Reader node currently does not support virtual channels, scale conversions, or compressed data blocks. These features will be added in a future release.
Dialog Options
Input file location
The source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems such as HDFS. You may use the Browse… button to browse the local and any configured remote file systems for a location.
Data Channel
The data channel or data group that will be scanned for the specified record ID.
Record ID
The channel group within the selected data channel that will be read. The channel group is referenced by its record ID, or optionally by 0 if it is the only channel group.
Apply Conversions
Specifies whether to apply any conversion rules supplied with the raw data.
Ports
Output Ports
0 - DataFlow dataset
ORC File Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadORC Operator to Read Apache ORC Files.
A node for reading records from ORC format files.
Note:  For Hive-generated ORC files, the column names are stored as _col0, _col1, ... by default. The same column names are displayed using the ORC file reader operator.
Dialog Options
Input file or directory
The path to a file or a directory to read. If a directory is selected all files within the directory will be read as ORC files of an equivalent schema.
Port
Output Port
0 - DataFlow dataset
Parquet File Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadParquet Operator to Read Apache Parquet Files.
This node enables the reading of records from Parquet format files.
Dialog Options
Input file or directory
Specifies the path to a file or a directory to read. If a directory is selected, then all the files within the directory are read as Parquet files of an equivalent schema.
Port
Output Port
0 - DataFlow dataset
PMML Reader
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadPMML Operator to Read PMML.
Reads a Predictive Model Markup Language (PMML) model from a file. If you are using KNIME version 2.10 or later, then PMML version 4.2.x is supported. If you are using KNIME 2.9.x, then PMML version 4.0 or 4.1 is supported. For the PMML specification, visit the Data Mining Group website.
Note:  You must use the DataFlow PMML reader when reading a model intended for DataFlow analytics. The standard KNIME PMML reader is incompatible with DataFlow analytics. This is because there are additional PMML elements used by DataFlow analytics that are not supported by the standard KNIME PMML reader. On the other hand, PMML produced by DataFlow analytics can be written using the standard KNIME PMML writer.
Ports
Output Ports
0 - DataFlow PMML Model
Staging Importer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReadStagingDataset Operator to Read Staging Datasets.
Imports a DataFlow staging dataset. This allows you to exchange datasets between KNIME workflows. In addition, the input format is the standard produced by the DataFlow library operator WriteStagingDataset, so it can easily integrate with other applications built on the DataFlow library.
Dialog Options
Source
Location
Specifies the source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Wildcards may also be specified, depending on the file system, also indicating a multifile read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this check box should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
sourcePath - a string naming the original source file from which the record originated.
splitOffset - a long providing the starting byte offset of the parse split in the file.
recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note:  These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Staging format
Specifies a read-only value showing the format of the input dataset.
Records per block
Specifies a read-only value showing the number of records stored per format block in the input dataset.
Edit Schema
Clicking Edit Schema opens the configure schema dialog window. This dialog can also be opened through the preview context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Ports
Output Ports
0 - DataFlow dataset
Write Nodes
Actian Vector Writer
Actian Vector Writer writes data in Actian Vector format.
ARFF Writer
ARFF Writer writes data in Attribute-Relation File Format (ARFF).
Avro Writer
Avro Writer writes data in Apache Avro format.
Database Delete
Database Delete deletes a dataset from a relational database table.
Database Update
Database Update updates a relational database table with a dataset.
Database Writer
Database Writer writes a dataset to a relational database table.
Delimited Text Writer
Delimited Text Writer writes a delimited text file.
HBase Writer
HBase Writer writes data to an Apache HBase table.
Load Actian Matrix
Load Actian Matrix loads data into an Actian Matrix database instance.
Matrix Sink
Matrix Sink bulk loads data into an Actian Matrix database instance in workflows executed from Matrix SQL.
ORC File Writer
ORC File Writer writes records in ORC format files.
PMML Writer
PMML Writer writes a PMML model to a file.
Staging Exporter
Staging Exporter exports a DataFlow staging dataset.
Actian Vector Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LoadActianVector Operator to Load Actian Vector.
The Load Actian Vector node enables loading bulk data to an Actian Vector database instance. The loader supports the following methods for loading the data:
The vwload utility is limited to executing on the Actian Vector server instance.
The SQL copy method requires the Actian Vector client to be installed and allows running remotely.
The DIRECT load utility does not require any additional client to be installed and can load remotely. It is the fastest method for loading data into Vector.
For more information about methods for loading the data, see Using the LoadActianVector Operator to Load Actian Vector.
Connections to Actian Vector must be set up using the File > Preferences > Actian Vector Connections page in DataFlow. After a connection is configured, you may load the data by selecting it from the connection list in the configuration window. To configure a Vector connection, see Actian Vector Connections.
By default, the loader maps the source fields to the target database fields by position. If there are fewer input fields than database fields, the unmapped database fields are loaded with null values. If any of the unmapped fields cannot have a null value, then a loader error occurs. To manually map the source fields to the target table fields, click Map Fields to open a mapper window. You can click and drag source fields to a target database column or click any of the map options to map by name or position. You can also select the source field by using the list available on each target field.
When the loader is executed, the input data is transformed into loader format in a temporary data space. After the transformation is complete, the staged loader files are loaded into the database. When the load finishes, the temporary files are removed and the output of the load command is logged at the INFO level. Log entries can be viewed in the console or by browsing the log file.
Dialog Options
Connection
Selects an already configured connection to an Actian Vector database.
To configure an Actian Vector connection, see Actian Vector Connections.
Table Name
After a connection to an Actian Vector database is established, you can select a target table. Alternatively, click Create Table to execute a CREATE TABLE command in SQL.
Create Table
Opens a window that allows the execution of SQL commands using the configured connection.
User Name
Enter the user name of the Actian Vector database account with access to write the target table.
Note:  This is optional for the vwload/DIRECT load utility.
Password
Enter the password for the database user account.
Note:  This is optional for the vwload/DIRECT load utility.
Use Direct load
Select this option to use the DIRECT load utility. This option can be used from any machine.
Use vwload
Select this option to use the vwload utility. This option is available only on Actian Vector server instances.
Use SQL copy
Select this option to use the SQL copy method. This option requires Actian Vector client installation.
Rollback on
Enables rollback on errors.
Note:  This option is not supported for the DIRECT load utility.
Rollback off
Disables rollback on error.
Note:  This option is not supported for the DIRECT load utility.
Pre-clean Data
Enables data verification and cleaning. If this is enabled, additional operations and checks may be performed on the data before loading it into Vector to ensure that it meets any table constraints or type limits. Invalid values will be loaded as nulls, if the table allows it, instead of producing errors such as the various truncation errors.
Truncation Errors
Select String to throw an error if a String field would be truncated to fit into a table column.
Select Decimal to throw an error if a Decimal field would lose scale to fit into a table column.
Max Errors
Indicates the maximum number of errors allowed before rollback. Default: zero.
Initialization SQL
Provides the SQL statement to execute before processing any records.
Finalization SQL
Provides the SQL statement to execute after processing all records.
Field Mapping
Click Map Fields to display a window that allows mapping from the source data fields to the target table columns. The following buttons are available between the source and target tables:
Map by name: Maps the source fields to target columns where the names match.
Map by position: Maps the source fields to target columns by position. The first input field is mapped to the first target column, and so on.
Clear: Clears any mapping already completed.
To map a specific source field to a specific target column, click and drag the source field name onto the target column (the dataset field). If a field is not included in the mapping, then that field is not included in the data written to the database.
You also may customize the SQL type if a more specific type is required other than the default. This is the type used when configuring tables that do not exist or that are dropped and recreated.
Specific Load Options
Direct Insert Mode
Select the insert mode statement that applies for insert and merge operations. Default: DEFAULT
Note:  This option can be used only with DIRECT load.
Buffer Size
Specifies the number of records to store in memory before sending them to Actian Vector. Typically, if the buffer size is larger, performance is enhanced.
However, a larger buffer size requires more native memory. For a table with wide varchar columns (>10000), use a smaller value (in the 1000s). Default: 50000.
Note:  This option can be used only with DIRECT load.
SSH User Name
Specifies the operating system user ID used to make the SSH connection to the Vector leader node. If not provided in this node configuration, the user entered in File > Preferences > Actian > Actian Vector Connections is used, and accordingly must exist as part of an operating system account.
Note:  This option can be used only with vwload.
SSH Password
Specifies the password for the SSH user ID. If not provided in this node configuration, the password entered in File > Preferences > Actian > Actian Vector Connections is used; it must be the password of the corresponding operating system account.
Note:  This option can be used only with vwload.
Charset
Specifies the character set used when staging the data that will be loaded.
Note:  This option can be used only with vwload.
Temp Directory
Specifies a directory on the local machine or within HDFS on the machine with the Vector instance that will be used to store intermediate files for data loading. The value must be a valid URI pointing to an existing directory on that system.
Note:  This option can be used only with vwload or SQL copy.
This node can be used to load Actian Vector database versions 4.2 and later.
Ports
Input Ports
0 - Source data
ARFF Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteARFF Operator to Write Sparse Data.
This writer supports the Attribute-Relation File Format or ARFF. ARFF files are convenient in that they embed metadata within the data stream. The metadata defines the schema of the data including the field names, types, and positions.
When writing an ARFF file you can specify the data format: either sparse or dense. Sparse mode is very compact when used with sparse data. However, when used with dense data, the overhead can waste file space. Dense mode is best used when the data is not sparse. Dense format closely resembles comma-separated values (CSV) format.
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location. In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
relation name
(Read-only) Displays the @relation attribute of the ARFF metadata.
value delimiter
Specifies the character used to delimit field values when they contain characters with special meaning to ARFF. ARFF defines only two options for delimiting values: single quotes (') and double quotes (").
data format
Specifies how records should be represented: densely or sparsely.
comments
(Read-only) Shows comments found in the ARFF metadata.
Encoding (Advanced)
character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Buffers (Advanced)
write buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
encode buffer
Specifies the size (in KB) of the buffer to use for encoding character data into bytes.
Ports
Input Ports
0 - DataFlow dataset
Avro Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteAvro Operator to Write Apache Avro Data.
This writer supports the Apache Avro format. The Avro format is convenient because it embeds metadata within the data stream. The metadata defines the schema of the data, including the field names, types, and positions. Additionally, it is designed for parallelized writing and performs a level of data compression on its encoded data.
Dialog Options
Target
location
The target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
compression algorithm
The compression algorithm to use when writing data:
none indicates no compression
deflate indicates DEFLATE/gzip compression
bzip2 indicates bzip2 compression
snappy indicates Snappy compression
Buffers (Advanced)
write buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Ports
Input Ports
0 - DataFlow dataset
Database Delete
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeleteFromJDBC Operator to Write Database Deletes.
This node deletes data rows in the database based on the selected columns from the input. It allows setting the fields used as keys to find matching rows for the delete. When configuring the schema used to map the input fields to the database, any field that is mapped is used as a key field. It is not required to map every field if only a subset of the input fields is needed; the key fields are used to find the matching rows to delete in the database. Additionally, the node produces an output table containing the data used for the delete and an additional column with the number of database rows affected by each input row's delete statement.
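As a hedged illustration, the statement issued per input row has roughly the shape sketched below; the table and key field names are hypothetical, and the affected-row count is what appears in the extra output column.
import java.sql.Connection;
import java.sql.PreparedStatement;

class DeleteSketch {
    // One parameterized delete per input row; the key fields come from the mapping.
    static int deleteRow(Connection conn, long id, String region) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM target_table WHERE id = ? AND region = ?")) {
            ps.setLong(1, id);
            ps.setString(2, region);
            return ps.executeUpdate();   // number of database rows affected
        }
    }
}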
Configuration
Configuration consists of the following steps:
1. Select a preconfigured JDBC connection. To configure a connection, see Managing JDBC Connections.
2. Select a valid table name.
3. Configure the source to target mapping as needed.
Dialog Options
Connection
Selects a preconfigured JDBC connection. To configure a new connection, see Managing JDBC Connections.
Table Name
Specifies the name of the table to delete rows from.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing deletes.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source-to-target field mappings.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - The data sent to the database with the returned status codes
Database Update
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the UpdateInJDBC Operator to Write Database Updates.
This node will update data rows in the database with the data values in the selected columns from the input. This node allows setting the fields that will be updated in addition to the fields that will be used as keys to find matching rows for the update.
When configuring the schema that will be used to map the input fields to the database, any field that is mapped may be used as a key field or an update field. It is not required to map every field if only a subset of the input fields is required; however, at least one key field and one update field must be specified when configuring the mapping. These key fields will be used to find matching rows that will then have the specified update fields updated.
Additionally the node will produce an output table containing the data used for the update and an additional column with the number of rows affected by each row used in an update statement.
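A similar hedged sketch for updates, with hypothetical column names (update fields appear in the SET clause, key fields in the WHERE clause):
import java.sql.Connection;
import java.sql.PreparedStatement;

class UpdateSketch {
    // One parameterized update per input row; the returned count feeds the extra output column.
    static int updateRow(Connection conn, double amount, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE target_table SET amount = ? WHERE id = ?")) {
            ps.setDouble(1, amount);
            ps.setLong(2, id);
            return ps.executeUpdate();
        }
    }
}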
Configuration
Configuration consists of the following steps:
1. Select a preconfigured JDBC connection. To configure a connection, see Managing JDBC Connections.
2. Select a valid table name.
3. Configure the source to target mapping as needed.
Dialog Options
Connection
Selects a preconfigured JDBC connection. To configure a new connection, see Managing JDBC Connections.
Table Name
Specifies the name of the table to update.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing updates.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source to target field mappings and allows setting the key and update fields.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - The data sent to the database with the returned status codes
Database Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteToJDBC Operator to Write to Databases.
Writes a dataset to a relational database table.
Configuration
Configuration consists of the following steps:
1. Select a preconfigured JDBC connection. To configure a connection, see Managing JDBC Connections.
2. Select a valid table name or use the Create Table button to create a table that you can select.
3. Configure the schema as needed, ensuring the correct datatypes are being used for the database.
By default, the loader maps source fields to the target database fields by position. If too few input fields are present, the unmapped database fields will be loaded with null values. If any of the unmapped fields are nonnullable, a loader error will occur.
To manually map the source fields to the target table fields, click the Map Fields button. A mapper dialog will appear. The source fields can be dragged and dropped onto a target database column. Alternatively one of the map buttons can be used to map by name or position. The source field can also be selected using the dropdown list available on each target field. The SQL type can also be customized if a more specific type than the default is desired.
Dialog Options
Connection
Selects a preconfigured JDBC connection. To configure a new connection, see Managing JDBC Connections.
Table Name
After a connection is established, a target table can be selected. Alternatively, use the Create Table button to execute a create table command in SQL.
Create Table
Opens a dialog that allows the execution of SQL commands using the configured connection.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing updates.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source to target field mappings.
Initialization SQL
Specifies the SQL statement to execute before processing any records.
Finalization SQL
Specifies the SQL statement to execute after processing all records.
Ports
Input Ports
0 - DataFlow dataset
Delimited Text Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteDelimitedText Operator to Write Delimited Text.
This node writes a text file of delimited records. Records are separated by a known character sequence. Typically, new lines are used, either Windows-style (CRLF) or UNIX-style (LF). Unless otherwise specified, the system-appropriate style is used.
Delimited text supports the following separators used to identify field boundaries:
A field separator, found between individual fields
Field delimiters, marking the beginning or end of a field value
The field separator cannot be empty. The field delimiter can be empty. It is not expected that all fields start and end with a delimiter, although if a field starts with a delimiter, it must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the field delimiter, it must be escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
Delimited files may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. These names come from the field names of the input data.
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local filesystems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
record separator
Specifies the record separator, marking the end of a record. Common separators are selectable, including:
automatically using the system default
UNIX-style (LF)
Windows-style (CRLF)
Other non-empty sequences may also be entered as the record separator.
field separator
Specifies the field separator that separates the individual fields. Common separators are selectable from the list, although any non-empty sequence may also be entered as the field separator.
field delimiters
Specifies the field delimiters, marking the beginning or end of a field value. These can either be the same (as is usually the case) or different. An empty sequence can only be used if both delimiters are the same, indicating no field delimiters.
line comment
Specifies the line comment marker, indicating the line is a comment, not a record. Comments are only detected at the start of a record, after a record separator or at the beginning of the file.
header row
Indicates whether the input contains a header row.
Encoding (Advanced)
character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Schema (Advanced)
null indicator
Specifies a string used to represent NULL valued fields in the output.
date format
Specifies the format to use for date-valued fields in the output. This may either be the default ISO-8601 date (yyyy-MM-dd format) used by DataFlow or a user-specified format.
timestamp format
Specifies the format to use for timestamp-valued fields in the output. This may either be the default ISO-8601 time stamp (yyyy-MM-dd'T'hh:mm:ss.SSSZ format) used by DataFlow or a user-specified format.
Buffers (Advanced)
write buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
encode buffer
Specifies the size in kilobytes of the buffer to use for encoding character data into bytes.
Metadata (Advanced)
Save metadata with data
Specifies whether metadata about the ordering and distribution should be saved with the data.
Ports
Input Ports
0 - DataFlow dataset
HBase Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteHBase Operator to Write to HBase.
This node allows writing to Apache HBase.
Cluster
The first step in configuring the node is to specify an HBase Cluster and at least one mapping method. You may select any cluster that you have configured in the HBase Clusters preferences page.
A mapping may be provided by specifying a field mapping by hand, by specifying an HCatalog database and table, or by providing both.
To write to HBase without using HCatalog, provide a field mapping manually.
To create a new HCatalog table, provide both a field mapping and an HCatalog database and table; the schema you provide will be written to HCatalog.
To write to a table already in HCatalog, simply specify the HCatalog database and table.
After you have specified a cluster and mapping method, click Connect to connect to the HBase cluster. Proceed to the field mapping or the HCatalog table selection.
Field Mapping
After a connection to the cluster is established, a list of available tables will be loaded. Select the table you would like to write, or enter the name of a new table, and click Get Schema. This will load available schema information so you can map HBase qualifiers to input fields.
If the table was previously written by the DataFlow HBase Writer, the node will be able to retrieve schema information about that table, including qualifiers and their types. This information will be automatically loaded into the schema mapping list, where you can assign an input field to each qualifier. You may also manually add new qualifiers using the Add Qualifier button.
If the table has not been written by the DataFlow HBase Writer, the only information available will be the table’s families. Select a family and click Add Qualifier to add a new qualifier to write. After a qualifier is added, you can specify an input field name.
If the table does not exist, you will have to manually add the families and qualifiers you want to write. Click Add Family to add a new family. Within each family, click Add Qualifier to add new qualifiers to that family. After you have added families and qualifiers, you can specify input fields for each qualifier.
Note:  When writing to a new table, the table will be created and its schema locked in once you click OK or Apply in the configuration window. Ensure you have added all the families you will need before you click OK or Apply.
In the DataFlow HBase Writer, new qualifiers can be created as either a scalar type or a record type. Scalar types store a single value in a cell, which can be read by any HBase client. Record types allow you to store multiple fields in a single HBase cell. Using record types increases the space and time efficiency, as only a single lookup is needed to read or write an entire set of fields; however, they are stored in a format unique to DataFlow, and the schema for these records remains fixed once they are created.
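For reference, a scalar-type cell is an ordinary HBase cell and can be written or read with the standard HBase client API. The sketch below is illustrative only: the table, row key, family, and qualifier names are hypothetical, and it does not reproduce the DataFlow-specific record-type encoding.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative only: writes one scalar cell with the plain HBase client.
    public class ScalarCellDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("example_table"))) { // hypothetical table
                Put put = new Put(Bytes.toBytes("row-1"));                          // hypothetical row key
                put.addColumn(Bytes.toBytes("cf"),                                  // hypothetical family
                              Bytes.toBytes("price"),                               // hypothetical qualifier
                              Bytes.toBytes("19.99"));                              // value as bytes
                table.put(put);
            }
        }
    }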
To create a record-type qualifier, click Add Qualifier, and when prompted for the type, select Record. After you have added the record-type qualifier to the schema, select it and click Add Field to add subfields to the record. You can then specify input fields for each subfield.
Note:  After you have created a new record-type qualifier, its schema cannot be modified. Ensure you have added all the subfields you will need before you click OK or Apply.
In addition to families and qualifiers, you can also map the row ID and row timestamp to an input field. In most cases, both of these are optional: if the row ID is not specified, a unique ID will be generated for each row; if the row timestamp is not specified, the actual insert time will be used. However, when using HCatalog, you must specify a row ID.
HCatalog
After a connection to the cluster is established, a list of available HCatalog databases is loaded. Select a database, and a list of tables in that database is loaded. Select the HCatalog table you want to write to, or enter the name of a new table to create.
Dialog Options
HBase Master
HBase Cluster
Specifies the HBase cluster to which to connect. These are configured in the HBase Clusters preferences page.
Field Mapping
Select this option if you want to specify a field mapping.
HCatalog
Select this option if you want to write to a table in HCatalog.
Connect
Connects to the specified HBase master. This connection must be established before further configuration options can be set.
Cancel
Cancels the connection attempt in progress.
Disconnect
Disconnects from the HBase cluster to select a new one.
Table
Table
Specifies the table to write.
Get Schema
Queries HBase for schema information about the selected table. This schema information must be retrieved before further configuration options can be set.
If the table was previously written by DataFlow, a full schema, with families, qualifiers, subfields, and types will be available.
If the table has not been written by DataFlow, only the families will be available. You will need to specify the qualifiers to be written manually.
If the table does not exist, you will need to specify the families and qualifiers manually.
Cancel
Cancels the schema retrieval attempt in progress.
Mapping Table
Source HBase Cell
Specifies the name of the family, qualifier, or subfield in HBase.
Type
Specifies the type of the cell.
Include
Indicates whether the corresponding cell should be written by the node.
Source Field Name
Specifies the name of the input field which will contain data from the corresponding cell.
(Row ID)
If Include is selected for this field, its value will be used as the HBase row key; otherwise, an automatically generated key will be used.
(Timestamp)
If Include is selected for this field, its value will be used as the HBase timestamp; otherwise, the actual insertion time will be used.
Add Family
Adds a family. A dialog will prompt you for the name of the family. Note that you can only add a family when first creating a new table.
Add Qualifier
Adds a qualifier to the currently selected family. A dialog will prompt you for the name of the qualifier, as well as the type. Scalar types store a single value, while record types store multiple fields in a single cell.
Add Field
Adds a subfield to the currently selected record-type qualifier. A dialog will prompt you for the name of the subfield. Note that you can only add a subfield when first creating a new record-type qualifier.
Rename
Renames the selected family or qualifier. A dialog will prompt you for the new name.
Remove
Removes the selected family or qualifier.
HCatalog
Database
Specifies the HCatalog database from which to read or write the table schema.
Table
Specifies the HCatalog table from which to read or write the table schema.
Ports
Input Ports
0 - DataFlow dataset
Load Actian Matrix
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LoadMatrix Operator to Load Actian Matrix.
The Load Actian Matrix node bulk loads data into a Matrix database instance. The loader directly loads data into Matrix using an ODI connector. Loading supports parallel and distributed execution (on a cluster).
Connections to Actian Matrix must be preconfigured using the Actian Matrix Connections preference page in the Actian preferences section. After a connection is configured, it can be used to load data by selecting it from the connection list in the configuration dialog. To configure a Matrix connection, see Actian Matrix Connections.
By default, the loader maps source fields to the target database fields by position. If too few input fields are present, the unmapped database fields are loaded with null values. If any of the unmapped fields are nonnullable, a loader error occurs.
To manually map the source fields to the target table fields, click the Map Fields button to open a mapper dialog. You can click and drag source fields to a target database column, or click one of the map buttons to map by name or position. The source field can also be selected using the list available on each target field.
When the loader is executed, the input data is transformed into loader format in a temporary data space. After the transformation is complete, the staged loader files are loaded into the database. When the load finishes, the temporary files are removed.
When the load finishes, the output of the load command is logged at the INFO level. Log entries can be viewed in the console or by browsing the log file.
Network communication between DataFlow and the Matrix database is accomplished using TCP/IP sockets. The properties startPortRange and endPortRange can be used to limit the range of TCP/IP ports utilized for this communication. The ports within this range need to be available for two-way communication. In a firewall environment, the ports must be opened within the firewall configuration. The default port range is from 40000 to 41000 (exclusive).
When you run multiple workflows concurrently or run workflows with multiple Load Matrix operators, ensure the port range is large enough to allow at least one open port per slice per operator. Also, ensure Matrix database workload management is configured to allow enough concurrent queries to leave at least one available for each active Load Matrix operator. This availability is configured using the max_concurrent_queries property in ~paraccel/padb/rel/etc/padb_wlm.xml on the Matrix server.
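A quick way to sanity-check the range on the machine running the workflow is to try binding each port locally. This sketch is not part of the product; it uses the documented default range and does not verify firewall rules for two-way communication.

    import java.io.IOException;
    import java.net.ServerSocket;

    // Illustrative only: counts how many ports in the configured range can be bound locally.
    public class PortRangeCheck {
        public static void main(String[] args) {
            int startPortRange = 40000; // documented default start
            int endPortRange = 41000;   // documented default end (exclusive)
            int free = 0;
            for (int port = startPortRange; port < endPortRange; port++) {
                try (ServerSocket ignored = new ServerSocket(port)) {
                    free++;
                } catch (IOException inUse) {
                    // port is already bound or otherwise unavailable
                }
            }
            System.out.println(free + " of " + (endPortRange - startPortRange) + " ports can be bound");
        }
    }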
You may configure a DataFlow service class with specific properties for DataFlow jobs. This allows the loader to check for available resources before launching a job and to fail quickly if the Matrix server has insufficient resources; without this service class, a job may hang when resources are insufficient. For more information about configuring this service class, see Installing and Configuring DataFlow for Matrix On-Demand Integration (ODI).
Data is loaded into Matrix in parallel. A data connection into Matrix is created for each parallel stream within DataFlow. The number of data connections can be limited by setting the max data connections property (see below). Setting this value may cause a redistribution of the data before loading. Consider setting the overall parallelism for the loader job to the desired number of connections to prevent data redistribution.
Note:  Setting the maxConnections property may incur the overhead of data redistribution. Data redistribution occurs to gather data from a higher number of data streams to a lower number of data streams within DataFlow.
Dialog Options
Connection
Selects a preconfigured connection to a Paraccel MPP database. To configure a Paraccel MPP connection, see Actian Matrix Connections.
Table Name
After a connection is established, you can select a target table. Alternatively, use the Create Table button to enter a CREATE TABLE command to execute in SQL.
Create Table
Opens a dialog that allows the execution of SQL commands using the configured connection.
Log Frequency
Specifies the frequency to log loader progress. Defaults to zero (no logging).
Trace Level
Specifies the trace (or debug) level used within the database for the load operation. Defaults to zero (no trace logging). After loading, if trace logging is enabled, query the STL_UDF_TRACE table to view the trace logs.
Initialization SQL
Specifies the SQL statement to execute before processing any records.
Finalization SQL
Specifies the SQL statement to execute after processing all records.
Start of Port Range
Sets the starting port in a range of network ports allocated for DataFlow to Matrix communication.
End of Port Range
Sets the last port in a range of network ports allocated for DataFlow to Matrix communication.
Max Data Connections
Specifies the maximum number of data connections desired for loading data. The default is one connection per DataFlow parallel stream.
Field Mapping
Click Map Fields to display a dialog that allows mapping from the source data fields to the target table columns. Three buttons are provided between the source and target tables:
Map by name: Maps the source fields to target columns where the names match exactly.
Map by position: Maps the source fields to target columns by position, that is, the first input field is mapped to the first target column and so on.
Clear: Clears any mapping already completed.
To map a specific source field to a specific target column, click and drag the source field name onto the target column's row in the target table (specifically, its dataset field cell). If a field is not included in the mapping, it will not be included in the data written to the database.
You also may customize the SQL type if you want a more specific type than the default. This is the type used when configuring tables that do not exist or that will be dropped and recreated.
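The by-name and by-position strategies described above can be summarized with a small sketch. It is illustrative only, not the mapper dialog's implementation, and the field and column names are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: the two automatic mapping strategies offered by the dialog.
    public class FieldMappingDemo {
        // Map by name: pair a source field with a target column only when names match exactly.
        static Map<String, String> mapByName(List<String> sourceFields, List<String> targetColumns) {
            Map<String, String> mapping = new LinkedHashMap<>();
            for (String field : sourceFields) {
                if (targetColumns.contains(field)) {
                    mapping.put(field, field);
                }
            }
            return mapping;
        }

        // Map by position: pair fields and columns in order until either list is exhausted.
        static Map<String, String> mapByPosition(List<String> sourceFields, List<String> targetColumns) {
            Map<String, String> mapping = new LinkedHashMap<>();
            int n = Math.min(sourceFields.size(), targetColumns.size());
            for (int i = 0; i < n; i++) {
                mapping.put(sourceFields.get(i), targetColumns.get(i));
            }
            return mapping;
        }

        public static void main(String[] args) {
            List<String> source = List.of("id", "name", "amount");          // hypothetical input fields
            List<String> target = List.of("id", "customer_name", "amount"); // hypothetical table columns
            System.out.println(mapByName(source, target));     // {id=id, amount=amount}
            System.out.println(mapByPosition(source, target)); // {id=id, name=customer_name, amount=amount}
        }
    }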
Ports
Input Ports
0 - Source data
Matrix Sink
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the MatrixSink Operator to Load Actian Matrix from SQL.
The Matrix Sink node loads large data sets into a Matrix database instance. This node is similar to the Load Actian Matrix node; refer to that node for more information about the process for loading large data sets.
The Matrix Sink node is intended to be embedded in workflows that are executed from Matrix SQL. Workflows containing the Matrix Sink node can be exported directly to a Matrix instance as described in Exporting an Actian DataFlow Workflow. Executing workflows from Matrix SQL requires the invoke_dataflow user-defined function (UDF) to be installed on the Matrix instance. For more information, see Installing and Configuring DataFlow for Matrix On-Demand Integration (ODI).
Note:  Workflows that are executed from Matrix SQL must have only one instance of the Matrix Sink node.
Dialog Options
There is no defined configuration for the Matrix Sink node. All the required settings are provided to the node by the invoke_dataflow UDF before executing the workflow.
Ports
Input Ports
0 - Source data
ORC File Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteORC Operator to Write Apache ORC Files.
This node can write records in Apache Optimized Row Columnar (ORC) format files.
Dialog Options
Target
Location
Specifies the target location into which to write the output as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
Note:  Do not select append mode as the write mode. If selected, an exception within KNIME is thrown when the configuration is saved or applied from the dialog.
Write single output file
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written.
Write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
Block padding
Switches block padding in ORC format on or off.
Buffer size
Specifies the size of the ORC writer output buffer.
Compression algorithm
Specifies the compression format to use within the ORC file.
ORC version
Specifies the version of the ORC file format to use.
Row index stride
Specifies the distance between entries in the ORC row index. This value must be >= 1000. If disabled, no index is created.
Stripe size
Specifies the maximum size of a stripe inside the ORC file.
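The settings above line up with options exposed by the Apache ORC Java writer API. The following sketch shows those options only; the path, schema, and chosen values are hypothetical, and it is not the node's implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    // Illustrative only: the node's settings expressed as Apache ORC writer options.
    public class OrcWriterOptionsDemo {
        public static void main(String[] args) throws Exception {
            TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
            Writer writer = OrcFile.createWriter(
                new Path("/tmp/example.orc"),                 // hypothetical target
                OrcFile.writerOptions(new Configuration())
                    .setSchema(schema)
                    .compress(CompressionKind.ZLIB)           // compression algorithm
                    .stripeSize(64L * 1024 * 1024)            // stripe size, in bytes
                    .rowIndexStride(10000)                    // row index stride (>= 1000)
                    .bufferSize(256 * 1024)                   // writer output buffer, in bytes
                    .blockPadding(true));                     // block padding on or off
            writer.close();  // no rows are written in this sketch
        }
    }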
Port
Input Port
0 - Input data-set
PMML Writer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WritePMML Operator to Write PMML.
Writes a PMML model to a file. If you are using KNIME version 2.10 or later, PMML version 4.2.x is supported. If you are using KNIME 2.9.x, PMML version 4.0 or 4.1 is supported. For the PMML specification, see the Data Mining Group website.
Ports
Input Port
0 - DataFlow PMML Model
Staging Exporter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the WriteStagingDataset Operator to Write Staging Data Sets.
Exports a DataFlow staging dataset. This allows you to exchange datasets between KNIME workflows. In addition, the output format is the standard consumed by the DataFlow library operator ReadStagingDataset, so it can be easily and efficiently consumed by other applications built on the DataFlow library.
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input dataset, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
staging format
Specifies the format to use for storing data. Compact row is most efficient for reading all columns and uses the least memory. Columnar is most efficient for reading a small subset of columns but at a cost of using more memory.
records per block
Specifies the number of records to store per format block. Larger values require more memory to encode blocks. There is generally no reason to alter this value from the default.
Buffers (Advanced)
write buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Ports
Input Ports
0 - DataFlow dataset
Force Staging
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ForceRecordStaging Operator to Explicitly Stage Data.
This node is useful when creating a workflow intended for the DataFlow Executor and staging must be forced to occur. It forces the DataFlow graph to be partitioned into smaller graphs at the points where the node is placed (cut points), which helps prevent queue expansion that would otherwise occur.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - DataFlow dataset
Log Rows
 KNIME  This topic describes a KNIME node. It is useful for testing the behavior of other nodes during workflow development.
Logs information about the input data from a flow. The record type of the flow is logged. A log frequency determines which data rows are logged; a data row is logged by outputting the value of each input field within the row. A format can be provided that specifies the output format for each logged row. The final row count is also logged.
Dialog Options
Format
Specifies the format to use when logging rows. See MessageFormat for more information about the syntax of formats. Two variables are passed to the format: {0} is the row count and {1} is the row contents. The default format is row {0} is {1} (see the sketch after these options).
Log Frequency
Specifies the frequency of rows to log. Setting the frequency to 1 logs every row, a value of 2 logs every other row, and so on. The default frequency of 0 turns off row logging.
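A minimal sketch of how the default pattern behaves with java.text.MessageFormat; the row number and contents are hypothetical.

    import java.text.MessageFormat;

    // Minimal sketch: applies the node's default pattern to a hypothetical row.
    public class RowLogFormatDemo {
        public static void main(String[] args) {
            String pattern = "row {0} is {1}";                       // the default format
            long rowCount = 42;                                      // hypothetical row number
            String rowContents = "[id=7, name=Smith, amount=19.95]"; // hypothetical row contents
            System.out.println(MessageFormat.format(pattern, rowCount, rowContents));
            // prints: row 42 is [id=7, name=Smith, amount=19.95]
        }
    }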
Ports
Input Ports
0 - Input
Parse Text Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ParseTextFields Operator to Parse Text Records.
Specifies a node for parsing records consisting of text fields. These records typically represent data which has been read from a file and broken into individual fields based on some criteria, but with no additional interpretation performed. This node can be used to interpret the text fields as configured data types, making the result suitable for use with other operators.
The provided schema is used to determine the output type. Input fields are parsed according to the format specified for the field of the same name in the output schema. Fields present in the input but not named in the output schema are dropped. Fields named in the output need not exist in the input; their value will be NULL in the output.
Dialog Options
Schema
The schema pane is used to define the structure of the output schema, identifying the field names and types.
Name
Defines the name of the selected field. Field names must be unique.
Type
Defines the data type of the selected field. Additional refinement of the type describing the allowed values and format will be editable in the Field Details, as is appropriate.
Default Null Indicator
Defines the default string value used to indicate a null in text. A number of commonly used options are predefined for selection.
Generate Schema
Replaces the current schema with a default one based on the available input fields.
Load Schema
Reads a schema from a file, replacing the current schema.
Save Schema
Writes the current schema to a file for future use. The schema must currently be in a valid state before it can be saved.
Field Details
The field details pane is used to refine the definition of the currently selected schema field.
Format
Defines the text formatting for the selected field, if allowed for the selected field type. A number of commonly used formats are predefined for selection. Formatting is type-dependent:
Appropriate formats for dates and timestamps are any supported by SimpleDateFormat.
Appropriate formats for numeric types are any supported by NumberFormat.
Strings do not support custom formats, although they do support choosing whether whitespace should be trimmed.
If default formatting for the type should be used, select Use default format. For numeric types, the default is standard Java formatting. For dates and timestamps, the default follows ISO-8601 formatting. A parsing sketch using these pattern classes follows these field details.
Null Indicator
Defines the string value indicating a null in the text for the selected field. A number of commonly used values are predefined for selection.
If the schema default should be used, select Use default null indicator.
Allowed Values
If the selected field type is enumerated, the possible values that the enumeration can take are specified here.
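Because the Format option accepts SimpleDateFormat and NumberFormat patterns, a user-specified format can be tried outside the node with those classes directly. A minimal sketch with hypothetical field values:

    import java.text.NumberFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;

    // Minimal sketch: parses hypothetical text field values with the pattern
    // classes the Format option is based on.
    public class TextFieldFormatDemo {
        public static void main(String[] args) throws ParseException {
            // A custom date pattern instead of the ISO-8601 default.
            SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy");
            Date hireDate = dateFormat.parse("03/15/2019");

            // A locale-specific numeric pattern for grouped values such as "1,234.56".
            NumberFormat numberFormat = NumberFormat.getNumberInstance(Locale.US);
            Number amount = numberFormat.parse("1,234.56");

            System.out.println(hireDate + " / " + amount.doubleValue());
        }
    }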
Ports
Output Ports
0 - Parsed data rows.
1 - Data rows that could not be parsed successfully for all text fields.