I/O Nodes
This section describes the DataFlow input and output nodes available in KNIME. The following Read and Write nodes are available:
Read Nodes
ARFF Reader
ARFF Reader reads files in the Attribute-Relation File Format (ARFF).
Avro Reader
Database Reader
Database Reader reads from a relational database table, storing the results in a staging dataset.
Delimited Text Reader
HBase Reader
JSON Reader
Log Reader
MDF Reader
ORC File Reader
ORC File Reader reads records from Optimized Row Columnar (ORC) format files.
Parquet File Reader
PMML Reader
PMML Reader reads a PMML (Predictive Model Markup Language) model from a file.
Staging Importer
ARFF Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadARFF Operator to Read Sparse Data. The ARFF Reader node supports the Attribute-Relation File Format (ARFF). See the ARFF
Weka page for more information. ARFF supports both sparse and dense data formats. The format is detected automatically. ARFF contains a metadata section that defines the schema of the ARFF file. Hence there are very few options needed to configure the reader.
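For illustration, a small ARFF file in dense format might look like the following (hypothetical data); the lines before @data form the metadata section from which the reader derives the schema:
% weather observations (comment line)
@relation weather
@attribute temperature numeric
@attribute outlook {sunny, overcast, rainy}
@data
27.5, sunny
18.0, rainy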
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath – a string naming the original source file from which the record originated.
• splitOffset – a long providing the starting byte offset of the parse split in the file.
• recordOffset – a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note: These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Relation name
(Read-only) Displays the @relation attribute of the ARFF metadata.
Value delimiter
Specifies the character used to delimit field values when they contain characters with special meaning to ARFF. ARFF defines only two options for delimiting values: single quotes (') and double quotes (").
Data format
Specifies how records should be represented: densely or sparsely.
Comments
(Read-only) Shows comments found in the ARFF metadata.
Edit Schema
Selecting the edit schema button will bring up the configure schema dialog window. This dialog can also be accessed through the preview's context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Buffers (Advanced)
Read buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Decode buffer
Specifies the size (in KB) of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Ports
Output Ports
0 - DataFlow dataset
Avro Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadAvro Operator to Read Apache Avro Data.
The Avro reader supports the
Apache Avro format. Avro files contain metadata defining the schema of the file. Therefore, very few options are needed to configure the reader.
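For illustration, the embedded metadata is an Avro schema, which is a JSON document along these lines (a hypothetical record type); the reader derives the field names and types from it:
{"type": "record", "name": "Purchase",
 "fields": [
   {"name": "id", "type": "long"},
   {"name": "amount", "type": "double"},
   {"name": "customer", "type": ["null", "string"]}
 ]}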
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath - a string naming the original source file from which the record originated.
• splitOffset - a long providing the starting byte offset of the parse split in the file.
• recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note: These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Compression algorithm
(Read-only) Displays the compression algorithm used when writing the file.
Edit Schema
Selecting the edit schema button will bring up the configure schema dialog window. This dialog can also be accessed through the preview context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Splits
Select Do not split source files to indicate that parallelism should be applied only on a per-file basis. Typically, reading may be parallelized by using subsections (or splits) of a file. Select Use default split sizes to use the default size (1024 KB). If default split sizes are not used, specify the split size in the Target split size field.
Ports
Output Ports
0 - DataFlow dataset
Database Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadFromJDBC Operator to Read Databases.
Reads from a relational database table, storing the results in a staging dataset.
Click Fetch Metadata to retrieve information about all visible tables. This queries the database for all of the schema information that is visible, which is useful when constructing the query.
Note: This operation may take longer on very large databases.
Configuration
Configuration consists of the following steps:
1. Select the database connection to use.
2. (Optional) Select or enter a fetch size.
Note: If you are using MySQL, then this setting is ignored, and the proper settings for efficient streaming are used implicitly.
3. Enter an appropriate query. Typical queries are of the form select * from tableName (see the example after these steps). More powerful queries are also supported, to the extent that support is provided by the JDBC driver and the database to which you are connecting.
4. Click Refresh. This serves to validate your query and connection settings and also populates the schema metadata. You should refresh if the query or underlying table schema ever change.
5. Configure the types of any string columns that should be treated as enumerated types.
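For illustration, a parameterized query and its parameter query might look like the following (the table and column names are hypothetical, and standard JDBC-style ? parameter markers are assumed):
Query: select id, name, region from customers where region = ?
Parameter query: select region_code from target_regions
Typically, each row returned by the parameter query supplies values for the ? markers in the data query.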
Dialog Options
connection
query
Specifies the SQL query to execute. The results of the query become the contents of the output dataset.
parameter query
Specifies a query to execute that provides parameters for the parametrized data query.
fetch size
Specifies the fetch (or buffer) size to use, that is, the number of rows the database driver fetches from the server at a time.
Note: For MySQL, this setting has no effect and is overridden by the proper setting for efficient streaming (Integer.MIN_VALUE).
schema
Configures the schema. This is automatically discovered when you click the Refresh button. In addition, you may choose to treat any of the string fields as enumerated fields by altering their type. In some cases when utilizing stored procedures in the query, the refresh button may not be able to discover the complete schema. In this situation the Edit Schema button will be enabled and can be used to manually modify the schema.
configure schema during execute
Selecting this option causes the node to determine the output type at time of execution. This option should only be used when the correct schema cannot be determined by refreshing. Generally this should only be used when calling stored procedures that dynamically generate the output type during execution. In this case the correct output type cannot be discovered without executing the full query during composition. This can potentially cause a loss of performance and should be used with great caution.
Ports
Input Ports
0 - Query parameters (optional)
Output Ports
0 - DataFlow dataset
Delimited Text Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadDelimitedText Operator to Read Delimited Text.
This node reads a text file of delimited records. Records are separated by new lines, represented either Windows-style (CRLF) or UNIX-style (LF). By default, the style is automatically detected.
Delimited text supports the following separators used to identify field boundaries:
• A field separator, found between individual fields
• Field delimiters, marking the beginning or end of a field value
The field separator cannot be empty. The field delimiter can be empty. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one.
Fields containing significant characters, such as white space and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the field delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
Delimited files may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the reader will skip the header row; otherwise the first row is treated as a record and will appear in the output. Additionally, when discovering the schema of a delimited file with a header row, the field names will be taken from the header.
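As a small illustration with hypothetical data, a comma-separated file with a header row, double-quote field delimiters, and an embedded quote escaped by doubling might look like this:
id,name,comment
1,"Smith, John","said ""hello"""
2,Jones,plain value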
Data from the input file may be browsed within the configuration dialog. Once the field separator and other options are set, the fields within the input data should be parsed correctly and displayed in the data table within the configuration dialog.
An attempt will be made to identify the name and data type of each field. However, these may be modified by right-clicking the column header of the field and using the context menu. Select the option to configure the schema to modify the names and data types of fields.
The reader can tolerate some degree of error in the input file. How it reacts to errors is configurable. Typically, one of five actions can be taken when an error is discovered.
• IGNORE - Error is ignored; the field in question is assigned a value of NULL.
• DISCARD - Entire record is discarded.
• WARN - Error is ignored and a warning logged; the field in question is assigned a value of NULL.
• WARN AND DISCARD - Entire record is discarded and a warning logged.
• ERROR - Read fails.
Additionally, a limit can be placed on warnings, causing the read to fail if a certain threshold is reached.
When setting a field’s data type to enumerated, you have the option to discover the values of the enumeration from the input data. Click Start Discovery to begin the discovery process. The input data will be read, and each unique value of the selected field will become a category within the enumeration. You can either wait for the discovery to finish reading the whole file (or files) or click Stop Discovery to complete the operation. You should only stop the discovery process if you know that all of the categorical values have been discovered.
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath - a string naming the original source file from which the record originated.
• splitOffset - a long providing the starting byte offset of the parse split in the file.
• recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering. Note that these fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Record separator
Specifies the record separator, marking the end of a record. Common separators are selectable, including:
• automatically using the system default
• UNIX-style (LF)
• Windows-style (CRLF)
Other non-empty sequences may also be entered as the record separator.
Field separator
Specifies the field separator that separates the individual fields. Common separators are selectable from the list, although any non-empty sequence may also be entered as the field separator.
Field delimiters
Specifies the field delimiters, marking the beginning or end of a field value. These can either be the same (as is usually the case) or different. An empty sequence can only be used if both delimiters are the same, indicating no field delimiters.
Line comment
Specifies the line comment marker, indicating the line is a comment, not a record. Comments are only detected at the start of a record, after a record separator or at the beginning of the file.
Header row
Indicates whether the input contains a header row. If multiple files are being read, it is expected that either all files or no files have a header row.
Edit Schema
After you have successfully connected to a data file, clicking Edit Schema opens the Configure Schema dialog. This dialog can also be accessed through the preview’s context menu. This window lets you view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to reuse elsewhere and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Errors (Advanced)
On missing field
Indicates the behavior of the read when an expected field is not found.
On field error
Indicates the behavior of the read when a field value cannot be parsed to the field's data type.
On extra field
Indicates the behavior of the read when an unexpected extra field is found.
Maximum field length
Specifies the maximum length (in characters) permitted for a field before it is considered a field error.
Fail if record warnings exceed threshold
Specifies a hard limit on the number of records that generate warnings before terminating the read. Only errors that are configured to raise a warning are counted; IGNORE and DISCARD actions do not count against this limit. Multiple warnings for a record are only counted as one record warning.
Buffers (Advanced)
Read buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Decode buffer
Specifies the size (in KB) of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Schema Discovery (Advanced)
Number of characters to analyze
The size of the initial portion of the file to analyze for schema discovery purposes.
Null indicator
A string used to represent NULL valued fields in the discovered schema.
Date format
The format to use for discovered date-valued fields. This may either be the default ISO-8601 date (yyyy-MM-dd format) used by DataFlow or a user-specified format.
Timestamp format
The format to use for discovered timestamp-valued fields. This may either be the default ISO-8601 timestamp (yyyy-MM-dd'T'hh:mm:ss.SSSZ format) used by DataFlow or a user-specified format.
Metadata (Advanced)
Use available metadata
Specifies whether discovered metadata about the ordering and distribution of the data should be used.
Ports
Output Ports
0 - DataFlow dataset
HBase Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadHBase Operator to Read from HBase.
This node allows reading from
Apache HBase.
Cluster
The first step in configuring the operator is to specify an HBase Cluster and mapping method. You may select any cluster that you have configured in the
HBase Clusters preferences page.
After you have specified a cluster and mapping method, click Connect to connect to the HBase cluster. Proceed to either the field mapping or the HCatalog table selection.
Field Mapping
After a connection to the cluster is established, a list of available tables will be loaded. Select the table you want to read and click Get Schema. This will load available schema information, so you can map HBase qualifiers to output fields.
If the table was previously written by the DataFlow HBase Writer, the reader will be able to retrieve schema information about that table, including qualifiers and their types. This information will be automatically loaded into the schema mapping list, where you can assign an output field to each qualifier. If there are any additional qualifiers you have added to the table outside of DataFlow, you may manually add them using the Add Qualifier button.
If the table has not been written by the DataFlow HBase Writer, the only information available will be the table’s families. Select a family and click Add Qualifier to add a new qualifier to read. After a qualifier is added, you can specify an output field name. These fields will have a binary type, which may be converted downstream using another operator.
HCatalog
After a connection to the cluster is established, a list of available HCatalog databases will be loaded. Select the database, and a list of tables in that database is loaded. Select the HCatalog table from which to read.
Time Range
You may limit the records returned to a certain timestamp range by selecting the Limit by Time Range option. When this option is selected, you may enter a start time and an end time in the corresponding fields.
Dialog Options
HBase Master
HBase Cluster
Specifies the HBase cluster to which to connect. These are configured in the
HBase Clusters preferences page.
Field Mapping
Select this option if you wish to specify a field mapping.
HCatalog
Select this option if you wish to read from a table in HCatalog.
Connect
Connects to the specified HBase master. This connection must be established before further configuration options can be set.
Cancel
Cancels the connection attempt in progress.
Disconnect
Disconnects from the HBase cluster in order to select a new one.
Table
Table
Specifies the table to read.
Get Schema
Queries HBase for schema information about the selected table. This schema information must be retrieved before further configuration options can be set.
If the table was previously written by DataFlow, a full schema, with families, qualifiers, subfields, and types will be available.
If the table was not written by DataFlow, only the families will be available. You will need to manually specify the qualifiers to be read; these will be read as binary types.
Cancel
Cancels the schema retrieval attempt in progress.
Mapping Table
Source HBase Cell
Specifies the name of the family, qualifier, or subfield in HBase.
Type
Specifies the type of the cell.
Include
Indicates whether the corresponding cell should be read by the operator.
Target Field Name
Specifies the name of the output field, which will contain data from the corresponding cell.
(Row ID)
Specifies the output field which will contain the HBase row key.
(Timestamp)
If Include is selected for this field, the specified output field will contain the HBase timestamp.
Add Qualifier
Adds a qualifier to read to the currently selected family. A dialog will prompt you for the name of the qualifier.
Rename
Renames the selected family or qualifier. A dialog will prompt you for the new name.
Remove
Removes the selected family or qualifier.
HCatalog
Database
Specifies the HCatalog database from which to read the table schema.
Table
Specifies the HCatalog table from which to read the table schema.
Limit by Time Range
Limit by Time Range
If selected, only rows with timestamps between the start and end time will be read.
Start Time
Specifies the beginning of the time range to read, such as Jan 1, 1970 12:00:00 AM.
End Time
Specifies the end of the time range to read, such as Dec 31, 2099 11:59:59 PM.
Ports
Output Ports
0 - DataFlow dataset
JSON Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadJSON Operator to Read JSON Text.
This node reads a text file of JSON lines records. For more information about the JSON lines format, see
http://jsonlines.org/. Records are stored in JSON text format with one record per line, separated by a newline character.
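For illustration, a JSON lines input with two hypothetical records looks like this, one complete JSON object per line:
{"id": 1, "name": "Smith", "active": true}
{"id": 2, "name": "Jones", "active": false}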
Data from the input file may be browsed within the configuration dialog. Once the various JSON parsing options are set, the fields within the input data should be parsed correctly and displayed in the data table within the configuration dialog. An attempt will be made to identify the names and data types of each field. However, the data types of each field may be modified by right-clicking on the column header of the field and following the menu items on the context menu. Select the option to configure the schema to modify data types of fields.
The reader can tolerate some degree of error in the input file. How it reacts to errors is configurable. Typically, one of five actions can be taken when an error is discovered:
• IGNORE - the error is ignored; the field in question is assigned a value of NULL.
• DISCARD - the entire record is discarded.
• WARN - the error is ignored and a warning logged; the field in question is assigned a value of NULL
• WARN AND DISCARD - the entire record is discarded and a warning logged.
• ERROR - the read fails.
Additionally, a limit can be placed on warnings, causing the read to fail if a certain threshold is reached.
When setting a field's data type to enumerated, you have the option to discover the values of the enumeration from the input data. Click Start Discovery to begin the discovery process. The input data will be read, and each unique value of the selected field will become a category within the enumeration. You can either wait for the discovery to finish reading the whole file (or files) or click Stop Discovery to complete the operation. You should only stop the discovery process if you know that all of the categorical values have been discovered.
Dialog Options
Source
Location
Specifies the source location from which to read the input as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath - a string naming the original source file from which the record originated.
• splitOffset - a long providing the starting byte offset of the parse split in the file.
• recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering. Note that these fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Enable Multiline Format
Determines whether the parser allows JSON records to span multiple lines.
Allow Comments
Determines whether the parser allows Java or C++ style comments (both /* */ and // forms) within the parsed content.
Allow Unquoted Field Names
Determines whether the parser allows unquoted field names.
Allow Single Quotes
Determines whether the parser allows single quotes (apostrophe, ') for quoting strings.
Allow Unquoted Control Chars
Determines whether the parser allows JSON strings to contain unquoted control characters (ASCII characters with a value less than 32, including tab and line feed).
Allow Backslash Escaping Any
Determines whether the parser allows quoting of all characters using the backslash quoting mechanism. If not enabled, only characters explicitly listed by the JSON specification can be escaped.
Allow Numeric Leading Zeros
Determines whether the parser allows numbers to start with additional zeroes. Set this option if leading zeroes are allowed for numbers in the source.
Allow Non Numeric Numbers
Determines whether the parser recognizes the "Not a Number" (NaN) token as a legal floating-point value.
Edit Schema
After you have successfully connected to a data file, clicking Edit Schema opens the Configure Schema dialog. This dialog can also be accessed through the preview’s context menu. This window lets you view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to reuse elsewhere and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Errors (Advanced)
On missing field
Indicates the behavior of the read when an expected field is not found.
On field error
Indicates the behavior of the read when a field value cannot be parsed to the field's data type.
On extra field
Indicates the behavior of the read when an unexpected extra field is found.
Maximum field length
Specifies the maximum length (in characters) permitted for a field before it is considered a field error.
Fail if record warnings exceed threshold
Specifies a hard limit on the number of records that generate warnings before terminating the read. Only errors that are configured to raise a warning are counted; IGNORE and DISCARD actions do not count against this limit. Multiple warnings for a record are only counted as one record warning.
Buffers (Advanced)
Read buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Decode buffer
Specifies the size (in KB) of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Schema Discovery (Advanced)
Number of characters to analyze
The size of the initial portion of the file to analyze for schema discovery purposes.
Null indicator
A string used to represent NULL valued fields in the discovered schema.
Date format
The format to use for discovered date-valued fields. This may either be the default ISO-8601 date (yyyy-MM-dd format) used by DataFlow or a user-specified format.
Timestamp format
The format to use for discovered timestamp-valued fields. This may either be the default ISO-8601 timestamp (yyyy-MM-dd'T'hh:mm:ss.SSSZ format) used by DataFlow or a user-specified format.
Ports
Output Ports
0 - DataFlow dataset
Log Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadLog Operator to Read Log Data.
The Log Reader node reads log data files. Unless otherwise specified, each log type accepts a format specifier string in the same format as would be accepted by the particular log producer. The output is determined by the format and the logs being read.
The minimum configuration required is setting the log type property and the source, although some formats may additionally require the pattern property to be set to produce meaningful output. By default, the node attempts to determine the newline character used by the log files; however, this can be specified if known in advance.
Log entries that cannot be parsed will not be present in the output, while fields that cannot be parsed will be null valued.
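For example, when the log type is an Apache-style access log, the pattern is typically the same format string the server uses in its LogFormat directive; the standard combined log format (shown here purely for illustration) is:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
See Reading Log Data for the formats each log type actually accepts.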
Dialog Options
Source
Location
Specifies the source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this checkbox should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath - a string naming the original source file from which the record originated.
• splitOffset - a long providing the starting byte offset of the parse split in the file.
• recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering. Note that these fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Log Type
Determines the log type of the source. For more information about the available log types, see
Reading Log Data.
Log Pattern
Specifies the formatting pattern that will be used when parsing the log. Valid patterns are determined by the type of the log. See documentation on specific log types for more information.
Newline
Defines the newline character used in the source. If set to automatic the reader will attempt to automatically infer the newline characters used.
Discovery
If this is enabled, format discovery will be attempted if the log type supports it. It may allow discovery of the schema or other metadata. See the documentation on specific log types for more information.
Edit Schema
Clicking Edit Schema will bring up the configure schema dialog window. This dialog can also be accessed through the preview's context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Encoding (Advanced)
Character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Errors (Advanced)
On missing field
Indicates the behavior of the read when an expected field is not found.
On field error
Indicates the behavior of the read when a field value cannot be parsed to the field's data type.
On extra field
Indicates the behavior of the read when an unexpected extra field is found.
Maximum field length
Specifies the maximum length (in characters) permitted for a field before it is considered a field error.
Fail if record warnings exceed threshold
Specifies a hard limit on the number of records which generate warnings before terminating the read. Only errors that are configured to raise a warning are counted; IGNORE and DISCARD actions do not count against this limit. Multiple warnings for a record are only counted as one record warning.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Decode buffer
Specifies the size in kilobytes of the buffer to use for decoding bytes into character data.
Splits (Advanced)
Do not split source files
Indicates whether parallelism can only be applied on a per-file basis. Typically, reading may be parallelized using subsections (or splits) of a file.
Ports
Output Ports
0 - DataFlow dataset
MDF Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadMDF Operator to Read MDF Files.
The MDF File Reader node enables reading records from MDF format files. It can read data previously written using
MDF format. Currently only MDF version 4.0 or greater is supported. For more information, see
https://www.asam.net/standards/detail/mdf/wiki. This node will read the selected records with the specified ID from a single data group within the file.
Dialog Options
Input file location
The source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems such as HDFS. You may use the Browse button to browse the local and any configured remote file systems for a location.
Run Mode
The mode that decides what will be extracted from the MDF file.
• DATA - Extracts the data from the specified record.
• ATTACHMENT - Extracts the raw binary data of any attachments.
• METADATA - Extracts the metadata from the file.
Data Group
The data group that will be scanned for the specified record ID.
Record ID
The channel group within the selected data group that will be read. The channel group is referenced by its record ID, or optionally by 0 if it is the only channel group.
Apply Conversions
Specifies whether to apply any conversion rules supplied with the raw data.
Ports
Output Ports
0 - DataFlow dataset
ORC File Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadORC Operator to Read Apache ORC Files.
A node for reading records from ORC format files.
Note: For Hive-generated ORC files, the column names are stored as _col0, _col1, ... by default. The same column names are displayed using the ORC file reader operator.
Dialog Options
Input file or directory
The path to a file or a directory to read. If a directory is selected, all files within the directory will be read as ORC files of an equivalent schema.
Port
Output Port
0 - DataFlow dataset
Parquet File Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadParquet Operator to Read Apache Parquet Files.
This node enables the reading of records from Parquet format files.
Dialog Options
Input file or directory
Indicates the path to a file or a directory to read. If a directory is selected, then all of the files within the directory are read as Parquet files of an equivalent schema.
Port
Output Port
0 - DataFlow dataset
PMML Reader
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadPMML Operator to Read PMML.
Reads a Predictive Model Markup Language (PMML) model from a file. If you are using KNIME version 2.10 or later, then PMML version 4.2.x is supported. If you are using KNIME 2.9.x, then PMML version 4.0 or 4.1 is supported. For the PMML definition, see the website of the
Data Mining Group.
Note: You must use the DataFlow PMML reader when reading a model intended for DataFlow analytics. The standard KNIME PMML reader is incompatible with DataFlow analytics. This is because there are additional PMML elements used by DataFlow analytics that are not supported by the standard KNIME PMML reader. On the other hand, PMML produced by DataFlow analytics can be written using the standard KNIME PMML writer.
Ports
Output Ports
0 - DataFlow PMML Model
Staging Importer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ReadStagingDataset Operator to Read Staging Datasets.
Imports a DataFlow staging dataset. This allows you to exchange datasets between KNIME workflows. In addition, the input format is the standard produced by the DataFlow library operator WriteStagingDataset, so it can easily integrate with other applications built on the DataFlow library.
Dialog Options
Source
Location
Specifies the source location from which to read the input. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
A directory can be specified as a source, in which case all files in the directory will be read. Depending on the file system, wildcards may also be specified, which likewise indicates a multi-file read.
Read file locally
Indicates that the read should be performed on the local machine. Otherwise, if the workflow is executed on a DataFlow cluster, it is assumed that the source path is accessible from the cluster. If this is not the case, this check box should be selected.
Include source information in output
Indicates that output records should include fields identifying their original source and location within the source:
• sourcePath - a string naming the original source file from which the record originated.
• splitOffset - a long providing the starting byte offset of the parse split in the file.
• recordOffset - a long providing the starting offset for the record within the parse split.
Together, these fields can be used to reconstruct the original source ordering.
Note: These fields are part of the underlying schema of the source file and cannot be filtered, except by disabling this option.
Settings
Staging format
Specifies a read-only value showing the format of the input dataset.
Records per block
Specifies a read-only value showing the number of records stored per format block in the input dataset.
Edit Schema
Clicking Edit Schema opens the configure schema dialog window. This dialog can also be opened through the preview context menu. This allows you to view and customize the names, types, and inclusion of the fields present in the data. You can also save the schema to use later and set various defaults for each field individually.
Buffers (Advanced)
Read buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Ports
Output Ports
0 - DataFlow dataset
Write Nodes
Actian Vector Writer
ARFF Writer
ARFF Writer writes data in Attribute-Relation File Format (ARFF).
Avro Writer
Database Delete
Database Update
Database Writer
Delimited Text Writer
HBase Writer
Matrix Loader
Matrix Sink bulk loads data into an Actian Matrix instance in workflows executed from Matrix SQL.
ORC File Writer
PMML Writer
Staging Exporter
Actian Vector Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the LoadActianVector Operator to Load Actian Vector.
The Load Actian Vector node enables loading bulk data to an Actian Vector database instance. The loader supports the following methods for loading the data:
• The vwload utility is limited to executing on the Actian Vector server instance.
• The SQL COPY method requires the Actian Vector client to be installed and allows running remotely.
• The DIRECT load utility does not require any additional client to be installed and can load remotely. The DIRECT load utility is the fastest method for loading data into Vector.
• The COPY VWLOAD method can be run remotely and, optionally, exclusively over a JDBC connection.
For more information about methods for loading the data, see
Using the LoadActianVector Operator to Load Actian Vector.
For information about using Actian Vector Writer node to load data to Avalanche, see
Loading Delimited Text Data to Avalanche Using Actian Vector Writer Node.
This node can be used to load data into Actian Vector database version 4.2 and later. Connections to Actian Vector must be set up using the
File >
Preferences >
Actian Vector Connections page in DataFlow. After a connection is configured, you may load the data by selecting it from the connection list in the configuration window. To configure a Vector connection, see
Actian Vector Connections.
By default, the loader maps the source fields to the target database fields by position. If fewer input fields are available than target columns, the unmapped database fields are loaded with null values. If any of the unmapped fields cannot have a null value, then a loader error occurs. To manually map the source fields to the target table fields, click Map Fields to open a mapper window. You can click and drag source fields to a target database column or click any of the map options to map by name or position. You also can select the source field by using the list available on each target field.
When the loader is executed, the input data is transformed into loader format in a temporary data space. After the transformation is complete, the staged loader files are loaded into the database. When the load finishes, the temporary files are removed and the output of the load command is logged at the INFO level. Log entries can be viewed in the console or by browsing the log file.
Dialog Options
Connection
Selects an already configured connection to an Actian Vector database. To configure an Actian Vector connection, see
Actian Vector Connections.
Table Name
After a connection to an Actian Vector database is established, you can select a target table. Alternatively, click Create Table to execute a CREATE TABLE command in SQL.
Create Table
Opens a window that allows the execution of SQL commands using the configured connection.
User Name
Enter the user name of the Actian Vector database account with access to write the target table.
Note: This is optional for the vwload/DIRECT load utility.
Password
Enter the password for the database user account.
Note: This is optional for the vwload/DIRECT load utility.
Use Direct load
Select this option to use the DIRECT load utility. This option can be used from any machine.
Use vwload
Select this option to use the vwload utility. This option is available only on Actian Vector server instances.
Use SQL copy
Select this option to use the SQL copy method. This option requires Actian Vector client installation.
Rollback on
Enables rollback on errors.
Note: This option is not supported for the DIRECT load utility.
Rollback off
Disables rollback on error.
Note: This option is not supported for the DIRECT load utility.
Pre-clean Data
Enables data verification and cleaning. If this is enabled, additional operations and checks may be performed on the data before loading it into Vector to ensure it meets any table constraints or type limits. Instead of producing errors (such as the various truncation errors), all invalid values will be loaded as nulls if the table allows.
Truncation Errors
Select String to throw an error if a String field would be truncated to fit into a table column.
Select Decimal to throw an error if a Decimal field would lose scale to fit into a table column.
Max Errors
Indicates the maximum number of errors allowed before rollback. Default: zero.
Initialization SQL
Provides the SQL statement to execute before processing any records.
Finalization SQL
Provides the SQL statement to execute after processing all records.
Field Mapping
Click Map Fields to display a window that allows mapping from the source data fields to the target table columns. The following buttons are available between the source and target tables:
• Map by name: Maps the source fields to target columns where the names match.
• Map by position: Maps the source fields to target columns by position. The first input field is mapped to the first target column, and so on.
• Clear: Clears any mapping already completed.
To map a specific source field to a specific target column, click and drag the source field name onto the dataset field of the target column in the target table. If a field is not included in the mapping, then that field is not included in the data written to the database.
You may also customize the SQL type if a more specific type than the default is required. This is the type used when creating tables that do not exist or that are dropped and recreated.
Specific Load Options
Direct Insert Mode
Select the insert mode statement that applies for insert and merge operations. Default: DEFAULT
Note: This option can be used only with DIRECT load.
Buffer Size
Specifies the number of records to store in memory before sending them to Actian Vector. Typically, if the buffer size is larger, performance is enhanced.
However, a larger buffer size requires more native memory. For a table with wide varchar columns (>10000), use a smaller value (in the 1000s). Default: 50000.
Note: This option can be used only with DIRECT load.
SSH User Name
Specifies the operating system user ID used to make the SSH connection to the Vector leader node. If not provided in this node configuration, the user name entered in File > Preferences > Actian > Actian Vector Connections is used; it must correspond to an operating system account.
Note: This option can be used only with vwload.
SSH Password
Specifies the password for the SSH user ID. If not provided in this node configuration, the password entered in File > Preferences > Actian > Actian Vector Connections is used; it must be the password of the corresponding operating system account.
Note: This option can be used only with vwload.
Charset
Specifies the character set used when staging the data that will be loaded.
Note: This option can be used only with vwload.
Temp Directory
Specifies a directory on the local machine or within HDFS on the machine with the Vector instance that will be used to store intermediate files for data loading. The value must be a valid URI pointing to an existing directory on that file system.
Note: This option can be used only with vwload or SQL copy.
Use JDBC Only
Specifies that the load must be executed only with the JDBC connection.
Note: This option can be used only with copy vwload.
Ports
Input Ports
0 - Source data
Loading Delimited Text Data to Avalanche Using Actian Vector Writer Node
Actian Avalanche is a cloud-based version of Actian Vector. You can use the Actian Vector Writer node to load data into Avalanche.
To load data into a table in the managed Avalanche data warehouse instance, perform the following steps:
Configuring Remote Staging Location
You can configure one of the following as the staging location:
• An S3a location if Avalanche is on AWS. The AWS account is directly linked to the Avalanche instance.
Note: The S3 bucket must be in the same region as your Avalanche instance.
• An ABFS location, which is linked to an Azure storage account. The Azure storage account is directly linked to the Avalanche instance.
To configure the remote staging location:
1. Go to File > Preferences > Actian > Remote File Systems.
2. Click Add and add a staging location (AWS s3a or ABFS Azure). Make sure it is the root location.
3. If you have added an S3a location, then specify the
access key and
secret key configuration properties. For more information, see
Amazon S3.
4. If you have added an ABFS Azure location, then specify the following configuration properties:
• DefaultEndpointsProtocol: It is always https for Avalanche.
• Account Name
• Azure Client ID
• Azure Tenant ID
• Azure Client Secret
5. Click Apply and Close to save the settings.
Configuring Actian Vector Connection to Access Avalanche
To configure the Actian Vector connection to access Avalanche:
1. Go to File > Preferences > Actian > Actian Vector Connection and click Add.
The Vector Connection Information window is displayed.
Note: Actian Vector and Actian Avalanche are synonymous.
2. Specify the following information:
• Connection name: Unique name for the Avalanche connection.
• Host: Host name of the Avalanche instance. You can obtain this from the connection properties of the Avalanche managed service that is active.
• Instance: Avalanche JDBC port number.
• Database: Name of database where you want to load the data.
• User: User name to access the Avalanche database.
• Password: Password to access the Avalanche database.
• Verify Password: Retype password.
3. Click Advanced Properties.
The Advanced Vector Connection Properties window is displayed.
4. Click Add and add the encryption property with the value as on.
Also, deselect the Use Default Port option. Click OK to close the Advanced Vector Connection Properties window.
5. Click Test Connection to test the connection to the Avalanche instance.
6. Click OK to close the Vector Connection Information window.
7. Click Apply and Close to save the Avalanche connection settings.
Creating Workflow
To create a DataFlow workflow file, see
Creating an Actian DataFlow Workflow.
Adding Nodes to Workflow
Use the Delimited Text Reader operator for reading the flat file and the Actian Vector Writer operator for writing the data to a table in the Avalanche instance.
1. To read your flat file:
a. In Node Repository, expand Actian DataFlow > I/O > Read.
b. Select Delimited Text Reader.
The node is displayed on the editor.
c. Double-click Delimited Text Reader.
d. On the Reader Properties tab, click Browse and specify the file that must be read.
The contents of the file are displayed in the Preview section.
e. Select the Has header row option.
2. To write the data to Avalanche instance:
a. In Node Repository, expand Actian DataFlow > I/O > Write.
b. Select Actian Vector Writer.
The node is added to the editor.
c. Connect the Delimited Text Reader node to the Actian Vector Writer node.
d. Double-click Actian Vector Writer.
e. On the Configuration tab, select the Connection you have set up.
f. In General Settings, specify the following:
• Table Name: Select the table where you want to load the data. If the required table is not available, then click Create table to create one.
• Load Method: Select the load method as Use copy vwload.
Note: You do not have to provide User Name and Password because they are already part of the Avalanche connection settings.
g. Click Load Options and specify the following:
• Copy Insert Mode: Select DEFAULT.
• Use JDBC only: Select this option.
• Charset: Use UTF8
• Temp Directory: Specify the configured S3 or ABFS location as the temporary staging location and add /temp at the end.
h. Click Map Fields. The Source to Target Mapper window is displayed.
i. Click Map by Position. The source fields are mapped to the target fields by position.
j. Click OK to close the Source to Target Mapper window.
k. Click Apply to apply the settings to the node and then click OK to close the window.
Running Workflow
To run the workflow, click the Execute all executable nodes icon.
You can see that the first node is reading the data and the second node is loading the data to the Avalanche cluster.
ARFF Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteARFF Operator to Write Sparse Data.
This writer supports the Attribute-Relation File Format or ARFF. ARFF files are convenient in that they embed metadata within the data stream. The metadata defines the schema of the data including the field names, types, and positions.
When writing an ARFF file you can specify the data format: either sparse or dense. Sparse mode is very compact when used with sparse data. However, when used with dense data, the overhead can waste file space. Dense mode is best used when the data is not sparse. Dense format closely resembles the comma-separated values (CSV) format.
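For illustration, assuming three hypothetical attributes (a numeric flag, a numeric temperature, and a nominal outlook), the same two records could be written under either format; in sparse format, only non-zero values are written, as zero-based index/value pairs inside braces:
Dense @data section:
0, 27.5, sunny
1, 0, rainy
Sparse @data section:
{1 27.5, 2 sunny}
{0 1, 2 rainy}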
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location. In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
relation name
(Read-only) Displays the @relation attribute of the ARFF metadata.
value delimiter
Specifies the character used to delimit field values when they contain characters with special meaning to ARFF. ARFF defines only two options for delimiting values: single quotes (') and double quotes (").
data format
Specifies how records should be represented: densely or sparsely.
comments
(Read-only) Shows comments found in the ARFF metadata.
Encoding (Advanced)
character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Buffers (Advanced)
write buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
encode buffer
Specifies the size (in KB) of the buffer to use for encoding character data into bytes.
Ports
Input Ports
0 - DataFlow dataset
Avro Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteAvro Operator to Write Apache Avro Data.
This writer supports the
Apache Avro format. The Avro format is convenient because it embeds metadata within the data stream. The metadata defines the schema of the data, including the field names, types, and positions. Additionally, it is designed for parallelized writing and performs a level of data compression on its encoded data.
Dialog Options
Target
location
The target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for non-local file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
compression algorithm
The compression algorithm to use when writing data:
• none indicates no compression
• deflate indicates DEFLATE/gzip compression
• bzip2 indicates bzip2 compression
• snappy indicates Snappy compression
Buffers (Advanced)
write buffer
Specifies the size (in KB) of the buffer to use for performing I/O.
Ports
Input Ports
0 - DataFlow dataset
Database Delete
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the DeleteFromJDBC Operator to Write Database Deletes.
This node deletes data rows in the database based on the selected columns from the input. It allows setting the fields used as keys to find matching rows for the delete. When configuring the schema used to map the input fields to the database, any field that is mapped is used as a key field. Note that it is not necessary to map every field if only a subset of the input fields is required. These key fields are used to find matching rows to delete in the database. Additionally, the node produces an output table containing the data used for the delete and an additional column with the number of rows in the database affected by each row used in a delete statement.
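Conceptually, the delete behaves like a batched JDBC DELETE keyed on the mapped fields. The following stand-alone sketch is illustrative only (the connection URL, table, and column names are assumptions, not the node's internal implementation); it shows how a batch of keyed deletes yields a per-row affected count, which is what the additional output column reports:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
public class DeleteSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:example://host/db", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "DELETE FROM customers WHERE customer_id = ? AND region = ?")) {
            conn.setAutoCommit(false);
            // One batch entry per input row; only the mapped (key) fields are bound.
            stmt.setLong(1, 1001L);
            stmt.setString(2, "EU");
            stmt.addBatch();
            // Each element reports how many database rows the corresponding input row deleted
            // (driver permitting), mirroring the node's extra output column.
            int[] affected = stmt.executeBatch();
            conn.commit();
            for (int count : affected) {
                System.out.println("rows affected: " + count);
            }
        }
    }
}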
Configuration
Configuration consists of the following steps:
1. Configure the database connection.
2. Select a valid table name.
3. Configure the source to target mapping as needed.
Dialog Options
Connection
Table Name
Specifies the name of the table to delete rows from.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing deletes.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source-to-target field mappings.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - The data sent to the database with the returned status codes
Database Update
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the UpdateInJDBC Operator to Write Database Updates.
This node will update data rows in the database with the data values in the selected columns from the input. This node allows setting the fields that will be updated in addition to the fields that will be used as keys to find matching rows for the update.
When configuring the schema used to map the input fields to the database, any field that is mapped may be used as a key field or an update field. Note that it is not necessary to map every field if only a subset of the input fields is required; however, at least one key field and one update field must be specified when configuring the mapping. The key fields are used to find matching rows, which then have the specified update fields updated.
Additionally, the node produces an output table containing the data used for the update and an additional column with the number of rows affected by each row used in an update statement.
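In JDBC terms, the mapped update fields form the SET clause and the mapped key fields form the WHERE clause, with the isolation level and commit interval applied to the surrounding transaction. The sketch below is a minimal illustration under assumed names (connection URL, table, columns, and a commit interval of 1000), not the node's internal implementation:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
public class UpdateSketch {
    static final int COMMIT_INTERVAL = 1000; // rows between commits (assumed value)
    public static void main(String[] args) throws SQLException {
        String sql = "UPDATE customers SET status = ?, balance = ? WHERE customer_id = ?";
        try (Connection conn = DriverManager.getConnection("jdbc:example://host/db", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED); // isolation level option
            int pending = 0;
            // In practice this loop would iterate over the input rows.
            for (long id = 1; id <= 2500; id++) {
                stmt.setString(1, "ACTIVE"); // update field
                stmt.setDouble(2, 0.0);      // update field
                stmt.setLong(3, id);         // key field
                stmt.addBatch();
                if (++pending == COMMIT_INTERVAL) {
                    stmt.executeBatch();     // per-row affected counts feed the output table
                    conn.commit();           // commit interval reached
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch();
                conn.commit();
            }
        }
    }
}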
Configuration
Configuration consists of the following steps:
1. Configure the database connection.
2. Select a valid table name.
3. Configure the source to target mapping as needed.
Dialog Options
Connection
Table Name
Specifies the name of the table to update.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing updates.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source to target field mappings and allows setting the key and update fields.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - The data sent to the database with the returned status codes
Database Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteToJDBC Operator to Write to Databases.
Writes a dataset to a relational database table.
Configuration
Configuration consists of the following steps:
1. Configure the database connection.
2. Select a valid table name or use the Create Table button to create a table that you can select.
3. Configure the schema as needed, ensuring the correct datatypes are being used for the database.
By default, the loader maps source fields to the target database fields by position. If too few input fields are present, the unmapped database fields will be loaded with null values. If any of the unmapped fields are nonnullable, a loader error will occur.
To manually map the source fields to the target table fields, click the Map Fields button. A mapper dialog will appear. The source fields can be dragged and dropped onto a target database column. Alternatively one of the map buttons can be used to map by name or position. The source field can also be selected using the dropdown list available on each target field. The SQL type can also be customized if a more specific type than the default is desired.
Dialog Options
Connection
Table Name
After a connection is established, a target table can be selected. Alternatively, use the Create Table button to execute a create table command in SQL.
Create Table
Opens a dialog that allows the execution of SQL commands using the configured connection.
Isolation Level
Specifies the JDBC transaction isolation level to use when performing updates.
Commit Interval
Specifies the frequency (in rows) with which the transaction will be committed.
Map Fields
Opens a dialog that allows the customization of the source to target field mappings.
Initialization SQL
Specifies the SQL statement to execute before processing any records.
Finalization SQL
Specifies the SQL statement to execute after processing all records.
Ports
Input Ports
0 - DataFlow dataset
Delimited Text Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteDelimitedText Operator to Write Delimited Text.
Specifies a node for writing a text file of delimited records. Records are separated by a known character sequence. Typically, new lines are used, either Windows-style (CRLF) or UNIX-style (LF). Unless otherwise specified, the system-appropriate style is used.
Delimited text supports the following separators used to identify field boundaries:
• A field separator, found between individual fields
• Field delimiters, marking the beginning or end of a field value
The field separator cannot be empty. The field delimiter can be empty. It is not expected that all fields start and end with a delimiter, although if you start with a delimiter, you must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the field delimiter, it must be escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
Delimited files may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. These names come from the field names of the input data.
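The doubling rule for embedded delimiters described above is straightforward to reproduce. The helper below is a hypothetical illustration (the method quoteField is not part of the node or the DataFlow API); it wraps a value in the field delimiter and doubles any embedded delimiter characters:
public class DelimitedQuoting {
    // Wraps a value in the field delimiter, doubling any embedded delimiter characters.
    static String quoteField(String value, char delimiter) {
        String doubled = value.replace(
                String.valueOf(delimiter),
                String.valueOf(delimiter) + delimiter);
        return delimiter + doubled + delimiter;
    }
    public static void main(String[] args) {
        // Prints "ab""c" - the delimited form of the value ab"c from the example above.
        System.out.println(quoteField("ab\"c", '"'));
    }
}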
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
record separator
Specifies the record separator, marking the end of a record. Common separators are selectable, including:
• automatically using the system default
• UNIX-style (LF)
• Windows-style (CRLF)
Other non-empty sequences may also be entered as the record separator.
field separator
Specifies the field separator that separates the individual fields. Common separators are selectable from the list, although any non-empty sequence may also be entered as the field separator.
field delimiters
Specifies the field delimiters, marking the beginning or end of a field value. These can either be the same (as is usually the case) or different. An empty sequence can only be used if both delimiters are the same, indicating no field delimiters.
line comment
Specifies the line comment marker, indicating the line is a comment, not a record. Comments are only detected at the start of a record, after a record separator or at the beginning of the file.
header row
Indicates whether a header row should be written to the output.
Encoding (Advanced)
character set
Specifies the character set used to encode the file. A number of common encodings are selectable, although one may also be entered if not present in the list.
Schema (Advanced)
null indicator
Specifies a string used to represent NULL valued fields in the output.
date format
Specifies the format to use for date-valued fields in the output. This may either be the default ISO-8601 date (yyyy-MM-dd format) used by DataFlow or a user-specified format.
timestamp format
Specifies the format to use for timestamp-valued fields in the output. This may either be the default ISO-8601 timestamp (yyyy-MM-dd'T'hh:mm:ss.SSSZ format) used by DataFlow or a user-specified format.
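As a quick illustration (not part of the node's configuration), the two default patterns quoted above can be reproduced with java.text.SimpleDateFormat; the sample value is arbitrary:
import java.text.SimpleDateFormat;
import java.util.Date;
public class DefaultFormats {
    public static void main(String[] args) {
        Date now = new Date();
        // Default date pattern: ISO-8601 calendar date.
        System.out.println(new SimpleDateFormat("yyyy-MM-dd").format(now));
        // Default timestamp pattern, exactly as quoted above.
        System.out.println(new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss.SSSZ").format(now));
    }
}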
Buffers (Advanced)
write buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
encode buffer
Specifies the size in kilobytes of the buffer to use for encoding character data into bytes.
Metadata (Advanced)
Save metadata with data
Specifies whether metadata about the ordering and distribution should be saved with the data.
Ports
Input Ports
0 - DataFlow dataset
HBase Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteHBase Operator to Write to HBase.
This node allows writing to
Apache HBase.
Cluster
The first step in configuring the node is to specify an HBase Cluster and at least one mapping method. You may select any cluster that you have configured in the
HBase Clusters preferences page.
A mapping may be provided by hand, by specifying an HCatalog database and table, or both.
• To write to HBase without using HCatalog, provide a field mapping manually.
• To create a new HCatalog table, provide both a field mapping and an HCatalog database and table; the schema you provide will be written to HCatalog.
• To write to a table already in HCatalog, simply specify the HCatalog database and table.
After you have specified a cluster and mapping method, click Connect to connect to the HBase cluster. Proceed to the field mapping or the HCatalog table selection.
Field Mapping
After a connection to the cluster is established, a list of available tables will be loaded. Select the table you would like to write, or enter the name of a new table, and click Get Schema. This will load available schema information so you can map HBase qualifiers to input fields.
If the table was previously written by the DataFlow HBase Writer, the node will be able to retrieve schema information about that table, including qualifiers and their types. This information will be automatically loaded into the schema mapping list, where you can assign an input field to each qualifier. You may also manually add new qualifiers using the Add Qualifier button.
If the table has not been written by the DataFlow HBase Writer, the only information available will be the table’s families. Select a family and click Add Qualifier to add a new qualifier to write. After a qualifier is added, you can specify an input field name.
If the table does not exist, you will have to manually add the families and qualifiers you want to write. Click Add Family to add a new family. Within each family, click Add Qualifier to add new qualifiers to that family. After you have added families and qualifiers, you can specify input fields for each qualifier.
Note: When writing to a new table, the table will be created and its schema locked in once you click OK or Apply in the configuration window. Ensure you have added all the families you will need before you click OK or Apply.
In the DataFlow HBase Writer, new qualifiers can be created as either a scalar type or a record type. Scalar types store a single value in a cell, which can be read by any HBase client. Record types allow you to store multiple fields in a single HBase cell. Using record types increases the space and time efficiency, as only a single lookup is needed to read or write an entire set of fields; however, they are stored in a format unique to DataFlow, and the schema for these records remains fixed once they are created.
To create a record-type qualifier, click Add Qualifier, and when prompted for the type, select Record. After you have added the record-type qualifier to the schema, select it and click Add Field to add subfields to the record. You can then specify input fields for each subfield.
Note: After you have created a new record-type qualifier, its schema cannot be modified. Ensure you have added all the subfields you will need before you click OK or Apply.
In addition to families and qualifiers, you can also map the row ID and row timestamp to an input field. In most cases, both of these are optional: if the row ID is not specified, a unique ID will be generated for each row; if the row timestamp is not specified, the actual insert time will be used. However, when using HCatalog, you must specify a row ID.
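For readers unfamiliar with the underlying data model, the mapping corresponds directly to the standard HBase client API, where each write supplies a row key, an optional timestamp, and family/qualifier/value triples. The sketch below uses the plain HBase client; the table, family, and qualifier names are illustrative assumptions, and this is not the node's internal implementation:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
public class HBasePutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("example_table"))) {
            // Row key and an explicit timestamp; if no timestamp is given, HBase uses the insert time.
            Put put = new Put(Bytes.toBytes("row-001"), System.currentTimeMillis());
            // One scalar qualifier: family "f", qualifier "amount".
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("amount"), Bytes.toBytes(42L));
            table.put(put);
        }
    }
}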
HCatalog
After a connection to the cluster is established, a list of available HCatalog databases is loaded. Select the database, and a list of tables in that database is loaded. Select the HCatalog table you want to write to, or enter the name of a new table to create.
Dialog Options
HBase Master
HBase Cluster
Specifies the HBase cluster to which to connect. These are configured in the
HBase Clusters preferences page.
Field Mapping
Select this option if you want to specify a field mapping.
HCatalog
Select this option if you want to write to a table in HCatalog.
Connect
Connects to the specified HBase master. This connection must be established before further configuration options can be set.
Cancel
Cancels the connection attempt in progress.
Disconnect
Disconnects from the HBase cluster to select a new one.
Table
Table
Specifies the table to write.
Get Schema
Queries HBase for schema information about the selected table. This schema information must be retrieved before further configuration options can be set.
If the table was previously written by DataFlow, a full schema, with families, qualifiers, subfields, and types will be available.
If the table has not been written by DataFlow, only the families will be available. You will need to specify the qualifiers to be written manually.
If the table does not exist, you will need to specify the families and qualifiers manually.
Cancel
Cancels the schema retrieval attempt in progress.
Mapping Table
Source HBase Cell
Specifies the name of the family, qualifier, or subfield in HBase.
Type
Specifies the type of the cell.
Include
Indicates whether the corresponding cell should be written by the node.
Source Field Name
Specifies the name of the input field which will contain data from the corresponding cell.
(Row ID)
If Include is selected for this field, its value will be used as the HBase row key; otherwise, an automatically generated key will be used.
(Timestamp)
If Include is selected for this field, its value will be used as the HBase timestamp; otherwise, the actual insertion time will be used.
Add Family
Adds a family. A dialog will prompt you for the name of the family. Note that you can only add a family when first creating a new table.
Add Qualifier
Adds a qualifier to the currently selected family. A dialog will prompt you for the name of the qualifier, as well as the type. Scalar types store a single value, while record types store multiple fields in a single cell.
Add Field
Adds a subfield to the currently selected record-type qualifier. A dialog will prompt you for the name of the subfield. Note that you can only add a subfield when first creating a new record-type qualifier.
Rename
Renames the selected family or qualifier. A dialog will prompt you for the new name.
Remove
Removes the selected family or qualifier.
HCatalog
Database
Specifies the HCatalog database from which to read or write the table schema.
Table
Specifies the HCatalog table from which to read or write the table schema.
Ports
Input Ports
0 - DataFlow dataset
Load Actian Matrix
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the LoadMatrix Operator to Load Actian Matrix.
The Load Actian Matrix node bulk loads data into a Matrix database instance. The loader directly loads data into Matrix using an ODI connector. Loading supports parallel and distributed execution (on a cluster).
Connections to Actian Matrix must be preconfigured using the Actian Matrix Connections preference page in the Actian preferences section. After a connection is configured, it can be used to load data by selecting it from the connection list in the configuration dialog.
To configure a Matrix connection, see
Actian Matrix Connections.
By default, the loader maps source fields to the target database fields by position. If too few input fields are present, the unmapped database fields are loaded with null values. If any of the unmapped fields are nonnullable, a loader error occurs.
To manually map the source fields to the target table fields, click the Map Fields button to open a mapper dialog. You can click and drag source fields to a target database column, or click one of the map buttons to map by name or position. The source field can also be selected using the list available on each target field.
When the loader is executed, the input data is transformed into loader format in a temporary data space. After the transformation is complete, the staged loader files are loaded into the database. When the load finishes, the temporary files are removed.
When the load finishes, the output of the load command is logged at the INFO level. Log entries can be viewed in the console or by browsing the log file.
Network communication between DataFlow and the Matrix database is accomplished using TCP/IP sockets. The properties startPortRange and endPortRange can be used to limit the range of TCP/IP ports utilized for this communication. The ports within this range need to be available for two-way communication. In a firewall environment, the ports must be opened within the firewall configuration. The default port range is from 40000 to 41000 (exclusive).
When you run multiple workflows concurrently or run workflows with multiple Load Matrix operators, ensure the port range is large enough to allow at least one open port per slice per operator. Also, ensure that Matrix database workload management is configured to allow enough concurrent queries to leave at least one available for each active Load Matrix operator. This availability is configured using the max_concurrent_queries property in ~paraccel/padb/rel/etc/padb_wlm.xml on the Matrix server.
You may configure a DataFlow service class with specific properties for DataFlow jobs, which will allow the loader to check for available resources before launching a job and fail quickly if there are insufficient resources on the Matrix server; without this service class, a job may hang when there are insufficient resources. For more information about configuring this service class, see Installing and Configuring DataFlow for Matrix On-Demand Integration (ODI).
Data is loaded into Matrix in parallel. A data connection into Matrix is created for each parallel stream within DataFlow. The number of data connections can be limited by setting the max data connections property (see below). Setting this value may cause a redistribution of the data before loading. Consider setting the overall parallelism for the loader job to the desired number of connections to prevent data redistribution.
Note: Setting the maxConnections property may incur the overhead of data redistribution. Data redistribution occurs to gather data from a higher number of data streams to a lower number of data streams within DataFlow.
Dialog Options
Connection
Selects a preconfigured connection to a Paraccel MPP database. To configure a Paraccel MPP connection, see
Actian Matrix Connections.
Table Name
After a connection is established, you can select a target table. Alternatively, use the Create Table button to enter a CREATE TABLE command to execute in SQL.
Create Table
Opens a dialog that allows the execution of SQL commands using the configured connection.
Log Frequency
Specifies the frequency to log loader progress. Defaults to zero (no logging).
Trace Level
Specifies the trace (or debug) level used within the database for the load operation. Defaults to zero (no trace logging). After loading, if trace logging is enabled, query the STL_UDF_TRACE table to view the trace logs.
Initialization SQL
Specifies the SQL statement to execute before processing any records.
Finalization SQL
Specifies the SQL statement to execute after processing all records.
Start of Port Range
Sets the starting port in a range of network ports allocated for DataFlow to Matrix communication.
End of Port Range
Sets the last port in a range of network ports allocated for DataFlow to Matrix communication.
Max Data Connections
Specifies the maximum data connections desired for loading data. The default is a connection per DataFlow parallel stream.
Field Mapping
Click Map Fields to display a dialog that allows mapping from the source data fields to the target table columns. Three buttons are provided between the source and target tables:
• Map by name: Maps the source fields to target columns where the names match exactly.
• Map by position: Maps the source fields to target columns by position, that is, the first input field is mapped to the first target column and so on.
• Clear: Clears any mapping already completed.
To map a specific source field to a specific target column, click and drag the source field name onto that column's row in the target table (specifically, its dataset field). If a field is not included in the mapping, it will not be included in the data written to the database.
You also may customize the SQL type if you want a more specific type than the default. This is the type used when configuring tables that do not exist or that will be dropped and recreated.
Ports
Input Ports
0 - Source data
Matrix Sink
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the MatrixSink Operator to Load Actian Matrix from SQL.
The Matrix Sink node loads large data into a Matrix database instance. This node is similar to the
Load Actian Matrix node. Refer to that node for more information about the process for loading large data.
The Matrix Sink node is intended to be embedded in workflows that are executed from Matrix SQL. Workflows containing the Matrix Sink node can be exported directly to a Matrix instance as described in
Exporting an Actian DataFlow Workflow. Execution of workflows from Matrix SQL requires the
invoke_dataflow user-defined function (UDF) to be installed on the Matrix instance. For more information, see
Installing and Configuring DataFlow for Matrix On-Demand Integration (ODI).
Note: Workflows that are executed from Matrix SQL must have only one instance of the Matrix Sink node.
Dialog Options
There is no defined configuration for the Matrix Sink node. All the required settings are provided to the node by the invoke_dataflow UDF before executing the workflow.
Ports
Input Ports
0 - Source data
ORC File Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteORC Operator to Write Apache ORC Files.
This node can write records in
Apache Optimized Row Columnar (ORC) format files.
Dialog Options
Target
Location
Specifies the target location into which to write the output as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
Note: Do not select append mode as the write mode. If selected, an exception within KNIME is thrown when the configuration is saved or applied from the dialog.
Write single output file
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input data set, although no guarantees are made as to which file any given record will be written.
Write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
Block padding
Switches block padding in ORC format on or off.
Buffer size
Specifies the size of the ORC writer output buffer.
Compression algorithm
Specifies the compression format to use within the ORC file.
ORC version
Specifies the version of the ORC file format to use.
Row index stride
Specifies the distance between entries in the ORC row index. This value must be >= 1000. If disabled, no index is created.
Stripe size
Specifies the maximum size of a stripe inside the ORC file.
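These settings correspond closely to writer options in the standard Apache ORC Java API. The following is a minimal sketch under assumed values (file name, schema, and the specific option settings are illustrative, and the node may configure the writer differently):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;
public class OrcWriteSketch {
    public static void main(String[] args) throws Exception {
        TypeDescription schema = TypeDescription.fromString("struct<name:string,value:int>");
        Writer writer = OrcFile.createWriter(new Path("example.orc"),
                OrcFile.writerOptions(new Configuration())
                        .setSchema(schema)
                        .compress(CompressionKind.ZLIB)    // compression algorithm
                        .stripeSize(64L * 1024 * 1024)     // stripe size (bytes)
                        .bufferSize(256 * 1024)            // writer output buffer size (bytes)
                        .rowIndexStride(10000)             // row index stride; 0 disables the index
                        .blockPadding(true));              // block padding on
        // ... append VectorizedRowBatch instances here ...
        writer.close();
    }
}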
Port
Input Port
0 - Input dataset
PMML Writer
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WritePMML Operator to Write PMML.
Writes a PMML model to a file. If you are using KNIME version 2.10 or later, then PMML version 4.2.x is supported. If you are using KNIME 2.9.x, then PMML version 4.0 or 4.1 is supported. For the PMML definition, see the
Data Mining Group website.
Ports
Input Port
0 - DataFlow PMML Model
Staging Exporter
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the WriteStagingDataset Operator to Write Staging Data Sets.
Exports a DataFlow staging dataset. This allows you to exchange datasets between KNIME workflows. In addition, the output format is the standard consumed by the DataFlow library operator ReadStagingDataset, so it can be easily and efficiently consumed by other applications built on the DataFlow library.
Dialog Options
Target
location
Specifies the target location into which to write the output. This is specified as a DataFlow-style file path, supporting URI-like syntax for nonlocal file systems, such as HDFS. The Browse button can be used to browse the local and any configured remote file systems for a location.
In addition to a location, a write mode must be chosen, indicating how to behave when the target location already exists.
single output
Indicates whether a single output file or multiple files (one per parallel partition) should be written. Writing multiple files can be more efficient than writing a single file, as the I/O can be done in parallel. If multiple files are being written, the location is interpreted as a directory into which to write the files. Collectively, the files will contain the entire input dataset, although no guarantees are made as to which file any given record will be written to.
write file locally instead of in cluster
Determines whether the writer should write the file locally on the client or in the cluster. This disables parallelism on the node.
Settings
staging format
Specifies the format to use for storing data. Compact row is most efficient for reading all columns and uses the least memory. Columnar is most efficient for reading a small subset of columns but at a cost of using more memory.
records per block
Specifies the number of records to store per format block. Larger values require more memory to encode blocks. There is generally no reason to alter this value from the default.
Buffers (Advanced)
write buffer
Specifies the size in kilobytes of the buffer to use for performing I/O.
Ports
Input Ports
0 - DataFlow dataset
Force Staging
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ForceRecordStaging Operator to Explicitly Stage Data.
This node is useful when creating a workflow intended for the DataFlow Executor, as it forces staging to occur. The node partitions DataFlow graphs into smaller graphs at its placement (cut points), which is useful for preventing queue expansion that would otherwise occur.
Ports
Input Ports
0 - DataFlow dataset
Output Ports
0 - DataFlow dataset
Log Rows
KNIME This topic describes a KNIME node. It is useful for testing the behavior of other nodes during workflow development.
Logs information about the input data from a flow. The record type of the flow is logged. A log frequency is used to determine which data rows are logged. A data row is logged by outputting the value of each input field within the row. A format can be provided that specifies the output format for each logged row. The final row count is also logged.
Dialog Options
Format
Specifies the format to use when logging rows. See
MessageFormat for more information about the syntax of formats. Two variables are passed to the format: the row count and the row contents. The default format is
row {0} is {1}.
Log Frequency
Specifies the frequency of rows to log. Setting the frequency to 1 logs every row, a value of 2 logs every other row, and so on. The default frequency of 0 turns off row logging.
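As an illustration of the Format option above, the pattern behaves like a standard java.text.MessageFormat pattern, with the row count bound to {0} and the row contents to {1}; the sample values below are arbitrary:
import java.text.MessageFormat;
public class LogRowsFormat {
    public static void main(String[] args) {
        long rowCount = 42;
        String rowContents = "[name=Smith, age=31]";
        // Prints: row 42 is [name=Smith, age=31]
        System.out.println(MessageFormat.format("row {0} is {1}", rowCount, rowContents));
    }
}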
Ports
Input Ports
0 - Input
Parse Text Fields
KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see
Using the ParseTextFields Operator to Parse Text Records.
Specifies a node for parsing records consisting of text fields. These records typically represent data which has been read from a file and broken into individual fields based on some criteria, but with no additional interpretation performed. This node can be used to interpret the text fields as configured data types, making the result suitable for use with other operators.
The provided schema is used to determine the output type. Input fields are parsed according to the format specified for the field of the same name in the output schema. Fields present in the input but not named in the output schema are dropped. Fields named in the output need not exist in the input; their value will be NULL in the output.
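Conceptually, interpreting each text field is similar to parsing strings with the JDK classes referenced under Field Details below (SimpleDateFormat and NumberFormat). The stand-alone sketch that follows uses arbitrary sample values, patterns, and a hypothetical null indicator, and is illustrative only:
import java.text.NumberFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
public class ParseSketch {
    public static void main(String[] args) throws Exception {
        // A date-valued field parsed with a SimpleDateFormat pattern.
        Date d = new SimpleDateFormat("yyyy-MM-dd").parse("2014-07-01");
        // A numeric field parsed with a NumberFormat instance.
        Number n = NumberFormat.getInstance(Locale.US).parse("1,234.5");
        // A field matching the null indicator becomes NULL in the output.
        String raw = "?";
        String nullIndicator = "?";
        Object value = raw.equals(nullIndicator) ? null : raw;
        System.out.println(d + " / " + n + " / " + value);
    }
}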
Dialog Options
Schema
The schema pane is used to define the structure of the output schema, identifying the field names and types.
Name
Defines the name of the selected field. Field names must be unique.
Type
Defines the data type of the selected field. Additional refinement of the type describing the allowed values and format will be editable in the Field Details, as is appropriate.
Default Null Indicator
Defines the default string value used to indicate a null in text. A number of commonly used options are predefined for selection.
Generate Schema
Replaces the current schema with a default one based on the available input fields.
Load Schema
Reads a schema from a file, replacing the current schema.
Save Schema
Writes the current schema to a file for future use. The schema must currently be in a valid state before it can be saved.
Field Details
The field details pane is used to refine the definition of the currently selected schema field.
Format
Defines the text formatting for the selected field, if allowed for the selected field type. A number of commonly used formats are predefined for selection. Formatting is type-dependent:
• Appropriate formats for dates and timestamps are any supported by SimpleDateFormat.
• Appropriate formats for numeric types are any supported by NumberFormat.
• Strings do not support custom formats, although they do support choosing whether whitespace should be trimmed.
If default formatting for the type should be used, select Use default format. For numeric types, default formatting is default Java formatting. For dates and timestamps, default formatting follows ISO-8601 formatting.
Null Indicator
Defines the string value indicating a null in the text for the selected field. A number of commonly used values are predefined for selection.
If the schema default should be used, select Use default null indicator.
Allowed Values
If the selected field type is enumerated, the possible values that the enumeration can take are specified here.
Ports
Output Ports
0 - Output parsed data rows.
1 - Output data rows that could not be parsed successfully for all text fields.