Transformation Nodes
Aggregate Nodes
Cross Join
Cross Join performs a cross join of two data sets.
Group
Group aggregates data values using aggregation functions based on groups of data defined by key fields.
Join
Join joins two data sets by one or more keys.
Union All
Union All performs a union of two flows.
Cross Join
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CrossJoin Operator to Cross Products.
The Cross Join node combines two input flows and produces the cartesian product of the sets. The output is typed using the merge of the two input types. If the right type contains a field already named in the left type, it will be renamed to avoid collision.
Use this node with caution. It produces a product of its two input data sets. Even with fairly small data sets, the amount of output data can be large.
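As an illustration of the semantics only (a minimal sketch, not the DataFlow implementation), a cross join pairs every left row with every right row; the renaming of colliding right-side fields is emulated here with a hypothetical RIGHT_ prefix:

function crossJoin(left, right) {
  const out = [];
  for (const l of left) {
    for (const r of right) {
      const row = { ...l };
      for (const [name, value] of Object.entries(r)) {
        // Emulate collision renaming; the node's actual renaming scheme may differ.
        row[name in row ? "RIGHT_" + name : name] = value;
      }
      out.push(row);
    }
  }
  return out; // output size is left.length * right.length
}

With only 1,000 rows on each input, the output already contains 1,000,000 rows, which is why the caution above applies.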
Dialog Options
Row Buffer Size
Specifies the size (in rows) of the memory buffer used to cache input rows for join processing. Larger values can increase performance due to decreased intermediate file buffering.
Ports
Input Ports
0 - Input port representing the Left data set.
1 - Input port representing the Right data set.
Output Ports
0 - Output port containing the results of the cross join operation.
Group
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Group Operator to Compute Aggregations.
The Group node aggregates data within key groups. A set of aggregation functions is available, including average, count, minimum value, maximum value, standard deviation, and others. If no key fields are provided, the aggregation functions are applied to all of the data as one large group.
The node uses groups of consecutive equal keys ("key groups") to determine which data values to aggregate. The input data need not be sorted; if it is already sorted, performance will be optimal.
The output of the Group node will include the key fields (if specified) and the outcome of each aggregation function. A row of output is generated for each distinct set of key values. If no key fields are specified, a single row will be output with the aggregation results for all input rows.
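A minimal sketch of the key-group semantics, assuming a single key field and sum and count aggregations (field names are illustrative):

function groupAggregate(rows, keyField, valueField) {
  const groups = new Map();
  for (const row of rows) {
    const key = row[keyField];
    const acc = groups.get(key) || { sum: 0, count: 0 };
    acc.sum += row[valueField];
    acc.count += 1;
    groups.set(key, acc);
  }
  // One output row per distinct key value; with no key fields,
  // the node instead produces a single row for all input data.
  return [...groups].map(([key, acc]) =>
    ({ [keyField]: key, sum: acc.sum, count: acc.count }));
}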
Dialog Options
Key Fields
Specifies the fields to use as keys that define key groups for aggregations. If using the sort mode, the data must be sorted by the given key fields. Key fields are not required. If key fields are not supplied, the data will be aggregated as a single key group.
Aggregations
Aggregation functions can be applied to data fields within the input data source. Click the Add button to create a new aggregation operation. Select an aggregation function and the input field. Some aggregations such as correlation require two input fields.
Ports
Input Ports
0 - Input data to aggregate.
Output Ports
0 - Output data containing aggregation results.
Join
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Join Operator to Do Standard Relational Joins.
The Join node joins two inputs by the specified key fields. Inner joins and the various types of outer joins are supported by specifying the join type.
In all join modes, repeated key values on the left and right sides are supported. If a key group contains multiple rows on both sides, then all combinations of left and right rows in the key group appear in the output for that key group. Note that in this case, memory consumption is lower if the side with more key repetition is linked to the left input.
When using the Hash Join option, link the smaller of the two data sources to the right input to reduce memory consumption. When using Hash Join, there is no requirement to sort the input data.
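A minimal sketch of the inner join with repeated keys, building an in-memory index over the right side as the Hash Join option does (field names are illustrative; outer-join null padding is omitted, and in this sketch left fields win on name collisions):

function hashInnerJoin(left, right, key) {
  const index = new Map(); // key value -> array of right rows
  for (const r of right) {
    const bucket = index.get(r[key]) || [];
    bucket.push(r);
    index.set(r[key], bucket);
  }
  const out = [];
  for (const l of left) {
    // All combinations of left and right rows within a key group are emitted.
    for (const r of index.get(l[key]) || []) out.push({ ...r, ...l });
  }
  return out;
}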
Dialog Options
Join type
Specifies the type of join to execute: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER.
Hash Join
Uses the join operator that creates an in-memory index from the right side input. Using this option requires all of the data from the right side input to fit into memory. If not using hash join, both left and right sides will be sorted if they are not already.
Merge key fields
Sets whether to merge the left and right key fields into a single field for output.
Filter Inputs
Opens a dialog that lets you set which fields from the left and right inputs will be pre-filtered before the join.
Left key fields
Specifies the fields of the left input data set to use as the key fields in the join operation.
Right key fields
Specifies the fields of the right input data set to use as the key fields in the join operation.
The Join Condition can optionally be set using the Simple Predicate view or Predicate Expression view.
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
Specifies that all the predicates must be true for the join condition.
Any conditions true
Specifies that at least one predicate must be true for the join condition.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input port representing the Left data set.
1 - Input port representing the Right data set.
Output Ports
0 - Output port containing the results of the join operation.
Union All
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the UnionAll Operator to Create Union of Data Sets.
The Union All node combines two input flows into a single flow. Usually no guarantee is provided as to the ordering of the rows in the resulting flow. The operator attempts to provide data with the minimum of delay, following a first-come, first-served policy as closely as possible. However, if the data order specified in the metadata of both input flows is the same, then the node will attempt to preserve the sorted data order. The metadata will have its data order specified if the flow has gone through a Sort or Assert Sorted node.
The output type can be provided by using the Schema Editor. If a schema is not provided, the output type, including the field names, can be automatically derived based on the settings provided.
If a schema is provided, the inputs will be manipulated to match the output schema. The fields are mapped from the inputs to the output by field name. Fields contained in an input but not in the output will be dropped. Fields contained in the output but not in an input will contain null values. The input fields will be converted to their matching output field type as needed. If a conversion is not supported, the configuration will fail.
For example, if an input field has the type string and the output type is a date, the input values will be converted according to the schema specification for the output field. In the case of a date, a pattern to use for the conversion can be provided. If one is not provided, a default pattern will be applied.
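A minimal sketch of mapping two inputs onto a provided schema by field name, with the drop-extra and null-fill behavior described above (type conversion omitted):

function mapToSchema(row, schemaFields) {
  const out = {};
  for (const name of schemaFields) {
    // Fields missing from this input become null; extra input fields are dropped.
    out[name] = name in row ? row[name] : null;
  }
  return out;
}

function unionAll(leftRows, rightRows, schemaFields) {
  return [...leftRows, ...rightRows].map(row => mapToSchema(row, schemaFields));
}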
Dialog Options
Automatically Configure Schema
Automatically determines the appropriate output schema based on the selected options.
Map by Position
Automatically maps the fields in the inputs by position.
Map by Name
Automatically maps the fields in the inputs by name.
Include Extra Fields
Includes extra fields that are only present in one side of the input when automatically determining the output schema. Otherwise, extra fields will not be included in the output.
Manually Configure Schema
Enables the Schema Editor for manual configuration of the output schema. This enables the schema preview.
Schema Editor
Used to define the structure of the output schema, identifying the field names and types.
Name
Defines the name of the selected field. Field names must be unique.
Type
Defines the data type of the selected field. Additional refinement of the type describing the allowed values and format will be editable in the Field Details, as is appropriate.
Default Null Indicator
Defines the default string value used to indicate a null in text. A number of commonly used options are predefined for selection.
Generate Schema
Replaces the current schema with a default one based on the available input fields.
Load Schema
Reads a schema from a file, replacing the current schema.
Save Schema
Writes the current schema to a file for future use. The schema must be in a valid state before it can be saved.
Field Details
Used to refine the definition of the currently selected schema field.
Format
Defines the text formatting for the selected field, if allowed for the selected field type. A number of commonly used formats are predefined for selection. Formatting is type-dependent:
Appropriate formats for dates and timestamps are any supported by SimpleDateFormat.
Appropriate formats for numeric types are any supported by NumberFormat.
Strings do not support custom formats, although they do support choosing whether whitespace should be trimmed.
If default formatting for the type should be used, select Use default format. For numeric types, default formatting is default Java formatting. For dates and timestamps, default formatting follows ISO-8601 formatting.
Null Indicator
Defines the string value indicating a null in the text for the selected field. A number of commonly used values are predefined for selection.
If the schema default should be used, select Use default null indicator.
Allowed Values
If the selected field type is enumerated, specifies the possible values that the enumeration can take.
Ports
Input Ports
0 - Left Input
1 - Right Input
Output Ports
0 - The union of the two inputs
Filter Nodes
Filter Existing Rows
Filter Existing Rows performs filtering of a dataset based on intersection with another dataset.
Filter Rows
Filter Rows filters rows based on defined field predicates.
Limit Rows
Limit Rows selects a subset of the input dataset within a specified range of positions.
Random Sample
Random Sample randomly samples input data.
Remove Duplicates
Remove Duplicates removes duplicate rows.
Select Fields
Select Fields selects (filters) fields in a record flow.
Filter Existing Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterExistingRows Operator to Filter by Data Set Membership.
Filters the left input based on the presence of key values in the right input. Performs both a semi-join and anti-join of the data.
In the semi-join, only rows from the left that match one or more rows on the right using the keys are output, whereas in the anti-join, only rows from the left that match no rows on the right are output.
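A minimal sketch of the two outputs, assuming a single key field (names are illustrative):

function filterExistingRows(left, right, key) {
  const rightKeys = new Set(right.map(r => r[key]));
  const semi = left.filter(l => rightKeys.has(l[key]));  // port 0: matching rows
  const anti = left.filter(l => !rightKeys.has(l[key])); // port 1: non-matching rows
  return { semi, anti };
}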
Dialog Options
Hash Join
Uses the join operator that creates an in-memory index from the right side input. Using this option requires all of the data from the right side input to fit into memory. If not using hash join, both left and right sides will be sorted if they are not already.
Left key fields
Specifies the fields of the left input dataset to use as the key fields in the join operation.
Right key fields
Specifies the fields of the right input dataset to use as the key fields in the join operation.
The Join Condition can optionally be set using the Simple Predicate view or Predicate Expression view.
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
All the predicates must be true for the join condition.
Any conditions true
At least one predicate must be true for the join condition.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input port representing the left dataset.
1 - Input port representing the right data set.
Output Ports
0 - Output port containing the results of the semi-join operation.
1 - Output port containing the results of the anti-join operation.
Filter Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterRows Operator to Filter by Predicate.
Filters rows from the source input based on the predicate defined in the configuration. Records passing the predicate—those for which the predicate evaluates to true—are emitted on the standard output.
Records that fail the predicate are emitted on the rejected output.
The filter predicate can be defined in two different ways. In the Simple Predicate view, the filter is defined as a compound statement of simple predefined field predicates. In the Predicate Expression view, the predicate is defined using a built-in expression language, allowing for arbitrarily complex predicates to be defined. The Expression Language is defined at Expression Language.
To use the Simple Predicate view, select the appropriately named tab. The field predicates are displayed as a table. The first column chooses the field to which the predicate applies. The second chooses the predicate. The third provides an additional argument to the predicate; some predicates only require a single input, in which case the third column is not editable for the row. When editing the third column:
If the predicate contains the FIELD label, pick a field from the input data in the right-hand dropdown.
If the predicate contains the VALUE label, enter a compatible value in the right-hand text field.
The individual field predicates are combined to produce the filter predicate according to the setting of the radio buttons below the table. If the All Conditions True option is selected, all the predicates must be true; the field predicates are combined using a logical AND. If the Any Conditions True option is selected, at least one predicate must be true; the field predicates are combined using a logical OR.
To use the Predicate Expression view, select the appropriately named tab. A text area will be displayed in which you can enter an expression. If you are moving to this view from the Simple Predicate view, the expression is initially equivalent to the one defined in the simple predicate table. Because the simple view cannot represent all predicates, when you are editing the expression in the Predicate Expression view, the simple predicate tab is disabled. It can be reenabled only if:
The expression is empty.
The expression matches the original expression in the simple view. The expression can be reset to this value at any time by clicking the Reset button.
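As a sketch of the two-output behavior (the expression language itself is documented separately), a predicate splits the rows between the output and rejected ports:

function filterRows(rows, predicate) {
  const output = [];   // port 0: predicate evaluates to true
  const rejected = []; // port 1: predicate evaluates to false
  for (const row of rows) (predicate(row) ? output : rejected).push(row);
  return { output, rejected };
}

// For example: filterRows(data, row => row.amount > 100 && row.region === "North")
// (field names here are illustrative)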
Dialog Options
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
All the predicates must be true for the record to be pushed to the output port.
Any conditions true
At least one predicate must be true for the record to be pushed to the output port.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input
Output Ports
0 - Output rows that met the conditions of the configuration.
1 - Output rows that did not meet the conditions of the configuration. This is the complement of the rows on port 0.
Limit Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LimitRows Operator to Limit Output Rows.
Select a subset of the rows from the input based on an initial offset and maximum number. Only rows that are positioned within the specified range are output.
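The equivalent row-range selection, as a minimal sketch (this sketch assumes 1-based row positions, matching the wording of the options below):

function limitRows(rows, startWith, maxRows) {
  // Rows before the starting position are discarded;
  // at most maxRows rows are output after that.
  return rows.slice(startWith - 1, startWith - 1 + maxRows);
}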
Dialog Options
Start with row
Specifies the position of the first row in the input set to output. The count of rows begins with this row. Input rows before this position are discarded.
Maximum number of rows
Specifies the maximum number of rows from the input set to output. Once this many rows have been output, any remaining input rows are discarded.
Ports
Input Ports
0 - Input
Output Ports
0 - Output rows between the specified positions.
Random Sample
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SampleRandomRows Operator to Sample Data.
This node randomly samples rows from the input data set. Selected rows are pushed to the output of the node. Two modes for sampling are supported: Relative and Absolute.
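A minimal sketch of the two modes, assuming per-row Bernoulli sampling for Relative mode and reservoir sampling for Absolute mode (the operator's actual algorithm may differ):

function sampleRelative(rows, fraction) {
  // Each row is kept independently with the given probability,
  // so the expected output size grows with the input size.
  return rows.filter(() => Math.random() < fraction);
}

function sampleAbsolute(rows, size) {
  // Reservoir sampling: the output size stays fixed regardless of input size.
  const reservoir = [];
  rows.forEach((row, i) => {
    if (i < size) reservoir.push(row);
    else {
      const j = Math.floor(Math.random() * (i + 1));
      if (j < size) reservoir[j] = row;
    }
  });
  return reservoir;
}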
Dialog Options
Sample Mode
Two modes are supported:
Relative
Uses a given percentage of rows from the total data. The number of output rows depends on the number of input rows. A larger number of input rows raises the sample size.
Absolute
Uses a given sample size for the number of output rows. A larger number of input rows does not change the sample size.
Percentage
Specifies the percentage of input data to output. The value must be between 0.0 and 1.0 (exclusive). This setting is used only in Relative mode.
Sample Size
Specifies the number of rows wanted in the sample output. The actual number of rows may vary because of the random nature of the sampling. This setting is used only in Absolute mode.
Random Seed
Specifies a seed value for the random number generator. Different seed values can make the output vary slightly in Absolute mode.
Ports
Input Ports
0 - Input data to sample
Output Ports
0 - Sampled data
Remove Duplicates
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RemoveDuplicates Operator to Remove Duplicates.
The Remove Duplicates node removes duplicate rows based on a specified set of group keys. The “first” record of a key value group is pushed to the output. Other records with the same key values are ignored. The “first” record of a key group is determined by sorting all rows of each key group by the specified sort keys. If the sort keys are unspecified, then an arbitrary row is output.
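A minimal sketch of the semantics, assuming one group key and one ascending sort key (names are illustrative):

function removeDuplicates(rows, groupKey, sortKey) {
  const first = new Map(); // group key value -> "first" row seen so far
  for (const row of rows) {
    const best = first.get(row[groupKey]);
    // Keep the row that sorts first on the sort key within each key group.
    if (best === undefined || row[sortKey] < best[sortKey]) {
      first.set(row[groupKey], row);
    }
  }
  return [...first.values()];
}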
Dialog Options
Group Keys
Specifies the fields used as the group keys.
Sort Keys
Specifies the fields used as sort keys and the sort ordering.
Ports
Input Ports
0 - Input data
Output Ports
0 - Output rows with duplicates removed
Select Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SelectFields Operator to Select Fields.
Constructs a new record flow with a subset of fields from the source record. Use the Selection check boxes to include or exclude fields, and use the Fields (renamed) text boxes to rename them. The resulting record flow has exactly those fields selected and is in the same order as specified in the selection table, with the new names where applicable.
Dialog Options
Fields Selection
Specifies the selection and naming of fields to include in or exclude from the output:
1. Selection - Check to include or clear to exclude a given field.
2. Field (original) - Specifies the original name of the field from the input data.
3. Field (renamed) - Specifies the corresponding renamed field for the output result.
Operations
Standard operations on field selection:
1. Move Up - Moves the selected fields up by one position.
2. Move Down - Moves the selected fields down by one position.
3. Invert Selection - Inverts (check or clear) the inclusion status of selected fields.
4. Invert All - Inverts (check or clear) the inclusion status of all fields.
5. Select All - Includes all fields.
6. Select None - Excludes all fields.
7. Restore Defaults - Resets the configuration to the original state (all fields included with their original names). Warning: This action removes all custom changes.
Ports
Input Ports
0 - Input
Output Ports
0 - Output with the selected fields, in exactly the order and with the names configured.
Manipulation Nodes
Assert Sorted
Assert Sorted asserts the input data is already sorted.
Columns To Rows
Columns To Rows - Unpivot performs an unpivot of the input data, transposing values from the columns into multiple rows.
Date Value Extraction
Date Value Extraction extracts individual fields from a date or timestamp data field.
Derive Fields
Derive Fields derives new fields from existing fields, using specified expressions.
Missing Value
Missing Value replaces missing values in the input dataset according to the configured actions.
Normalize Values
Normalize Values normalizes values for the selected input fields.
Partition Data
Partition Data forces data to be repartitioned.
Randomize Order
Randomize Order reorders input in a random fashion.
Rank Fields
Rank Fields ranks data fields using the given rank mode.
Regular Expression
Regular Expression performs Regular Expression operations on text input.
Rows To Columns
Rows To Columns - Pivot performs a pivot of the input data, applying the specified aggregations.
Run JavaScript
Run JavaScript runs JavaScript logic for every record of the input.
Run R Snippet
Run R Snippet runs an R script that accepts the input data and produces a result.
Run Script
Run Script runs a script that is invoked for every record in the input data.
Sort
Sort sorts the input data.
Split Field
Split Field splits a string field into multiple fields, using a specified delimiter.
Substring
Substring creates substrings from existing string fields.
Time Difference
Time Difference performs time difference calculations on input data fields.
Trim Whitespace
Trim Whitespace trims leading and trailing whitespace from the specified String fields in the Input record flow.
Type Conversion
Type Conversion allows conversion of fields to compatible types.
Assert Sorted
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the AssertSorted Operator to Assert Data Ordering.
This node asserts that input data is already sorted based on configured source fields and their sort order. Null values sort higher than non-null values under ascending order, lower under descending order. If the data is not sorted, the node will fail during execution.
Using this operator can help avoid unnecessary sorts in nodes connected to the output, as consumers are guaranteed that the data they receive is sorted as specified.
Dialog Options
Field Name
Specifies the source field name expected to be sorted.
Sort Order
Specifies Ascending or Descending sort order of the specified field. Null values sort higher than non-null values under ascending order, lower under descending order.
Ports
Input Ports
0 - Input port containing already sorted data.
Output Ports
0 - Output port containing the same data as the input.
Columns To Rows - Unpivot
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ColumnsToRows Operator to Convert Columns to Rows (Unpivot).
The Columns To Rows node normalizes records by transposing values from columns into multiple rows. You define one or more pivot value families to describe the output fields and the source fields that will be mapped to these fields in the output. Target field datatypes are determined by finding the widest common type of the pivot elements; if no common type exists, an error occurs. All key and value families must define the same number of elements.
You may optionally define a single pivot key family. This lets you provide a context label when transposing rows defined in value families. You set a list of Strings that correspond positionally to the fields defined in the list portion of the defined value families. The size of all lists across both value families and the key family must be the same.
You may optionally define Group Key Fields that will be the fixed or repeating value portion of multiple rows. If this property is unset, group key fields will be determined by the remainder of the source fields not specified as pivot elements.
The following table provides an example set of data that we want to unpivot to a more normalized contact, type, address, and number format.
Contact          Home Address     Home Phone  Business Address  Business Phone
John Smith       8743 Elm Dr.     555-1235    null              555-4567
Ricardo Jockamo  1208 Maple Ave.  555-7654    123 Main St.      null
Sally White      null             null        456 Wall St.      null
To accomplish this unpivot, a pivot key family will be created with the field name of "type" and the labels "home" and "business". We will define two pivot value families with the names "address" and "number". The address family will contain the fields "HomeAddress" and "BusinessAddress", which will map to the home and business types we defined as the pivot key family. The number family will contain the fields "HomePhone" and "BusinessPhone", which will similarly map to the family types. Finally, we will set the "Contact" field as a group key so it will be included in each row. This produces the following unpivoted table.
Contact          type      address          number
John Smith       home      8743 Elm Dr.     555-1235
John Smith       business  null             555-4567
Ricardo Jockamo  home      1208 Maple Ave.  555-7654
Ricardo Jockamo  business  123 Main St.     null
Sally White      home      null             null
Sally White      business  456 Wall St.     null
If, during the columns-to-rows process, all mapped values to the pivot columns are found to be null, that row will not be produced in the output. In the example above, if Pivot Key Family was not defined, the Sally White "home" record would not have appeared in the output since both elements are null. Only because a Pivot Key Family was defined does this record appear in the output.
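A minimal sketch of this unpivot, hard-coding the families from the example above:

function unpivot(rows) {
  const keyLabels = ["home", "business"];                   // pivot key family
  const addressFields = ["HomeAddress", "BusinessAddress"]; // value family "address"
  const numberFields = ["HomePhone", "BusinessPhone"];      // value family "number"
  const out = [];
  for (const row of rows) {
    keyLabels.forEach((label, i) => {
      out.push({
        Contact: row.Contact,           // group key, repeated on every output row
        type: label,
        address: row[addressFields[i]],
        number: row[numberFields[i]],
      });
    });
  }
  return out;
}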
Dialog Options
Pivot Key Field Name
Specifies the name of the field that will contain the family labels if used.
Pivot Key Family
Specifies an ordered list of labels that will be used for each value family.
Pivot Value Families
Specifies a list of the fields that will be added to the output and the source fields that will be unpivoted into each field in the output.
Group Keys
Specifies an optional list of input fields that will comprise the fixed portion of the output.
Ports
Input Ports
0 - Source data
Output Ports
0 - Unpivoted data
Date Value Extraction
 KNIME  This topic describes a KNIME node. It is based on a variation of the DataFlow operator DeriveFields, described in Using the DeriveFields Operator to Compute New Fields.
The Date Value Extraction node extracts individual values from either a date or a timestamp data type. Multiple values may be extracted from each valid input field. Use the configuration dialog to add a new extraction specification; after adding one, select the input field and the type of value to extract, and set the output field name.
Dialog Options
Input Field
Selects an input field from which values will be extracted. Only input fields that are date or timestamp types can be selected.
Value to Extract
Specifies the value type to extract. The following values are supported:
YEAR: four-digit year
MONTH: month of year, from 1 to 12
WEEK_OF_YEAR: week of the year, from 1 to 52
HOUR_OF_DAY: hour of the day, from 0 to 23
MINUTE_OF_HOUR: minute of the hour, from 0 to 59
SECOND_OF_MINUTE: second of the minute, from 0 to 59
DAY_OF_WEEK: day of the week, from 1 to 7
DAY_OF_MONTH: day of the month, from 1 to 31
DAY_OF_YEAR: day of the year, from 1 to 366
Output Field
Specifies the name to give the output field containing the extracted value. The name must be unique and must be provided.
Ports
Input Ports
0 - Input data
Output Ports
0 - Original data plus extracted values
Derive Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields.
Derives new fields from existing fields, using specified expressions.
Each output field is given an expression to be evaluated, and the result of that evaluation is placed in the output field. Input fields are passed to the output unless they are overridden or Drop Underived Fields is selected. Input fields can be overridden by creating a field derivation with the same output field name.
The expressions must follow the DataFlow expression language. This language supports standard comparison operators (=, >, <, >=, <=), Boolean operators (and, or, not), arithmetic operators (+, -, *, /), and many more functions. For information about expression syntax and functions, see Expression Language.
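The effect is comparable to the following sketch, with each derivation written as a JavaScript function in place of a DataFlow expression (illustrative only):

function deriveFields(rows, derivations, dropUnderived) {
  // derivations: { outputFieldName: row => value, ... }
  return rows.map(row => {
    const out = dropUnderived ? {} : { ...row };
    for (const [name, expr] of Object.entries(derivations)) out[name] = expr(row);
    return out;
  });
}

// For example: deriveFields(data, { total: r => r.price * r.qty }, false)
// (field names here are illustrative)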
Dialog Options
Field Derivations
Specifies a list of output fields and corresponding expressions. The first column specifies the output field name. The second column specifies the expression to be evaluated. If any errors exist in the field name or expression, an error icon will appear to the right of the expression. Hover over this icon for more details about the error. To add a new derivation, enter a new field name or expression in the last row.
Remove - Remove the selected field derivation.
Up - Move the selected field derivation up.
Down - Move the selected field derivation down.
Drop Underived Fields
If selected, the input fields will be dropped and only derived fields will be output.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with derived fields added, or derived fields alone.
Missing Value
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReplaceMissingValues Operator to Replace Missing Values.
The Missing Value node replaces missing values in the input data using the configured set of actions. Actions may be specified by data type or by field name. If specified by type, the action is applied to all fields of the given type within the input data.
The following actions are supported:
Take no action (default)
Skip (ignore) the record containing missing values
Replace the missing value with a constant value
Replace the missing value with the mean value
Replace the missing value with the median value
Replace the missing value with the minimum value
Replace the missing value with the maximum value
Replace the missing value with the most frequent value
Note:  The definition order of the actions has significance. The actions by type are applied first. Next the actions by field name are applied. The last action that is applicable to a field is used. In this way, you may specify an action by type and override that action for a specific field by name. Only one action can be applied to any field.
Dialog Options
Actions by Column Type
Specifies the replacement actions by field type.
Actions by Column Name
Specifies the replacement actions by field name.
Ports
Input Ports
0 - Input data
1 - Optional. Summary statistics on input data for use with actions requiring statistical information (such as maximum values). If not supplied, this data is automatically calculated.
Output Ports
0 - Transformed data with missing values replaced
Normalize Values
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the NormalizeValues Operator to Normalize Values.
Normalizes values for each of the selected input fields on each input row. A new field is added to the output for each input field selected.
Statistics such as the mean and the standard deviation are calculated for each of the selected fields and used during the normalization calculation.
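A minimal sketch of z-score normalization for one field; Min-Max would instead use (v - min) / (max - min). The output field name is illustrative:

function zScores(rows, field) {
  const values = rows.map(r => r[field]);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Population standard deviation, for illustration.
  const stddev = Math.sqrt(
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length);
  // A new field is added alongside the original.
  return rows.map(r => ({ ...r, [field + "_zscore"]: (r[field] - mean) / stddev }));
}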
Dialog Options
Column Selection
Specifies the list of input fields to normalize. Only numeric fields are supported.
Normalization Method
Specifies the method of normalization to apply. Min-Max and z-score are supported.
Ports
Input Ports
0 - Input data
1 - Optional. Summary statistics on input data to use in normalization. If not supplied, this data is automatically calculated.
Output Ports
0 - Original data plus normalized fields
Partition Data
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the PartitionHint Operator to Explicitly Partition Data.
Repartitions data across execution nodes. It can be used with the DataFlow Executor to control how DataFlow graphs are broken into sequentially executed segments. It is used for advanced performance-tuning techniques. Typically, using this node is not required.
Dialog Options
Partition Scheme
Specifies the method of partitioning data. It offers several options:
Balanced - Data is evenly redistributed to all partitions. Each partition receives roughly the same number of rows.
Hash - Data is redistributed so that records with the same values for key fields are sent to the same partition. There is no guarantee with respect to resulting partition sizes.
Range - Data is redistributed based on value ranges. Ranges are automatically estimated from the data; each partition receives roughly the same number of rows.
Partition Keys
Specifies the fields to use in the partitioning scheme. The balanced scheme is independent of field values, so does not require partition keys. Other schemes require at least one key field. The ordering of keys is significant.
Ports
Input Ports
0 - Input port containing data to repartition.
Output Ports
0 - Output port containing the same data as the input, but repartitioned.
Randomize Order
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Randomize Operator to Randomize Partitioning.
The Randomize Order node reorders the input in a random fashion. The output is identical to the input, but ordered randomly and with each record on a random partition.
Dialog Options
Random seed
Sets the random seed to use.
Ports
Input Ports
0 - Input port containing data to randomly reorder
Output Ports
0 - Output port containing the same data as the input, but randomly ordered
Rank Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Rank Operator to Rank Data.
The Rank Fields node ranks data using the given rank mode. The data is grouped by the given partition fields and sorted within each group by the ranking fields.
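The ranking modes described under Settings below differ only in how ties advance the rank; a minimal sketch over values already sorted in rank order:

function rank(sortedValues, mode) {
  const ranks = [];
  let current = 0;
  sortedValues.forEach((v, i) => {
    if (mode === "ordinal") current = i + 1;              // every item gets a distinct rank
    else if (i === 0 || v !== sortedValues[i - 1]) {
      current = mode === "dense" ? current + 1 : i + 1;   // dense: no gaps; standard: gaps
    }
    ranks.push(current);
  });
  return ranks;
}

// rank([10, 10, 20], "standard") -> [1, 1, 3]
// rank([10, 10, 20], "dense")    -> [1, 1, 2]
// rank([10, 10, 20], "ordinal")  -> [1, 2, 3]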
Dialog Options
Partition Field Selection
Field Name
Displays the fields that have been selected to partition the rankings. The topmost field has the highest precedence, with following fields given decreasing precedence in order.
Rank Field Selection
Field Name
Displays the fields that have been selected to rank the input data. The topmost field has the highest precedence, with following fields used to determine how to order records where the primary ranked field's data is equal.
Sort Order
Specifies the ordering that should be used when ranking a field.
Operations
Add
Adds a field to the associated table.
Remove
Removes a field from the associated table.
Move Up
Moves the selected field up in the associated table.
Move Down
Moves the selected field down in the associated table.
Clear
Clears all fields from the associated table.
Settings
Ranking Mode
Sets the ranking mode used:
Standard - Standard or competitive ranking that leaves gaps in assigned ranks.
Dense - Dense ranking, similar to standard but does not leave gaps.
Ordinal - Ordinal ranking where each item receives a distinct rank.
Rank Output Field Name
Sets the name of the output field that will contain the ranking order for each record in the output data.
Ports
Input Ports
0 - Input
Output Ports
0 - Output data sorted by rank field and grouped by partitions.
Regular Expression
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields.
Performs regular expression operations on text input. The output from the regular expression is placed in the column specified by the Target Column field. If no field is specified, then a new column is created.
Dialog Options
Source column
Specifies the name of the column whose cells should be processed.
Pattern
Specifies a regular expression pattern.
Replacement
This text replaces the previous value in the cell if the pattern specified in the Pattern field matches.
Target Column
Specifies the field to store the output; this can be the same as the source column.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with replaced values or an additional column.
Rows To Columns - Pivot
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RowsToColumns Operator to Convert Rows to Columns (Pivot).
The Rows To Columns node is used to pivot data from a narrow representation (rows) into a wider representation (columns).
The data is first segmented into groups using a defined set of group keys. The ordering of the group keys is important, as it defines how the data is partitioned and ordered for the pivot operation. A pivot key field provides the distinct values that will be used as the pivot point. This field must be a string or enumerated type. A column is created in the output for each distinct value of the pivot key.
An aggregation is defined, which is performed on each data grouping defined by the group keys and the pivot key. The result of the aggregation for each unique value of the pivot key appears in the appropriate output column.
The following table provides an example set of data that we want to pivot by Region. There are only four regions: North, South, East, and West. For each item, we want to compute the total sales per region. Items can show up multiple times in a region because the data is also organized by store.
ItemID  StoreID  Region  Sales
1       10       North   1000
1       15       North   800
1       20       South   500
1       30       East    700
2       40       West    1200
2       10       North   500
2       15       North   200
To accomplish this pivot, the ItemID will be used as the group key. The Region will be used as the pivot key. And the Sales column will be the pivot value, aggregating by summing the values. The pivot key values are "North", "South", "East" and "West". The result of the pivot is shown in the following table.
Note that the sales total for the West region for item 1 is empty. Scanning the input data shows that no sales were present in the West region for item 1. Item 1 did have two sales values for the North region. Those values (1000 and 800) are summed and the total (1800) appears in the North region column for item 1. Values with a ? indicate a null or nonexistent value.
ItemID  North  South  East  West
1       1800   500    700   ?
2       700    ?      ?     1200
The key concepts to understand in using the Rows To Columns node are:
Using a set of columns to segment the data into groups for pivoting. This set of columns is called the group keys. The example used the ItemID as the group key.
A categorical valued column whose distinct values are used as columns in the output. This is the pivot key. The example used the Region as the pivot key.
A column that can be aggregated for each data grouping and within each pivot key. These are the pivot values. The example used the Sales column as the pivot value.
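A minimal sketch of this pivot, hard-coding the group key, pivot key, and sum aggregation from the example:

function pivot(rows, pivotKeyValues) {
  const groups = new Map(); // ItemID -> output row under construction
  for (const row of rows) {
    let out = groups.get(row.ItemID);
    if (!out) {
      out = { ItemID: row.ItemID };
      for (const region of pivotKeyValues) out[region] = null; // "?" in the table above
      groups.set(row.ItemID, out);
    }
    // Sum aggregation within each (group key, pivot key) cell.
    out[row.Region] = (out[row.Region] ?? 0) + row.Sales;
  }
  return [...groups.values()];
}

// pivot(salesRows, ["North", "South", "East", "West"])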
Dialog Options
Group Keys
Specifies an ordered list of input fields to use when grouping the input data for the pivot operation.
Pivot Key
Specifies the input field to use as the pivot key. Only string fields are supported.
Pivot Column Pattern
Specifies the naming pattern that will be used for new pivot columns. Use the special variables {0} and {1} within the string to insert the pivot key and the aggregation expression into the column name.
Pivot Key Values
Specifies a comma-delimited list of pivot key values.
Aggregations
Specifies an expression defining the aggregations to apply to the pivot value fields.
Ports
Input Ports
0 - Source data
Output Ports
0 - Pivoted data
Run JavaScript
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunJavaScript Operator.
The Run JavaScript node lets you provide JavaScript logic to be executed for every input record. This node is useful when you want to transform the input data in some way and a node does not exist that provides that transformation.
You can also supply a script to run before the first record is handled and after the last record has been processed. The first can be used to initialize variables, connect to a database, define functions, or perform other initialization steps. The script run after the last record can be used to clean up as needed. Neither of these two scripts can access the record data.
You define output fields through the Schema Editor, setting field name and data type. You can optionally import existing schemas or export schema information you have created. You can then add assignment statements in the On Every Record script. The last value set on an output field is pushed to the output. You can reference input fields by name within the script.
For example, we want to calculate the difference between two input fields that are doubles and have the result pushed to an output field using JavaScript. The input fields are named field1 and field2. First, define an output field in the Schema Editor dialog. Ensure the type of the output field is set to double. For this example, the output field will be called diff. The JavaScript code for creating difference follows:
diff = field1 - field2
This is a simple example but demonstrates how to set an output field so that its values appear in the output data of the node. If you set an output variable multiple times, the last value set will be output. If an output variable is not set, it will contain the NULL value.
The field names defined in your source and target schemas are mapped as JavaScript variables during execution and as such are required to comply with JavaScript naming conventions. Variables must start with a letter or '$' character and be composed solely of word characters [a-zA-Z_0-9]. Field names that do not meet this format will be highlighted in red in the source and target schema field panes. The RunJavaScript node will attempt to rectify this at runtime by substituting a compliant alternative name that the engine can correctly resolve.
When composing your script in the editor pane, you can insert a field name by double-clicking on it from the source or target field panes.
In cases where the field name is non-compliant, the name substitution function is automatically applied to the field name when it is inserted in the script. The substitution convention is as follows:
Non-compliant field names are prepended with '$'
Any non-word characters are substituted with underscore '_'
The field name is then appended with '$'
The ordinal position of the field in the source or target fields list is added (0-based) in order to disambiguate cases where character substitution creates duplicate field names.
Examples:
FIRST NAME = $FIRST_NAME$0
LAST_NAME = LAST_NAME
2Age = $2Age$2
Te$ter! = $Te_ter_$3
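A minimal sketch of this substitution convention, assuming the rules are applied exactly as listed above:

function substituteFieldName(name, ordinal) {
  // Compliant names (start with a letter or '$', word characters only) pass through.
  if (/^[a-zA-Z$]\w*$/.test(name)) return name;
  // Otherwise: prepend '$', replace non-word characters with '_',
  // then append '$' and the field's ordinal position.
  return "$" + name.replace(/\W/g, "_") + "$" + ordinal;
}

// substituteFieldName("FIRST NAME", 0) -> "$FIRST_NAME$0"
// substituteFieldName("Te$ter!", 3)    -> "$Te_ter_$3"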
Dialog Options
Before First Record Script
Specifies any script code to be executed before the first data record is processed. This is the place to put initialization code.
On Every Record Script
Specifies any script code to be evaluated once per input record. Any values you set on output fields will be pushed to the output data record.
After Last Record Script
This script is executed after the last data record is processed. Put your clean up code here.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator.
Note:  This can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Validate Script
Compiles the given snippet of JavaScript based on the selected editor tab and captures any warnings. Any errors will cause an exception to be issued. The detailed information about the error will be displayed in the lower pane of the editor.
Schema Editor
Launches the Schema Editor dialog to define output fields that will contain the results of the transformations in your script. Output fields must be defined either explicitly or by importing a schema.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by user defined script.
Run R Snippet
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunRScript Operator to Invoke R Scripts.
The Run R Snippet node allows writing R code to process the input data, producing a result that will be pushed to the output. The R script will be passed a data frame in the variable named R. Each field of the input data set will be represented as a column in the input data frame. The R code can use the data frame containing the input data as needed. The results should be placed back into a data frame using the R variable name again.
Note that all of the input data will be gathered and loaded into the R environment at run time. This implies the data must fit into memory within the R memory space.
R must be installed and configured on every machine that will execute this node. When run in distributed mode, R must be installed on each worker node using a consistent installation path. The node requires that the path to the Rscript executable within the R installation be set. The Rscript executable is invoked to run the R code.
This node is parallelized by default. This assumes that the R code has no data dependencies, as multiple instances of the node will be executed at run time. Each instance will handle a subset of the input data depending on the current distribution and ordering of the data. If your R code has data dependencies and cannot be run in parallel (or distributed), check the Disable parallelism box. Disabling parallelism will likely have a negative effect on performance.
The output schema (type) of the node must be set if the R code outputs a data frame with a different schema than the input data frame. Create new output fields and specify their types to set the output schema. Your R code must output a data frame that matches the defined output schema. If a mismatch is found, a run time error will occur.
Two variables are preset into the R environment. The variable partitionID is a zero-based identifier of the partition containing the current instance of the R snippet operator. The variable partitionCount specifies the total number of data partitions in the current execution environment. These variables are both numeric and can be used when partition information is needed within the user provided R script.
The field names defined in your source and target schemas are mapped as R variables during execution and as such are required to comply with R naming conventions. Variables must comply with the following regex or undergo automatic variable name remapping: ^[\\.]?[a-zA-Z_]+[\\.0-9a-zA-Z_]*$. Field names that do not meet this format will be highlighted in red in the source and target schema field panes. The Run R Snippet node will attempt to rectify this at run time by substituting a compliant alternative name that the engine can correctly resolve.
When composing your script in the editor pane, you can insert a field name by double-clicking on it from the source or target field panes. In cases where the field name is non-compliant, the name substitution function is automatically applied to the field name when it is inserted in the script. The substitution convention is as follows:
Non-compliant field names are prepended with '._'
Any non-word characters or periods are substituted with underscore '_'
The field name is then appended with '.'
The ordinal position of the field in the source or target fields list is added (zero-based) in order to disambiguate cases where character substitution creates duplicate field names.
Examples
FIRST NAME = ._FIRST_NAME.0
LAST_NAME = LAST_NAME
2Age = ._2Age.2
Te$ter! = ._Te_ter_.3
Dialog Options
Path to Rscript
Specifies the file system path to the Rscript executable. This executable must be used to run the given R code. The R executable is used for interactive sessions but is not used for executing a single script of R code. The Rscript executable can be found in the installation of R in the bin directory. This property is required.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator. Note that this can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Enable Full Data Distribution
When checked, enables full data distribution for the input data to the script node. Usually, this option should not be enabled. Use this option to ensure that every replication of the scripting operator sees all of the input data. This is needed, for instance, when the script is using vertical partitioning to have each instance work on a different set of columns of the input data. In this case, each data stream must contain all of the input rows for the results to be accurate.
Output Fields
Define output fields that will contain the results of your transformations in your script. Output fields must be defined. Specify the output field name and type. Defaults are provided.
Script body
This script will be evaluated using the R engine for the input data. Use the R variable to process the input data. Set the R variable to contain the desired output.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by R snippet
Run Script
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunScript Operator.
The Run Script node allows you to provide a script that will be executed for every input record. The language of the script can be selected. Several of the primary scripting languages available on the JVM are supported. This node is useful when you want to transform the input data in some way and a node does not exist that provides that transformation.
You can also supply a script to run before the first record is handled and after the last record has been processed. The first can be used to initialize variables, connect to a database, or perform other initialization steps. The script run after the last record can be used to clean up as needed. Neither of these two scripts can access the record data.
Define output fields by adding a new output field, setting its name and type. You can then add assignment statements in the “On Every Record” script. The last value set on an output field is pushed to the output. You can reference input fields by name within the script.
For example, we want to calculate the difference between two input fields that are doubles and have the result pushed to an output field using JavaScript. The input fields are named field1 and field2. First, define an output field in the configuration dialog. Ensure the type of the output field is set to double. For this example, the output field will be called diff. The JavaScript code for creating difference follows:
diff = field1 - field2
This is a simple example but demonstrates how to set an output field so that its values appear in the output data of the node. If you set an output variable multiple times, the last value set will be output. If an output variable is not set, it will contain the NULL value.
Dialog Options
Language
Specifies the language of the scripts. Select one of the supported languages. JavaScript is the default.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator.
Note:  This can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Output Fields
Defines output fields that will contain the results of your transformations in your script. Output fields must be defined. Specify the output field name and type. Defaults are provided.
Before First Record Script
Specifies the script code to be executed before the first data record is processed. This is the place to put initialization code.
On Every Record Script
This script will be evaluated once per input record. Any values you set on output fields will be pushed to the output data record.
After Last Record Script
This script is executed after the last data record is processed. Put your clean-up code here.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by user defined script.
Sort
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Sort Operator to Sort Data Sets.
Sorts input data based on configured source fields and their sort order. For ascending order, null values sort higher than non-null. For descending order, null values sort lower than non-null.
The Sort node may not be needed, since other nodes often explicitly specify data distribution and data ordering. For example, the Join node sorts its inputs itself when required (or avoids sorting entirely with its Hash Join setting), so you do not need to insert a sort upstream. However, you may need a sort before nodes such as Run JavaScript, where execution of JavaScript code may have a data order dependency that the DataFlow execution environment is not aware of.
The Sort node is commonly used when results are written to a file, by inserting it just before the final writer node to achieve the needed output order.
Dialog Options
Field Name
Source field name used for sorting.
Sort Order
Ascending or Descending Sort order of the specified field. Null values sort higher than non-null values under ascending order, lower under descending order.
Ports
Input Ports
0 - Input
Output Ports
0 - Output port containing the results of the sort operation.
Split Field
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SplitField Operator to Split Fields.
Splits a string field into multiple fields using a specified delimiter. The output fields are specified by mapping split indices to field names.
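A minimal sketch of the mapping, using the date example from Result Mapping below:

function splitField(row, field, pattern, mapping) {
  const parts = String(row[field]).split(pattern);
  const out = { ...row };
  // mapping: { splitIndex: outputFieldName, ... }
  for (const [index, name] of Object.entries(mapping)) out[name] = parts[index];
  return out;
}

// splitField({ d: "3/14/2024" }, "d", "/", { 0: "Month", 1: "Day", 2: "Year" })
// -> { d: "3/14/2024", Month: "3", Day: "14", Year: "2024" }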
Dialog Options
Split Field
Specifies the name of the string field to be split.
Split Pattern
Specifies a regular expression pattern as the delimiter.
Result Mapping
Maps split indices to output fields. For example, if splitting a date field formatted as "m/d/y" on "/", a mapping could be 0 => Month, 1 => Day, 2 => Year.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with result fields added
Substring
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
Creates length-based substrings from existing string fields and overlays them on the output. Setting the Target field name value to an existing Source field will effectively overwrite that Source field with the resultant substring. A null input value results in a null output value.
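A minimal sketch of the offset/length semantics, including the null pass-through (parameter names mirror the options below):

function substringField(row, source, offset, length, target) {
  const value = row[source];
  return {
    ...row,
    // A null input value yields a null output value.
    [target]: value == null ? null : String(value).slice(offset, offset + length),
  };
}

// substringField({ code: "AB-1234" }, "code", 3, 4, "digits")
// -> { code: "AB-1234", digits: "1234" }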
Dialog Options
Source
Specifies the source field on which you wish to perform the substring operation.
Length
Specifies the length of the substring you wish to create.
Offset
Specifies the starting index for the substring. Default: 0 (the beginning of the string).
Target
Specifies the name of the field that will contain the substring. If this is the same as an existing Source field, it will effectively overwrite that field with the resultant substring.
Ports
Input Ports
0 - Input
Output Ports
0 - Output containing substring fields
Time Difference
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
The Time Difference node performs time difference calculations that you configure. Several calculations can be defined for a single node. Each calculation may compare two input fields, or an input field and either a constant value or the current time at execution. Two time values must be provided: a start time and an end time. The calculation computes the difference between the end time and the start time. You must specify the value type for the start and the end times. Valid values are:
FIELD: a field from the input data record is selected
CONSTANT: a constant date/time value is provided
NOW: the current time at execution will be used
The start and end values must be of compatible types. A difference cannot be taken between a date and a time of day. If one value represents a date or time of day, the other may be a timestamp, in which case the timestamp is truncated to the matching type based on the default time zone.
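A minimal sketch of a day-granularity difference between two date values, with the Scale option applied by rounding (JavaScript Date objects stand in for the node's date types):

function timeDifferenceDays(start, end, scale) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const days = (end.getTime() - start.getTime()) / msPerDay;
  // Scale 0 rounds to the nearest integer; scale 1 keeps one fractional digit, etc.
  const factor = 10 ** scale;
  return Math.round(days * factor) / factor;
}

// timeDifferenceDays(new Date("2024-01-01"), new Date("2024-03-01"), 0) -> 60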
Dialog Options
Start Time Value Type
Specifies the type of input value to use for the start time. The input value may be a field of the input data, a constant value, or the current time at execution.
Start Time Input Field
When the value type is specified as FIELD, an input field is selected to provide the start time values.
Start Time Constant Value
When the value type is specified as CONSTANT, the value of this option is used as the start time. The value must be in the ISO date, time, or timestamp format, as is appropriate.
End Time Value Type
Specifies the type of input value to use for the end time. The input value may be a field of the input data, a constant value, or the current time at execution.
End Time Input Field
When the value type is specified as FIELD, an input field is selected to provide the end time values.
End Time Constant Value
When the value type is specified as CONSTANT, the value of this option is used as the end time. The value must be in the ISO date, time, or timestamp format, as is appropriate.
Granularity
The granularity of the result of the time difference calculation. Only granularities which make sense for the given values are displayed. For dates, granularities can be no smaller than a day. For time of day, granularities can be no larger than an hour.
Output Field
Specifies the name to give the field in the output data containing the result of this calculation. This name must be unique in the namespace of the output data record.
Scale
Specifies the fractional scale of the output data values. The default scale is zero, which provides no fractional digits; values are rounded to the nearest integer. A scale of one provides a single digit to the right of the decimal point, and so on.
Ports
Input Ports
0 - Input data
Output Ports
0 - Original data plus calculated time differences.
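As a rough illustration of a single configured calculation, the Java sketch below computes the difference between a CONSTANT start time and NOW, with a granularity of hours and a scale of 1, using java.time. It is a hand-rolled approximation, not the node's implementation; the half-up rounding is an assumption.

    import java.math.BigDecimal;
    import java.math.RoundingMode;
    import java.time.Duration;
    import java.time.LocalDateTime;

    public class TimeDifferenceSketch {
        public static void main(String[] args) {
            // Start Time Value Type: CONSTANT (ISO timestamp format)
            LocalDateTime start = LocalDateTime.parse("2024-06-14T08:00:00");
            // End Time Value Type: NOW (current time at execution)
            LocalDateTime end = LocalDateTime.now();

            // The calculation is end minus start.
            Duration diff = Duration.between(start, end);

            // Granularity: hours; Scale: 1 fractional digit (rounding mode assumed).
            BigDecimal hours = BigDecimal.valueOf(diff.getSeconds())
                    .divide(BigDecimal.valueOf(3600), 1, RoundingMode.HALF_UP);
            System.out.println("hours_since_start = " + hours);
        }
    }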
Trim Whitespace
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
Trims leading and trailing whitespace from the specified String fields in the Input record flow.
Dialog Options
Include/Exclude
Selection of fields on which to trim whitespace.
Ports
Input Ports
0 - Input
Output Ports
0 - Input data with trimmed String fields.
Type Conversion
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
The Type Conversion node constructs a new record flow with the original input fields converted into the specified types. Only supported type conversions are available for use.
The formatting pattern that should be used by the conversion function can also be specified. If the formatting pattern is not specified, the default conversion and formatting behavior will be used. The converted fields will have the same names as the original fields.
Dialog Options
Conversions
Table used to define the available type conversions:
Field Name - Specifies the name of the field from the input data.
Original Type - Specifies the type of the field in the input data.
New Type - Specifies the type to convert the field into.
Format Pattern - Specifies the format pattern that will be used for the conversion, if applicable.
Format Patterns
Various type conversions support different format patterns to cast the data. These are most commonly used when casting to or from string types.
Boolean - When converting to a Boolean you must provide the unquoted "truth, falsity" values separated by a comma, for example, true, false
Date/Timestamp - When converting to or from date types you may use a format string defining the Java DateFormat, for example, yyyy.MM.dd HH:mm:ss Z
Enum - When converting to an enum you must provide a comma separated list of the unquoted enumerable string values, for example, Monday, Tuesday, Wednesday, Thursday, Friday
Numeric - When converting to or from numeric types you may use a format string defining the Java DecimalFormat, for example, ##0.#####E0
String - You may choose to format a string field with one of these options:
TRIMMED - Trims whitespace from the string.
UPPERCASE - Converts all characters to uppercase.
LOWERCASE - Converts all characters to lowercase.
Add
Adds a new conversion.
Remove
Removes the currently selected conversion.
Remove All
Removes all current conversions.
Ports
Input Ports
0 - Input
Output Ports
0 - Output with the selected fields converted to the desired types
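The Date/Timestamp and Numeric pattern styles named above follow the standard Java DateFormat and DecimalFormat syntax, so a short Java sketch can show what the example patterns do. The sample values are illustrative only.

    import java.text.DecimalFormat;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class FormatPatternSketch {
        public static void main(String[] args) throws Exception {
            // Date/Timestamp: parse a string with a Java DateFormat pattern.
            SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss Z");
            Date parsed = df.parse("2024.06.14 08:00:00 +0000");
            System.out.println(parsed);

            // Numeric: format a number with a Java DecimalFormat pattern.
            DecimalFormat nf = new DecimalFormat("##0.#####E0");
            System.out.println(nf.format(12345.6789)); // prints "12.345679E3"
        }
    }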
Analytics Nodes
Association Rules
ARM Model Converter
ARM Model Converter converts a PMML model containing association modeling results into the selected format.
FP-growth
FP-growth mines input transactions for frequent item sets and association rules using the FP-growth algorithm.
Frequent Items
Frequent Items discovers frequent items in a dataset containing items segregated by transactions.
Classifiers
Decision Tree Learner
Decision Tree Learner creates a Decision Tree PMML model for the given input data.
Decision Tree Predictor
Decision Tree Predictor performs classification of input data based on a Decision Tree PMML model.
Decision Tree Pruner
Decision Tree Pruner prunes a Decision Tree Model.
K-Nearest Neighbors Classifier
K-Nearest Neighbors Classifier classifies or predicts unlabeled data using the k-nearest neighbors algorithm.
Naive Bayes Learner
Naive Bayes Learner creates a Naive Bayes PMML model for the given input data.
Naive Bayes Predictor
Naive Bayes Predictor performs classification of input data based on a Naive Bayes PMML model.
SVM Learner
SVM Learner builds a support vector machine model.
SVM Predictor
SVM Predictor performs classification based on a support vector machine model.
Clustering
Cluster Predictor
Cluster Predictor assigns input data to clusters based on the PMML clustering model.
k-Means
k-Means computes k-Means clustering.
Regression
Linear Regression (Learner)
Linear Regression (Learner) performs linear regression learning.
Logistic Regression (Learner)
Logistic Regression (Learner) performs logistic regression learning.
Logistic Regression (Predictor)
Logistic Regression (Predictor) predicts a target value using a previously built logistic regression model.
Regression (Predictor)
Regression (Predictor) predicts a target value using a previously built regression model.
Viz
Diagnostics Chart Drawer
Diagnostics Chart Drawer enables the building of diagnostic charts.
Transformation Nodes
Aggregate Nodes
Cross Join
Cross Join performs a cross join of two data sets.
Group
Group aggregates data values using aggregation functions based on groups of data defined by key fields.
Join
Join performs joining of two datasets by one or more keys.
Union All
Union All performs a union of two flows.
Filter Nodes
Filter Existing Rows
Filter Existing Rows performs filtering of a dataset based on intersection with another dataset.
Filter Rows
Filter Rows filters rows based on defined field predicates.
Limit Rows
Limit Rows selects a subset of the input dataset within a specified range of positions.
Random Sample
Random Sample randomly samples input data.
Remove Duplicates
Remove Duplicates removes duplicate rows based on the defined set of group keys.
Select Fields
Select Fields selects (filters) fields in a record flow.
Manipulation Nodes
Assert Sorted
Assert Sorted asserts the input data is already sorted.
Columns To Rows (Unpivot)
Columns To Rows - Unpivot unpivots data from a wider representation (columns) into a narrow representation (rows).
Date Value Extraction
Date Value Extraction extracts individual fields from a date or timestamp data field.
Derive Fields
Derive Fields derives new fields from existing fields, using specified expressions.
Missing Value
Missing Value replaces missing values in the input dataset according to the configured actions.
Normalize Values
Normalize Values normalizes values for the selected input fields.
Partition Data
Partition Data forces data to be repartitioned.
Randomize Order
Randomize Order reorders input in a random fashion.
Rank Fields
Rank Fields ranks data fields using the given rank mode.
Regular Expression
Regular Expression performs Regular Expression operations on text input.
Rows to Columns (Pivot)
Rows To Columns - Pivot pivots data from a narrow representation (rows) into a wider representation (columns).
Run JavaScript
Run JavaScript runs the JavaScript logic for every record of the input.
Run R Snippet
Run R Snippet runs an R script that accepts the input data and produces a result.
Run Script
Run Script runs a script that is invoked for every record in the input data.
Sort
Sort sorts the input data.
Split Field
Split Field splits a string field into multiple fields using a specified delimiter.
Substring
Substring creates substrings from existing string fields.
Time Difference
Time Difference performs time difference calculations on input data fields.
Trim Whitespace
Trim Whitespace trims leading and trailing whitespace from the specified String fields in the Input record flow.
Type Conversion
Type Conversion allows conversion of fields to compatible types.
Data Explorer
Data Quality Analyzer
Data Quality Analyzer analyzes data quality.
Data Summarizer
Data Summarizer provides data summary statistics.
Data Summarizer Viewer
Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.
Distinct Values
Distinct Values computes all distinct values and their counts for a given column.
Data Matcher
Cluster Duplicates
Cluster Duplicates clusters records from the Discover Duplicates node into groups of similar records.
Cluster Links
Cluster Links clusters records from the Discover Links node into groups of similar records.
Discover Duplicates
Discover Duplicates discovers duplicate records within a data source using fuzzy matching algorithms.
Discover Links
Discover Links discovers duplicate records between two data sources using fuzzy matching algorithms.
Encode
Encode provides a library of phonetic algorithms used for indexing of words by their pronunciation.
Text Processing
Preprocessing
Convert Case
Convert Case performs case conversions on tokenized text.
Dictionary Filter
Dictionary Filter filters the words included in the dictionary input from a tokenized text column.
Length Filter
Length Filter filters the tokens in a tokenized text column based on their length.
Punctuation Filter
Punctuation Filter filters the punctuation tokens in a tokenized text column.
Regex Filter
Regex Filter filters the tokens in a tokenized text column based on a regular expression.
Text Stemmer
Text Stemmer stems the tokenized text.
Text Tokenizer
Text Tokenizer tokenizes a string field as an object that is used for processing the text.
Word List Filter
Word List Filter filters the tokens in a tokenized text column based on a list of words.
Statistics
Calculate N-grams
Calculate N-grams creates a list of n-grams using the specified n from the TextToken input column.
Calculate Word Frequencies
Calculate Word Frequencies calculates the frequency of each unique word token in a tokenized text input column.
Count Tokens
Count Tokens counts the number of a particular text element token type.
Expand Frequency
Expand Frequency expands a word frequency or n-gram frequency field.
Expand Text Tokens
Expand Text Tokens expands a TextToken input column by token type.
Frequency Filter
Frequency Filter filters the frequencies in the frequency field.
Last modified date: 06/14/2024