Transformation Nodes
Aggregate Nodes
Cross Join
Group
Join
Union All
Filter Nodes
Filter Existing Rows
Filter Rows
Limit Rows
Random Sample
Remove Duplicates
Select Fields
Manipulation Nodes
Assert Sorted
Columns To Rows - Unpivot
Date Value Extraction
Derive Fields
Missing Value
Normalize Values
Partition Data
Randomize Order
Rank Fields
Regular Expression
Rows To Columns - Pivot
Run JavaScript
Run R Snippet
Run Script
Sort
Split Field
Substring
Time Difference
Trim Whitespace
Type Conversion
Aggregate Nodes
Cross Join
Cross Join performs a cross join of two data sets.
Group
Group aggregates data values using aggregation functions based on groups of data defined by key fields.
Join
Join performs joining of two datasets by one or more keys.
Union All
Union All performs a union of two flows.
Cross Join
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CrossJoin Operator to Cross Products.
The Cross Join node combines two input flows and produces the cartesian product of the sets. The output is typed using the merge of the two input types. If the right type contains a field already named in the left type, it will be renamed to avoid collision.
Use this node with caution. It produces a product of its two input data sets. Even with fairly small data sets, the amount of output data can be large.
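For example, crossing a left input of 10,000 rows with a right input of 2,000 rows produces 20,000,000 output rows.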
Dialog Options
Row Buffer Size
Specifies the size (in rows) of the memory buffer used to cache input rows for join processing. Larger values can increase performance due to decreased intermediate file buffering.
Ports
Input Ports
0 - Input port representing the Left data set.
1 - Input port representing the Right data set.
Output Ports
0 - Output port containing the results of the cross join operation.
Group
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Group Operator to Compute Aggregations.
The Group node aggregates data within key groups. A set of aggregation functions is available, including average, count, minimum value, maximum value, standard deviation, and others. If no key fields are provided, the aggregation functions are applied to all of the data as one large group.
The node uses groups of consecutive equal keys ("key groups") to determine which data values to aggregate. The input data need not be sorted; if it is already sorted, performance will be optimal.
The output of the Group node will include the key fields (if specified) and the outcome of each aggregation function. A row of output is generated for each distinct set of key values. If no key fields are specified, a single row will be output with the aggregation results for all input rows.
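For example, grouping on a hypothetical Region key field with an average aggregation applied to a Sales field produces one output row per distinct Region value, each containing that region's average Sales.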
Dialog Options
Key Fields
Specifies the fields to use as keys that define key groups for aggregations. If using the sort mode, the data must be sorted by the given key fields. Key fields are not required. If key fields are not supplied, the data will be aggregated as a single key group.
Aggregations
Aggregation functions can be applied to data fields within the input data source. Click the Add button to create a new aggregation operation. Select an aggregation function and the input field. Some aggregations such as correlation require two input fields.
Ports
Input Ports
0 - Input data to aggregate.
Output Ports
0 - Output data containing aggregation results.
Join
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Join Operator to Do Standard Relational Joins.
The Join node joins two inputs by the specified key fields. Inner joins and the various types of outer joins are supported by specifying the join type.
In all join modes, repeated key values on the left and right sides are supported. If a key group contains multiple rows on both sides, then all combinations of left and right rows in the key group appear in the output for that key group. Note that in this case, memory consumption is lower if the side with more key repetition is linked to the left input.
When using the Hash Join option, link the smaller of the two data sources to the right input to reduce memory consumption. When using Hash Join, there is no requirement to sort the input data.
Dialog Options
Join type
Specifies the type of join to execute: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER.
Hash Join
Uses the join operator that creates an in-memory index from the right side input. Using this option requires all of the data from the right side input to fit into memory. If not using hash join, both left and right sides will be sorted if they are not already.
Merge key fields
Sets whether to merge the left and right key fields into a single field for output.
Filter Inputs
Opens a dialog that lets you set which fields from the left and right inputs will be pre-filtered before the join.
Left key fields
Specifies the fields of the left input data set to use as the key fields in the join operation.
Right key fields
Specifies the fields of the right input data set to use as the key fields in the join operation.
The Join Condition can optionally be set using the Simple Predicate view or Predicate Expression view.
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
Specifies that all the predicates must be true for the join condition.
Any conditions true
Specifies that at least one predicate must be true for the join condition.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input port representing the Left data set.
1 - Input port representing the Right data set.
Output Ports
0 - Output port containing the results of the join operation.
Union All
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the UnionAll Operator to Create Union of Data Sets.
The Union All node combines two input flows into a single flow. Usually no guarantee is provided as to the ordering of the rows in the resulting flow. The operator attempts to provide data with the minimum of delay, following a first-come, first-served policy as closely as possible. However, if the data order specified in the metadata of both input flows is the same, then the node will attempt to preserve the sorted data order. The metadata will have its data order specified if the flow has gone through a Sort or Assert Sorted node.
The output type can be provided by using the Schema Editor. If a schema is not provided, the output type, including the field names, can be automatically derived based on the settings provided.
If a schema is provided, the inputs will be manipulated to match the output schema. The fields are mapped from the inputs to the output by field name. Fields contained in an input but not in the output will be dropped. Fields contained in the output but not in an input will contain null values. The input fields will be converted to their matching output field type as needed. If a conversion is not supported, the configuration will fail.
For example, if an input field has the type string and the output type is a date, the input values will be converted according to the schema specification for the output field. In the case of a date, a pattern to use for the conversion can be provided. If one is not provided, a default pattern will be applied.
Dialog Options
Automatically Configure Schema
Automatically determines the appropriate output schema based on the selected options.
Map by Position
Automatically maps the fields in the inputs by position.
Map by Name
Automatically maps the fields in the inputs by name.
Include Extra Fields
Includes extra fields that are only present in one side of the input when automatically determining the output schema. Otherwise, extra fields will not be included in the output.
Manually Configure Schema
Enables the Schema Editor for manual configuration of the output schema. This enables the schema preview.
Schema Editor
Used to define the structure of the output schema, identifying the field names and types.
Name
Defines the name of the selected field. Field names must be unique.
Type
Defines the data type of the selected field. Additional refinement of the type describing the allowed values and format will be editable in the Field Details, as is appropriate.
Default Null Indicator
Defines the default string value used to indicate a null in text. A number of commonly used options are predefined for selection.
Generate Schema
Replaces the current schema with a default one based on the available input fields.
Load Schema
Reads a schema from a file, replacing the current schema.
Save Schema
Writes the current schema to a file for future use. The schema must currently be in a valid state before it can be saved.
Field Details
Used to refine the definition of the currently selected schema field.
Format
Defines the text formatting for the selected field, if allowed for the selected field type. A number of commonly used formats are predefined for selection. Formatting is type-dependent:
Appropriate formats for dates and timestamps are any supported by SimpleDateFormat.
Appropriate formats for numeric types are any supported by NumberFormat.
Strings do not support custom formats, although they do support choosing whether whitespace should be trimmed.
If default formatting for the type should be used, select Use default format. For numeric types, default formatting is default Java formatting. For dates and timestamps, default formatting follows ISO-8601 formatting.
Null Indicator
Defines the string value indicating a null in the text for the selected field. A number of commonly used values are predefined for selection.
If the schema default should be used, select Use default null indicator.
Allowed Values
If the selected field type is enumerated, specifies the possible values that the enumeration can take.
Ports
Input Ports
0 - Left Input
1 - Right Input
Output Ports
0 - The union of the two inputs
Filter Nodes
Filter Existing Rows
Filter Existing Rows performs filtering of a dataset based on intersection with another dataset.
Filter Rows
Filter Rows filters rows based on defined field predicates.
Limit Rows
Limit Rows selects a subset of the input dataset within a specified range of positions.
Random Sample
Random Sample randomly samples input data.
Remove Duplicates
Remove Duplicates removes duplicate rows.
Select Fields
Select Fields selects (filters) fields in a record flow.
Filter Existing Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterExistingRows Operator to Filter by Data Set Membership.
Filters the left input based on the presence of key values in the right input. Performs both a semi-join and anti-join of the data.
In the semi-join, only rows from the left that match one or more rows on the right using the keys are output, whereas in the anti-join, only rows from the left that match no rows on the right are output.
Dialog Options
Hash Join
Uses the join operator that creates an in-memory index from the right side input. Using this option requires all of the data from the right side input to fit into memory. If not using hash join, both left and right sides will be sorted if they are not already.
Left key fields
Specifies the fields of the left input dataset to use as the key fields in the join operation.
Right key fields
Specifies the fields of the right input dataset to use as the key fields in the join operation.
The Join Condition can optionally be set using the Simple Predicate view or Predicate Expression view.
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
All the predicates must be true for the join condition.
Any conditions true
At least one predicate must be true for the join condition.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input port representing the left dataset.
1 - Input port representing the right data set.
Output Ports
0 - Output port containing the results of the semi-join operation.
1 - Output port containing the results of the anti-join operation.
Filter Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterRows Operator to Filter by Predicate.
Filters rows from the source input based on the predicate defined in the configuration. Records passing the predicate—those for which the predicate evaluates to true—are emitted on the standard output.
Records that fail the predicate are emitted on the rejected output.
The filter predicate can be defined in two different ways. In the Simple Predicate view, the filter is defined as a compound statement of simple predefined field predicates. In the Predicate Expression view, the predicate is defined using a built-in expression language, allowing for arbitrarily complex predicates to be defined. The Expression Language is defined at Expression Language.
To use the Simple Predicate view, select the appropriately named tab. The field predicates are displayed as a table. The first column chooses the field to which the predicate applies. The second chooses the predicate. The third provides an additional argument to the predicate; some predicates only require a single input, in which case the third column is not editable for the row. When editing the third column:
If the predicate contains the FIELD label, then pick a field in the input data on the right-hand dropdown.
If the predicate contains the VALUE label, then enter a compatible value in the right-hand text field.
The individual field predicates are combined to produce the filter predicate according to the setting of the radio buttons below the table. If the All Conditions True option is selected, all the predicates must be true; the field predicates are combined using a logical AND. If the Any Conditions True option is selected, at least one predicate must be true; the field predicates are combined using a logical OR.
To use the Predicate Expression view, select the appropriately named tab. A text area will be displayed in which you can enter an expression. If you are moving to this view from the Simple Predicate view, the expression is initially equivalent to the one defined in the simple predicate table. Because the simple view cannot represent all predicates, when you are editing the expression in the Predicate Expression view, the simple predicate tab is disabled. It can be reenabled only if:
The expression is empty.
The expression matches the original expression in the simple view. The expression can be reset to this value at any time by clicking the Reset button.
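For example, assuming hypothetical numeric input fields named quantity and price, a predicate expression such as
quantity > 0 and price <= 100
passes rows satisfying both conditions to the standard output and sends all other rows to the rejected output. See Expression Language for the exact syntax available.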
Dialog Options
In the Simple Predicate view:
Field conditions
Use the Add and Remove buttons to add or remove logical predicates from the criteria list.
All conditions true
All the predicates must be true for the record to be pushed to the output port.
Any conditions true
At least one predicate must be true for the record to be pushed to the output port.
In the Predicate Expression view:
Insert Operator
Adds a selected operator to the expression at the current edit position.
Insert Function
Adds a selected function to the expression at the current edit position.
Insert Field Reference
Adds a selected field to the expression at the current edit position.
Check Expression
Validates the expression text, checking whether it is syntactically correct and evaluates to a Boolean value.
Reset
Restores the expression to the one defined in the Simple Predicate view.
Ports
Input Ports
0 - Input
Output Ports
0 - Output rows that met the conditions of the configuration.
1 - Output rows that did not meet the conditions of the configuration. This is the complement of the rows on output port 0.
Limit Rows
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LimitRows Operator to Limit Output Rows.
Select a subset of the rows from the input based on an initial offset and maximum number. Only rows that are positioned within the specified range are output.
Dialog Options
Start with row
Specifies the position of the first row in the input set to output. The count of rows begins with this row. Input rows before this position are discarded.
Maximum number of rows
Specifies the maximum number of rows from the input set to output. Once this many rows have been output, any remaining input rows are discarded.
Ports
Input Ports
0 - Input
Output Ports
0 - Output rows between the specified positions.
Random Sample
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SampleRandomRows Operator to Sample Data.
This node randomly samples rows from the input data set. Selected rows are pushed to the output of the node. Two modes for sampling are supported: Relative and Absolute.
Dialog Options
Sample Mode
Two modes are supported:
Relative
Uses a given percentage of rows from the total data. The number of output rows depends on the number of input rows. A larger number of input rows raises the sample size.
Absolute
Uses a given sample size for the number of output rows. A larger number of input rows does not change the sample size.
Percentage
Specifies the percentage of input data to output. The value must be between 0.0 and 1.0 (exclusive). This setting is used only in Relative mode.
Sample Size
Specifies the number of rows wanted in the sample output. The actual number of rows may vary because of the random nature of the sampling. This setting is used only in Absolute mode.
Random Seed
Specifies a seed value for the random number generator. Different seed values can make the output vary slightly in Absolute mode.
Ports
Input Ports
0 - Input data to sample
Output Ports
0 - Sampled data
Remove Duplicates
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RemoveDuplicates Operator to Remove Duplicates.
The Remove Duplicates node removes duplicate rows based on a specified set of group keys. The “first” record of a key value group is pushed to the output. Other records with the same key values are ignored. The “first” record of a key group is determined by sorting all rows of each key group by the specified sort keys. If the sort keys are unspecified, then an arbitrary row is output.
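For example, with a hypothetical customer_id group key and an order_date sort key in descending order, the row with the most recent order date is kept for each customer and all other rows for that customer are dropped.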
Dialog Options
Group Keys
Specifies the fields used as the group keys.
Sort Keys
Specifies the fields used as sort keys and the sort ordering.
Ports
Input Ports
0 - Input data
Output Ports
0 - Output rows with duplicates removed
Select Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SelectFields Operator to Select Fields.
Constructs a new record flow with a subset of fields from the source record. Use the Selection check boxes to include or exclude fields, and use the Fields (renamed) text boxes to rename them. The resulting record flow has exactly those fields selected and is in the same order as specified in the selection table, with the new names where applicable.
Dialog Options
Fields Selection
Specifies the selection and naming of fields to include in or exclude from the output:
1. Selection - Check to include or clear to exclude a given field.
2. Field (original) - Specifies the original name of the field from the input data.
3. Field (renamed) - Specifies the corresponding renamed field for the output result.
Operations
Standard operations on field selection:
1. Move Up - Moves the selected fields up by one position.
2. Move Down - Moves the selected fields down by one position.
3. Invert Selection - Inverts (check or clear) the inclusion status of selected fields.
4. Invert All - Inverts (check or clear) the inclusion status of all fields.
5. Select All - Includes all fields.
6. Select None - Excludes all fields.
7. Restore Defaults - Resets the configuration to the original state (all fields included with their original names). Warning: This action removes all custom changes.
Ports
Input Ports
0 - Input
Output Ports
0 - Output with the selected fields in the exact order and names as configured.
Manipulation Nodes
Assert Sorted
Assert Sorted asserts the input data is already sorted.
Columns To Rows
Columns To Rows - Unpivot performs an unpivot of the input data, transposing values from the columns into multiple rows.
Date Value Extraction
Date Value Extraction extracts individual fields from a date or timestamp data field.
Derive Fields
Derive Fields derives new fields from existing fields, using specified expressions.
Missing Value
Missing Value replaces missing values in the input dataset according to the configured actions.
Normalize Values
Normalize Values normalizes values for the selected input fields.
Partition Data
Partition Data forces data to be repartitioned.
Randomize Order
Randomize Order reorders input in a random fashion.
Rank Fields
Rank Fields ranks data fields using the given rank mode.
Regular Expression
Regular Expression performs Regular Expression operations on text input.
Rows To Columns
Rows To Columns - Pivot performs a pivot of the input data, applying the specified aggregations.
Run JavaScript
Run JavaScript runs JavaScript logic for every record of the input.
Run R Snippet
Run R Snippet runs an R script that accepts the input data and produces a result.
Run Script
Run Script runs a script that is invoked for every record in the input data.
Sort
Sort sorts the input data.
Split Field
Split Field splits a string field into multiple fields, using a specified delimiter.
Substring
Substring creates substrings from existing string fields.
Time Difference
Time Difference performs time difference calculations on input data fields.
Trim Whitespace
Trim Whitespace trims leading and trailing whitespace from the specified String fields in the Input record flow.
Type Conversion
Type Conversion allows conversion of fields to compatible types.
Assert Sorted
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the AssertSorted Operator to Assert Data Ordering.
This node asserts that input data is already sorted based on configured source fields and their sort order. Null values sort higher than non-null values under ascending order, lower under descending order. If the data is not sorted, the node will fail during execution.
Using this operator can help avoid unnecessary sorts in nodes connected to the output, as consumers are guaranteed that the data they receive is sorted as specified.
Dialog Options
Field Name
Specifies the source field name expected to be sorted.
Sort Order
Specifies Ascending or Descending sort order of the specified field. Null values sort higher than non-null values under ascending order, lower under descending order.
Ports
Input Ports
0 - Input port containing already sorted data.
Output Ports
0 - Output port containing the same data as the input.
Columns To Rows - Unpivot
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ColumnsToRows Operator to Convert Columns to Rows (Unpivot).
The Columns To Rows node normalizes records by transposing values from columns into multiple rows. You define one or more pivot value families to describe the output fields and the source fields that will be mapped to these fields in the output. Target field datatypes are determined by finding the widest common type of the pivot elements. An exception can occur if no common type is possible. The number of elements defined in a Key Family or Value Family must be equal.
You may optionally define a single pivot key family. This lets you provide a context label when transposing rows defined in value families. You set a list of Strings that correspond positionally to the fields defined in the list portion of the defined value families. The size of all lists across both value families and the key family must be the same.
You may optionally define Group Key Fields that will be the fixed or repeating value portion of multiple rows. If this property is unset, group key fields will be determined by the remainder of the source fields not specified as pivot elements.
The following table provides an example set of data that we want to unpivot to a more normalized contact, type, address, and number format.
Contact         | Home Address    | Home Phone | Business Address | Business Phone
John Smith      | 8743 Elm Dr.    | 555-1235   | null             | 555-4567
Ricardo Jockamo | 1208 Maple Ave. | 555-7654   | 123 Main St.     | null
Sally White     | null            | null       | 456 Wall St.     | null
To accomplish this unpivot, a pivot key family will be created with the field name of "type" and the labels "home" and "business". We will define two pivot value families with the names "address" and "number". The address family will contain the fields "HomeAddress" and "BusinessAddress", which will map to the home and business types we defined as the pivot key family. The number family will contain the fields "HomePhone" and "BusinessPhone", which will similarly map to the family types. Finally, we will set the "Contact" field as a group key so it will be included in each row. This produces the following unpivoted table.
Contact         | type     | address         | number
John Smith      | home     | 8743 Elm Dr.    | 555-1235
John Smith      | business | null            | 555-4567
Ricardo Jockamo | home     | 1208 Maple Ave. | 555-7654
Ricardo Jockamo | business | 123 Main St.    | null
Sally White     | home     | null            | null
Sally White     | business | 456 Wall St.    | null
If, during the columns-to-rows process, all mapped values to the pivot columns are found to be null, that row will not be produced in the output. In the example above, if Pivot Key Family was not defined, the Sally White "home" record would not have appeared in the output since both elements are null. Only because a Pivot Key Family was defined does this record appear in the output.
Dialog Options
Pivot Key Field Name
Specifies the name of the field that will contain the family labels if used.
Pivot Key Family
Specifies an ordered list of labels that will be used for each value family.
Pivot Value Families
Specifies a list of the fields that will be added to the output and the source fields that will be unpivoted into each field in the output.
Group Keys
Specifies an optional list of input fields that will comprise the fixed portion of the output.
Ports
Input Ports
0 - Source data
Output Ports
0 - Unpivoted data
Date Value Extraction
 KNIME  This topic describes a KNIME node. It is based on a variation of the DataFlow operator DeriveFields, described in Using the DeriveFields Operator to Compute New Fields.
The Date Value Extraction node extracts individual values from either a date or a timestamp data type. Multiple values may be extracted from each valid input field. Select the option on the configuration dialog to add a new extraction specification. After a new extraction is added, you can select the input field and the type of value to extract and set the output field name.
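For example, extracting MONTH from a hypothetical timestamp field named purchase_ts into an output field named purchase_month adds a field whose values range from 1 to 12.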
Dialog Options
Input Field
Selects an input field from which values will be extracted. Only input fields that are date or timestamp types can be selected.
Value to Extract
Specifies the value type to extract. The following values are supported:
YEAR: four-digit year
MONTH: month of year, from 1 to 12
WEEK_OF_YEAR: week of the year, from 1 to 52
HOUR_OF_DAY: hour of the day, from 0 to 23
MINUTE_OF_HOUR: minute of the hour, from 0 to 59
SECOND_OF_MINUTE: second of the minute, from 0 to 59
DAY_OF_WEEK: day of the week, from 1 to 7
DAY_OF_MONTH: day of the month, from 1 to 31
DAY_OF_YEAR: day of the year, from 1 to 366
Output Field
Specifies the name to give the output field containing the extracted value. The name must be unique and must be provided.
Ports
Input Ports
0 - Input data
Output Ports
0 - Original data plus extracted values
Derive Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields.
Derives new fields from existing fields, using specified expressions.
Each output field is given an expression to be evaluated, and the result of that evaluation is placed in the output field. Input fields are passed to the output unless they are overridden or Drop Underived Fields is selected. Input fields can be overridden by creating a field derivation with the same output field name.
The expressions must follow the DataFlow expression language. This language supports standard comparison operators (=, >, <, >=, <=), Boolean operators (and, or, not), arithmetic operators (+, -, *, /), and many more functions. For information about expression syntax and functions, see Expression Language.
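For example, assuming hypothetical numeric input fields named price and quantity, a derivation whose output field name is total and whose expression is
price * quantity
adds a total field to every output row. Naming the output field price instead would override the original price values in the output.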
Dialog Options
Field Derivations
Specifies a list of output fields and corresponding expressions. The first column specifies the output field name. The second column specifies the expression to be evaluated. If any errors exist in the field name or expression, an error icon will appear to the right of the expression. Hover over this icon for more details about the error. To add a new derivation, enter a new field name or expression in the last row.
Remove - Remove the selected field derivation.
Up - Move the selected field derivation up.
Down - Move the selected field derivation down.
Drop Underived Fields
If selected, the input fields will be dropped and only the derived fields will be output.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with derived fields added, or derived fields alone.
Missing Value
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ReplaceMissingValues Operator to Replace Missing Values.
The Missing Value node replaces missing values in the input data using the configured set of actions. Actions may be based on supported data types or by field name. If specified by type, the action is applied to all fields of the given type within the input data.
The following actions are supported:
Take no action (default)
Skip (ignore) the record containing missing values
Replace the missing value with a constant value
Replace the missing value with the mean value
Replace the missing value with the median value
Replace the missing value with the minimum value
Replace the missing value with the maximum value
Replace the missing value with the most frequent value
Note:  The definition order of the actions has significance. The actions by type are applied first. Next the actions by field name are applied. The last action that is applicable to a field is used. In this way, you may specify an action by type and override that action for a specific field by name. Only one action can be applied to any field.
Dialog Options
Actions by Column Type
Specifies the replacement actions by field type.
Actions by Column Name
Specifies the replacement actions by field name.
Ports
Input Ports
0 - Input data
1 - Optional. Summary statistics on input data for use with actions requiring statistical information (such as maximum values). If not supplied, this data is automatically calculated.
Output Ports
0 - Transformed data with missing values replaced
Normalize Values
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the NormalizeValues Operator to Normalize Values.
Normalizes values for each of the selected input fields on each input row. A new field is added to the output for each input field selected.
Statistics such as the mean and the standard deviation are calculated for each of the selected fields and used during the normalization calculation.
Dialog Options
Column Selection
Specifies the list of input fields to normalize. Only numeric fields are supported.
Normalization Method
Specifies the method of normalization to apply. Min-Max and z-score are supported.
Ports
Input Ports
0 - Input data
1 - Optional. Summary statistics on input data to use in normalization. If not supplied, this data is automatically calculated.
Output Ports
0 - Original data plus normalized fields
Partition Data
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the PartitionHint Operator to Explicitly Partition Data.
Repartitions data across execution nodes. It can be used with the DataFlow Executor to control how DataFlow graphs are broken into sequentially executed segments. It is used for advanced performance-tuning techniques. Typically, using this node is not required.
Dialog Options
Partition Scheme
Specifies the method of partitioning data. It offers several options:
Balanced - Data is evenly redistributed to all partitions. Each partition receives roughly the same number of rows.
Hash - Data is redistributed so that records with the same values for key fields are sent to the same partition. There is no guarantee with respect to resulting partition sizes.
Range - Data is redistributed based on value ranges. Ranges are automatically estimated from the data; each partition receives roughly the same number of rows.
Partition Keys
Specifies the fields to use in the partitioning scheme. The balanced scheme is independent of field values, so does not require partition keys. Other schemes require at least one key field. The ordering of keys is significant.
Ports
Input Ports
0 - Input port containing data to repartition.
Output Ports
0 - Output port containing the same data as the input, but repartitioned.
Randomize Order
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Randomize Operator to Randomize Partitioning.
The Randomize Order node reorders the input in a random fashion. The output is identical to the input, but ordered randomly and with each record on a random partition.
Dialog Options
Random seed
Sets the random seed to use.
Ports
Input Ports
0 - Input port containing data to randomly reorder
Output Ports
0 - Output port containing the same data as the input, but randomly ordered
Rank Fields
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Rank Operator to Rank Data.
Rank Fields is a node that ranks data using the given rank mode. The data is grouped by the given partition fields and is sorted within the grouping by the ranking fields.
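For example, partitioning by a hypothetical region field and ranking by a sales field in descending order assigns rank 1 to the highest-sales row within each region.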
Dialog Options
Partition Field Selection
Field Name
Displays the fields that have been selected to partition the rankings. The topmost field has the highest precedence; subsequent fields have successively lower precedence, based on their order.
Rank Field Selection
Field Name
Displays the fields that have been selected to rank the input data. The topmost field has the highest precedence; subsequent fields are used to determine how to order records where the primary ranking field's values are equal.
Sort Order
Specifies the ordering that should be used when ranking a field.
Operations
Add
Adds a field to the associated table.
Remove
Removes a field from the associated table.
Move Up
Moves the selected field up in the associated table.
Move Down
Moves the selected field down in the associated table.
Clear
Clears all fields from the associated table.
Settings
Ranking Mode
Sets the ranking mode used:
Standard - Standard or competitive ranking that leaves gaps in assigned ranks.
Dense - Dense ranking, similar to standard but does not leave gaps.
Ordinal - Ordinal ranking where each item receives a distinct rank.
Rank Output Field Name
Sets the name of the output field that will contain the ranking order for each record in the output data.
Ports
Input Ports
0 - Input
Output Ports
0 - Output data sorted by rank field and grouped by partitions.
Regular Expression
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields.
Performs regular expression operations on text input. The output from the regular expression is placed in the column specified by the Target Column option. If no target column is specified, a new column is created.
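For example, with a hypothetical source column named phone, a Pattern of - and an empty Replacement turns a value such as 555-1235 into 5551235, written either back to the source column or to a separate target column.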
Dialog Options
Source column
Specifies the name of the column whose cells should be processed.
Pattern
Specifies a regular expression pattern.
Replacement
This text replaces the previous value in the cell if the pattern specified in the Pattern field matches.
Target Column
Specifies the field in which to store the output; this can be the same as the source column.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with replaced values or an additional column.
Rows To Columns - Pivot
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RowsToColumns Operator to Convert Rows to Columns (Pivot).
The Rows To Columns node is used to pivot data from a narrow representation (rows) into a wider representation (columns).
The data is first segmented into groups using a defined set of group keys. The ordering of the group keys is important, as it defines how the data is partitioned and ordered for the pivot operation. A pivot key field provides the distinct values that will be used as the pivot point. This field must be a string or enumerated type. A column is created in the output for each distinct value of the pivot key.
An aggregation is defined, which is performed on each data grouping defined by the group keys and the pivot key. The result of the aggregation for each unique value of the pivot key appears in the appropriate output column.
The following table provides an example set of data that we want to pivot by Region. There are only four regions: North, South, East, and West. For each item, we want to compute the total sales per region. Items can show up multiple times in a region because the data is also organized by store.
ItemID | StoreID | Region | Sales
1      | 10      | North  | 1000
1      | 15      | North  | 800
1      | 20      | South  | 500
1      | 30      | East   | 700
2      | 40      | West   | 1200
2      | 10      | North  | 500
2      | 15      | North  | 200
To accomplish this pivot, the ItemID will be used as the group key. The Region will be used as the pivot key. And the Sales column will be the pivot value, aggregating by summing the values. The pivot key values are "North", "South", "East" and "West". The result of the pivot is shown in the following table.
Note that the sales total for the West region for item 1 is empty. Scanning the input data shows that no sales were present in the West region for item 1. Item 1 did have two sales values for the North region. Those values (1000 and 800) are summed and the total (1800) appears in the North region column for item 1. Values with a ? indicate a null or nonexistent value.
ItemID | North | South | East | West
1      | 1800  | 500   | 700  | ?
2      | 700   | ?     | ?    | 1200
The key concepts to understand in using the Rows To Columns node are:
Using a set of columns to segment the data into groups for pivoting. This set of columns is called the group keys. The example used the ItemID as the group key.
A categorical valued column whose distinct values are used as columns in the output. This is the pivot key. The example used the Region as the pivot key.
A column that can be aggregated for each data grouping and within each pivot key. These are the pivot values. The example used the Sales column as the pivot value.
Dialog Options
Group Keys
Specifies an ordered list of input fields to use when grouping the input data for the pivot operation.
Pivot Key
Specifies the input field to use as the pivot key. Only string fields are supported.
Pivot Column Pattern
Specifies the naming pattern that will be used for new pivot columns. Use the special variables {0} and {1} within the string to insert the pivot key and the aggregation expression into the column name.
Pivot Key Values
Specifies a comma-delimited list of pivot key values.
Aggregations
Specifies an expression defining the aggregations to apply to the pivot value fields.
Ports
Input Ports
0 - Source data
Output Ports
0 - Pivoted data
Run JavaScript
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunJavaScript Operator.
The Run JavaScript node lets you provide JavaScript logic to be executed for every input record. This node is useful when you want to transform the input data in some way and a node does not exist that provides that transformation.
You can also supply a script to run before the first record is handled and after the last record has been processed. The first can be used to initialize variables, connect to a database, define a function, or perform other initialization steps. The script run after the last record can be used to clean up as needed. Neither of these two scripts can access the record data.
You define output fields through the Schema Editor, setting field name and data type. You can optionally import existing schemas or export schema information you have created. You can then add assignment statements in the On Every Record script. The last value set on an output field is pushed to the output. You can reference input fields by name within the script.
For example, we want to calculate the difference between two input fields that are doubles and have the result pushed to an output field using JavaScript. The input fields are named field1 and field2. First, define an output field in the Schema Editor dialog. Ensure the type of the output field is set to double. For this example, the output field will be called diff. The JavaScript code for creating difference follows:
diff = field1 - field2
This is a simple example but demonstrates how to set an output field so that its values appear in the output data of the node. If you set an output variable multiple times, the last value set will be output. If an output variable is not set, it will contain the NULL value.
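Scripts can also keep state across records. As a minimal sketch, assuming an output field named rowNumber has been defined in the Schema Editor, the Before First Record Script could initialize a counter:
var rowCount = 0;
and the On Every Record Script could update it and assign the output field:
rowCount = rowCount + 1;
rowNumber = rowCount;
Note that when the node runs in parallel, each instance numbers only the subset of records it processes; check Disable Parallelism if a single continuous sequence is required.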
The field names defined in your source and target schemas are mapped as JavaScript variables during execution and as such are required to comply with JavaScript naming conventions. Variables must start with a letter or '$' character and be composed solely of word characters [a-zA-Z_0-9]. Field names that do not meet this format will be highlighted in red in the source and target schema field panes. The RunJavaScript node will attempt to rectify this at runtime by substituting a compliant alternative name that the engine can correctly resolve.
When composing your script in the editor pane, you can insert a field name by double-clicking on it from the source or target field panes.
In cases where the field name is non-compliant, the name substitution function is automatically applied to the field name when it is inserted in the script. The substitution convention is as follows:
Non-compliant field names are prepended with '$'
Any non-word characters are substituted with underscore '_'
The field name is then appended with '$'
The ordinal position of the field in the source or target fields list is added (0-based) in order to disambiguate cases where character substitution creates duplicate field names.
Examples:
FIRST NAME = $FIRST_NAME$0
LAST_NAME = LAST_NAME
2Age = $2Age$2
Te$ter! = $Te_ter_$3
Dialog Options
Before First Record Script
Specifies any script code to be executed before the first data record is processed. This is the place to put initialization code.
On Every Record Script
Specifies any script code to be evaluated once per input record. Any values you set on output fields will be pushed to the output data record.
After Last Record Script
This script is executed after the last data record is processed. Put your clean-up code here.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator.
Note:  This can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Validate Script
Compiles the given snippet of JavaScript based on the selected editor tab and captures any warnings. Any errors will cause an exception to be issued. The detailed information about the error will be displayed in the lower pane of the editor.
Schema Editor
Launches the Schema Editor dialog to define output fields that will contain the results of the transformations in your script. Output fields must be defined either explicitly or by importing a schema.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by user defined script.
Run R Snippet
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunRScript Operator to Invoke R Scripts.
The Run R Snippet node allows writing R code to process the input data, producing a result that will be pushed to the output. The R script will be passed a data frame in the variable named R. Each field of the input data set will be represented as a column in the input data frame. The R code can use the data frame containing the input data as needed. The results should be placed back into a data frame using the R variable name again.
Note that all of the input data will be gathered and loaded into the R environment at run time. This implies the data must fit into memory within the R memory space.
R must be installed and configured on every machine that will execute this node. When run in distributed mode, R must be installed on each worker node using a consistent installation path. The node requires that the path to the Rscript executable within the R installation be set. The Rscript executable is invoked to run the R code.
This node is parallelized by default. This means the R code must have no data dependencies, as multiple instances of the node will be executed at run time. Each instance will handle a subset of the input data depending on the current distribution and ordering of the data. If your R code has data dependencies and cannot be run in parallel (or distributed), check the Disable Parallelism box. Disabling parallelism will likely have a negative effect on performance.
The output schema (type) of the node must be set if the R code outputs a data frame with a different schema than the input data frame. Create new output fields and specify their types to set the output schema. Your R code must output a data frame that matches the defined output schema. If a mismatch is found, a run time error will occur.
Two variables are preset into the R environment. The variable partitionID is a zero-based identifier of the partition containing the current instance of the R snippet operator. The variable partitionCount specifies the total number of data partitions in the current execution environment. These variables are both numeric and can be used when partition information is needed within the user provided R script.
The field names defined in your source and target schemas are mapped as R variables during execution and as such are required to comply with R naming conventions. Variables must comply with the following regex or undergo automatic variable name remapping: ^[\\.]?[a-zA-Z_]+[\\.0-9a-zA-Z_]*$. Field names that do not meet this format will be highlighted in red in the source and target schema field panes. The Run R Snippet node will attempt to rectify this at run time by substituting a compliant alternative name that the engine can correctly resolve.
When composing your script in the editor pane, you can insert a field name by double-clicking on it from the source or target field panes. In cases where the field name is non-compliant, the name substitution function is automatically applied to the field name when it is inserted in the script. The substitution convention is as follows:
Non-compliant field names are prepended with '._'
Any non-word characters or periods are substituted with underscore '_'
The field name is then appended with '.'
The ordinal position of the field in the source or target fields list is added (zero-based) in order to disambiguate cases where character substitution creates duplicate field names.
Examples
FIRST NAME = ._FIRST_NAME.0
LAST_NAME = LAST_NAME
2Age = ._2Age.2
Te$ter! = ._Te_ter_.3
Dialog Options
Path to Rscript
Specifies the file system path to the Rscript executable. This executable must be used to run the given R code. The R executable is used for interactive sessions but is not used for executing a single script of R code. The Rscript executable can be found in the installation of R in the bin directory. This property is required.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator. Note that this can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Enable Full Data Distribution
When checked, enables full data distribution for the input data to the script node. Usually, this option should not be enabled. Use this option to ensure that every replication of the scripting operator sees all of the input data. This is needed, for instance, when the script is using vertical partitioning to have each instance work on a different set of columns of the input data. In this case, each data stream must contain all of the input rows for the results to be accurate.
Output Fields
Define output fields that will contain the results of your transformations in your script. Output fields must be defined. Specify the output field name and type. Defaults are provided.
Script body
This script will be evaluated using the R engine for the input data. Use the R variable to process the input data. Set the R variable to contain the desired output.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by R snippet
Run Script
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RunScript Operator.
The Run Script node allows you to provide a script that will be executed for every input record. The language of the script can be selected. Several of the primary scripting languages available on the JVM are supported. This node is useful when you want to transform the input data in some way and a node does not exist that provides that transformation.
You can also supply a script to run before the first record is handled and after the last record has been processed. The first can be used to initialize variables, connect to a database, or perform other initialization steps. The script run after the last record can be used to clean up as needed. Neither of these two scripts can access the record data.
Define output fields by adding a new output field, setting its name and type. You can then add assignment statements in the “On Every Record” script. The last value set on an output field is pushed to the output. You can reference input fields by name within the script.
For example, we want to calculate the difference between two input fields that are doubles and have the result pushed to an output field using JavaScript. The input fields are named field1 and field2. First, define an output field in the configuration dialog. Ensure the type of the output field is set to double. For this example, the output field will be called diff. The JavaScript code for creating difference follows:
diff = field1 - field2
This is a simple example but demonstrates how to set an output field so that its values appear in the output data of the node. If you set an output variable multiple times, the last value set will be output. If an output variable is not set, it will contain the NULL value.
Dialog Options
Language
Specifies the language of the scripts. Select one of the supported languages. JavaScript is the default.
Disable Parallelism
When checked, instructs the DataFlow engine to run a single instance of this operator.
Note:  This can have a significant performance impact as data is “fanned in” in order to execute on a single node. Use this only when your script logic requires running in a non-parallel context.
Output Fields
Defines output fields that will contain the results of your transformations in your script. Output fields must be defined. Specify the output field name and type. Defaults are provided.
Before First Record Script
Specifies the script code to be executed before the first data record is processed. This is the place to put initialization code.
On Every Record Script
This script will be evaluated once per input record. Any values you set on output fields will be pushed to the output data record.
After Last Record Script
This script is executed after the last data record is processed. Put your clean-up code here.
Ports
Input Ports
0 - Input data
Output Ports
0 - Transformed data modified by user defined script.
Sort
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the Sort Operator to Sort Data Sets.
Sorts input data based on configured source fields and their sort order. For ascending order, null values sort higher than non-null. For descending order, null values sort lower than non-null.
The Sort node may not be needed, since other nodes often explicitly specify data distribution and data ordering. For example, the Join node has a Hash Join setting that takes care of sorting input, so you do not need to insert a sort upstream. However, you may need a sort before nodes such as Run JavaScript, where execution of JavaScript code may have a data order dependency that the DataFlow execution environment is not aware of.
The Sort node is commonly used when results are written to a file, by inserting it just before the final writer node to achieve the needed output order.
Dialog Options
Field Name
Source field name used for sorting.
Sort Order
Ascending or Descending Sort order of the specified field. Null values sort higher than non-null values under ascending order, lower under descending order.
Ports
Input Ports
0 - Input
Output Ports
0 - Output port containing the results of the sort operation.
Split Field
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SplitField Operator to Split Fields.
Splits a string field into multiple fields using a specified delimiter. The output fields are specified by mapping split indices to field names.
Dialog Options
Split Field
Specifies the name of the string field to be split.
Split Pattern
Specifies a regular expression pattern as the delimiter.
Result Mapping
Maps split indices to output fields. For example, if splitting a date field formatted as "m/d/y" on "/", a mapping could be 0 => Month, 1 => Day, 2 => Year.
Ports
Input Ports
0 - Source data
Output Ports
0 - Source data with result fields added
Substring
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
Creates length-based substrings from existing string fields and overlays them on the output. Setting the Target field name value to an existing Source field will effectively overwrite that Source field with the resultant substring. A null input value results in a null output value.
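For example, with Offset 0 and Length 4 applied to a hypothetical source field containing 2017-03-15, and a new Target field named year, the output value is 2017.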
Dialog Options
Source
Specifies the source field on which you wish to perform the substring operation.
Length
Specifies the length of the substring you wish to create.
Offset
Specifies the starting index for the substring. Default: 0 (the beginning of the string).
Target
Specifies the name of the field that will contain the substring. If this is the same as an existing Source field, it will effectively overwrite that field with the resultant substring.
Ports
Input Ports
0 - Input
Output Ports
0 - Output containing substring fields
Time Difference
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
The Time Difference node provides time difference calculations that you configure. Several calculations can be configured for the node to execute. Each calculation may compare two input fields, an input field and a constant value, or an input field and the current time at execution. Two time values must be provided, a start time and an end time. The calculation computes the difference between the end time and the start time. You must specify the value type for the start and end times. Valid values are:
FIELD: a field from the input data record is selected
CONSTANT: a constant date/time value is provided
NOW: the current time at execution will be used
The start and end values must be of compatible types. A difference cannot be taken between a date and a time of day. If one value represents a date or time of day, the other may be a timestamp, in which case the timestamp is truncated to the matching type based on the default time zone.
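The arithmetic itself can be sketched with the standard java.time classes. The start and end values, the hour granularity, and the scale settings below are illustrative assumptions; in the node they are set through the dialog options that follow:

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.Duration;
import java.time.LocalDateTime;

public class TimeDifferenceSketch {
    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.parse("2024-01-01T08:00:00");
        LocalDateTime end   = LocalDateTime.parse("2024-01-02T20:30:00");

        Duration diff = Duration.between(start, end);   // end minus start

        // Granularity of hours, scale 0: rounded to the nearest whole hour.
        long wholeHours = Math.round(diff.toMinutes() / 60.0);
        System.out.println(wholeHours);                 // 37

        // Granularity of hours, scale 1: one digit to the right of the decimal point.
        BigDecimal fractionalHours = BigDecimal.valueOf(diff.toMinutes())
                .divide(BigDecimal.valueOf(60), 1, RoundingMode.HALF_UP);
        System.out.println(fractionalHours);            // 36.5
    }
}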
Dialog Options
Start Time Value Type
Specifies the type of input value to use for the start time. The input value may be a field of the input data, a constant value, or the current time at execution.
Start Time Input Field
When the value type is specified as FIELD, an input field is selected to provide the start time values.
Start Time Constant Value
When the value type is specified as CONSTANT, the value of this option is used as the start time. The value must be in the ISO date, time, or timestamp format, as is appropriate.
End Time Value Type
Specifies the type of input value to use for the end time. The input value may be a field of the input data, a constant value, or the current time at execution.
End Time Input Field
When the value type is specified as FIELD, an input field is selected to provide the end time values.
End Time Constant Value
When the value type is specified as CONSTANT, the value of this option is used as the end time. The value must be in the ISO date, time, or timestamp format, as is appropriate.
Granularity
The granularity of the result of the time difference calculation. Only granularities which make sense for the given values are displayed. For dates, granularities can be no smaller than a day. For time of day, granularities can be no larger than an hour.
Output Field
Specifies the name to give the field in the output data containing the result of this calculation. This name must be unique in the namespace of the output data record.
Scale
Specifies the fractional scale of the output data values. The default is a scale of zero, which provides no fractional digits; data values are rounded to the nearest integer. A scale of one provides a single digit to the right of the decimal point, and so on.
Ports
Input Ports
0 - Input data
Output Ports
0 - Original data plus calculated time differences.
Trim Whitespace
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
Trims leading and trailing whitespace from the specified String fields in the Input record flow.
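At the field-value level the behavior corresponds to trimming each selected String value, as in this minimal sketch (illustration only, not the node's code):

public class TrimWhitespaceSketch {
    public static void main(String[] args) {
        String raw = "  Acme Corp  ";
        // Leading and trailing whitespace is removed; interior whitespace is kept.
        System.out.println("[" + raw.trim() + "]");   // [Acme Corp]
    }
}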
Dialog Options
Include/Exclude
Selection of fields on which to trim whitespace.
Ports
Input Ports
0 - Input
Output Ports
0 - Input data with trimmed String fields.
Type Conversion
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
The Type Conversion node constructs a new record flow with the original input fields converted into the specified types. Only supported type conversions are available for use.
The formatting pattern that should be used by the conversion function can also be specified. If the formatting pattern is not specified, the default conversion and formatting behavior will be used. The converted fields will have the same names as the original fields.
Dialog Options
Conversions
Table used to define the available type conversions:
Field Name - Specifies the name of the field from the input data.
Original Type - Specifies the type of the field in the input data.
New Type - Specifies the type to convert the field into.
Format Pattern - Specifies the format pattern that will be used for the conversion, if applicable.
Format Patterns
Various type conversions support different format patterns to cast the data. These are most commonly used when casting to or from string types. A brief sketch of the date and numeric pattern syntax appears at the end of this topic.
Boolean - When converting to a Boolean you must provide the unquoted "truth, falsity" values separated by a comma, for example, true, false
Date/Timestamp - When converting to or from date types you may use a format string defining the Java DateFormat, for example, yyyy.MM.dd HH:mm:ss Z
Enum - When converting to an enum you must provide a comma separated list of the unquoted enumerable string values, for example, Monday, Tuesday, Wednesday, Thursday, Friday
Numeric - When converting to or from numeric types you may use a format string defining the Java DecimalFormat, for example, ##0.#####E0
String - You may choose to format a string field with one of these options:
TRIMMED - Trims whitespace from the string.
UPPERCASE - Converts all characters to uppercase.
LOWERCASE - Converts all characters to lowercase.
Add
Adds a new conversion.
Remove
Removes the currently selected conversion.
Remove All
Removes all current conversions.
Ports
Input Ports
0 - Input
Output Ports
0 - Output with the selected fields converted to the desired types
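As a rough illustration of the date and numeric pattern syntax referenced under Format Patterns above, the following sketch uses the standard java.text classes; the sample values are hypothetical:

import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FormatPatternSketch {
    public static void main(String[] args) throws ParseException {
        // Date/Timestamp: the pattern language is that of java.text.SimpleDateFormat.
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss Z");
        Date parsed = dateFormat.parse("2024.12.31 23:59:59 +0000");
        System.out.println(dateFormat.format(parsed));  // re-formatted in the default time zone

        // Numeric: the pattern language is that of java.text.DecimalFormat.
        DecimalFormat numberFormat = new DecimalFormat("##0.#####E0");
        System.out.println(numberFormat.format(12345.6789));  // scientific notation, e.g. 12.3457E3
    }
}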
Analytics Nodes
Association Rules
ARM Model Converter
ARM Model Converter converts a PMML model containing association modeling results into the selected format.
FP-growth
FP-growth mines input transactions for frequent item sets and association rules using the FP-growth algorithm.
Frequent Items
Frequent Items discovers frequent items in a dataset containing items segregated by transactions.
Classifiers
Decision Tree Learner
Decision Tree Learner creates a Decision Tree PMML model for the given input data.
Decision Tree Predictor
Decision Tree Predictor performs classification of input data based on a Decision Tree PMML model.
Decision Tree Pruner
Decision Tree Pruner prunes a Decision Tree Model.
K-Nearest Neighbors Classifier
K-Nearest Neighbors Classifier classifies or predicts unlabeled data using the k-nearest neighbors algorithm.
Naive Bayes Learner
Naive Bayes Learner creates a Naive Bayes PMML model for the given input data.
Naive Bayes Predictor
Naive Bayes Predictor performs classification of input data based on a Naive Bayes PMML model.
SVM Learner
SVM Learner builds a support vector machine model.
SVM Predictor
SVM Predictor performs classification based on a support vector machine model.
Clustering
Cluster Predictor
Cluster Predictor assigns input data to clusters based on the PMML clustering model.
k-Means
k-Means computes k-Means clustering.
Regression
Linear Regression (Learner)
Linear Regression (Learner) performs linear regression learning.
Logistic Regression (Learner)
Logistic Regression (Learner) performs logistic regression learning.
Logistic Regression (Predictor)
Logistic Regression (Predictor) predicts a target value using a previously built logistic regression model.
Regression (Predictor)
Regression (Predictor) predicts a target value using a previously built regression model.
Viz
Diagnostics Chart Drawer
Diagnostics Chart Drawer enables the building of diagnostic charts.
Transformation Nodes
Aggregate Nodes
Cross Join
Cross Join performs a cross join of two data sets.
Group
Group aggregates data values using aggregation functions based on groups of data defined by key fields.
Join
Join performs joining of two datasets by one or more keys.
Union All
Union All performs a union of two flows.
Filter Nodes
Filter Existing Rows
Filter Existing Rows performs filtering of a dataset based on intersection with another dataset.
Filter Rows
Filter Rows filters rows based on defined field predicates.
Limit Rows
Limit Rows selects a subset of the input dataset within a specified range of positions.
Random Sample
Random Sample randomly samples input data.
Remove Duplicates
Remove Duplicates removes duplicate rows based on the defined set of group keys.
Select Fields
Select Fields selects (filters) fields in a record flow.
Manipulation Nodes
Assert Sorted
Assert Sorted asserts the input data is already sorted.
Columns To Rows (Unpivot)
Columns To Rows - Unpivot unpivots data from a wider representation (columns) into a narrow representation (rows).
Date Value Extraction
Date Value Extraction extracts individual fields from a date or timestamp data field.
Derive Fields
Derive Fields derives new fields from existing fields, using specified expressions.
Missing Value
Missing Value replaces missing values in the input dataset according to the configured actions.
Normalize Values
Normalize Values normalizes values for the selected input fields.
Partition Data
Partition Data forces data to be repartitioned.
Randomize Order
Randomize Order reorders input in a random fashion.
Rank Fields
Rank Fields ranks data fields using the given rank mode.
Regular Expression
Regular Expression performs Regular Expression operations on text input.
Rows to Columns (Pivot)
Rows To Columns - Pivot pivots data from a narrow representation (rows) into a wider representation (columns).
Run JavaScript
Run JavaScript runs the JavaScript logic for every record of the input.
Run R Snippet
Run R Snippet runs an R script that accepts the input data and produces a result.
Run Script
Run Script runs a script that is invoked for every record in the input data.
Sort
Sort sorts the input data.
Split Field
Split Field splits a string field into multiple fields using a specified delimiter.
Substring
Substring creates substrings from existing string fields.
Time Difference
Time Difference performs time difference calculations on input data fields.
Trim Whitespace
Trim Whitespace trims leading and trailing whitespace from the specified String fields in the Input record flow.
Type Conversion
Type Conversion allows conversion of fields to compatible types.
Data Explorer
Data Quality Analyzer
Data Quality Analyzer analyzes data quality.
Data Summarizer
Data Summarizer provides data summary statistics.
Data Summarizer Viewer
Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.
Distinct Values
Distinct Values computes all distinct values and their counts for a given column.
Data Matcher
Cluster Duplicates
Cluster Duplicates clusters records from the Discover Duplicates node into groups of similar records.
Cluster Links
Cluster Links clusters records from the Discover Links node into groups of similar records.
Discover Duplicates
Discover Duplicates discovers duplicate records within a data source using fuzzy matching algorithms.
Discover Links
Discover Links discovers duplicate records between two data sources using fuzzy matching algorithms.
Encode
Encode provides a library of phonetic algorithms used for indexing of words by their pronunciation.
Text Processing
Preprocessing
Convert Case
Convert Case performs case conversions on tokenized text.
Dictionary Filter
Dictionary Filter filters the tokens in a tokenized text column based on the words in the dictionary input.
Length Filter
Length Filter filters the tokens in a tokenized text column based on their length.
Punctuation Filter
Punctuation Filter filters the punctuation tokens in a tokenized text column.
Regex Filter
Regex Filter filters the tokens in a tokenized text column based on a regular expression.
Text Stemmer
Text Stemmer stems the tokenized text.
Text Tokenizer
Text Tokenizer tokenizes a string field as an object that is used for processing the text.
Word List Filter
Word List Filter filters the tokens in a tokenized text column based on a list of words.
Statistics
Calculate N-grams
Calculate N-grams creates a list of n-grams using the specified n from the TextToken input column.
Calculate Word Frequencies
Calculate Word Frequencies calculates the frequency of each unique word token in a tokenized text input column.
Count Tokens
Count Tokens counts the number of tokens of a particular text element type.
Expand Frequency
Expand Frequency expands a word frequency or n-gram frequency field.
Expand Text Tokens
Expand Text Tokens expands a TextToken input column by token type.
Frequency Filter
Frequency Filter filters the frequencies in the frequency field.