Text Processing Nodes

All tokens are converted to either lowercase or uppercase, depending on the option selected. If the output column is unspecified, the original input column is overwritten. The conversion produces a new tokenized text object that can then be used for further text processing.

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Target Column

Specifies the name of the target column in the output that will contain the converted tokenized text.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with token cases converted

Dictionary Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DictionaryFilter Operator to Filter Based on Dictionaries.

The Dictionary Filter node can be used to filter all words included in the dictionary input from a tokenized text column.

The dictionary field must be a string column provided through the dictionary input port. All tokens that match any of the strings in the dictionary field will be filtered from the TextToken. The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
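The filtering and inversion behavior described above can be pictured with a small Python sketch. It only illustrates the semantics and is not the node's implementation: tokens are modeled as a plain list of strings, `dictionary_filter` is a hypothetical helper name, and exact, case-sensitive matching is an assumption.

```python
def dictionary_filter(tokens, dictionary, invert=False):
    """Drop tokens found in the dictionary; with invert=True keep only those."""
    # Assumption: matching is exact and case-sensitive.
    words = set(dictionary)
    # A token survives when its dictionary membership equals the invert flag.
    return [t for t in tokens if (t in words) == invert]


tokens = ["the", "quick", "brown", "fox"]
stop_words = ["the", "a", "an"]

removed = dictionary_filter(tokens, stop_words)
kept_only = dictionary_filter(tokens, stop_words, invert=True)
```

Here `removed` holds the tokens that are not in the dictionary, and `kept_only` holds the tokens the non-inverted filter would have dropped.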

Dialog Options

Source Column

Specifies the name of the source column in the input.

Dictionary Column

Specifies the name of the dictionary column in the dictionary input.

Target Column

Specifies the name of the target column in the output.

Invert

Inverts the filter.

Ports

Input Ports

0 - Input data with tokenized text

1 - Input data with dictionary strings

Output Ports

0 - Output data with tokens filtered

Length Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.

The Length Filter node can be used to filter the tokens in a tokenized text column based on length. All tokens with a character length less than or equal to the specified length will be filtered from the TextToken. The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
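As a rough illustration of the "less than or equal to" rule, the following Python sketch models tokens as a list of strings. The function name `length_filter` is hypothetical, and measuring length with `len` (code points) is an assumption about how the node counts characters.

```python
def length_filter(tokens, length, invert=False):
    """Drop tokens whose length is <= length; with invert=True keep only those."""
    # Assumption: token length is the number of characters reported by len().
    return [t for t in tokens if (len(t) <= length) == invert]


tokens = ["a", "an", "apple", "is", "tasty"]

long_tokens = length_filter(tokens, 2)
short_tokens = length_filter(tokens, 2, invert=True)
```

With a length of 2, tokens of one or two characters are removed; inverting the filter keeps exactly those short tokens instead.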

Dialog Options

Source Column

Specifies the name of the source column in the input.

Target Column

Specifies the name of the target column in the output.

Length

Specifies the maximum length of tokens to filter.

Invert

Inverts the filter.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with tokens filtered

Punctuation Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.

The Punctuation Filter node can be used to filter the punctuation tokens in a tokenized text column. All tokens consisting of punctuation symbols will be filtered from the TextToken.

The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
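The text does not say which characters count as punctuation, so the sketch below makes an explicit assumption: a punctuation token is one whose characters all fall in a Unicode punctuation category. The helper names are hypothetical and the code only illustrates the filter semantics in Python.

```python
import unicodedata


def is_punctuation(token):
    # Assumption: every character must be in a Unicode "P*" category.
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def punctuation_filter(tokens, invert=False):
    """Drop punctuation-only tokens; with invert=True keep only those."""
    return [t for t in tokens if is_punctuation(t) == invert]


tokens = ["Hello", ",", "world", "!", "..."]

words_only = punctuation_filter(tokens)
punctuation_only = punctuation_filter(tokens, invert=True)
```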

Dialog Options

Source Column

Specifies the name of the source column in the input.

Target Column

Specifies the name of the target column in the output.

Invert

Inverts the filter.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with tokens filtered

Regex Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.

The Regex Filter node filters the tokens in a tokenized text column based on a regular expression. All tokens that match the provided regular expression will be filtered from the TextToken.

The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
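The following Python sketch illustrates regex-based token filtering. Two points are assumptions rather than documented behavior: the pattern is required to match the whole token (as Java's `Matcher.matches()` would), and Python's `re` module stands in for Java regular expressions, whose syntax is similar for simple patterns but not identical. The name `regex_filter` is hypothetical.

```python
import re


def regex_filter(tokens, pattern, invert=False):
    """Drop tokens matching the pattern; with invert=True keep only those."""
    compiled = re.compile(pattern)
    # Assumption: the pattern must match the entire token.
    return [t for t in tokens if (compiled.fullmatch(t) is not None) == invert]


tokens = ["order", "12345", "shipped", "2024", "ok"]

no_numbers = regex_filter(tokens, r"\d+")
only_numbers = regex_filter(tokens, r"\d+", invert=True)
```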

Dialog Options

Source Column

Specifies the name of the source column in the input.

Target Column

Specifies the name of the target column in the output.

Pattern

Specifies the Java regular expression used as the filter.

Invert

Inverts the filter.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with tokens filtered

Text Stemmer

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextStemmer Operator to Stem Text.

The Text Stemmer node can be used to stem tokenized text. Stemming is the process of removing the more common morphological and inflexional endings from words. The node stems all tokens in a TextToken input column.

This node uses the Snowball stemmer algorithms to stem the tokens using the specified language algorithm.

If the output column is not set, then the original input column is overwritten. This node produces a new tokenized text object that can be used for further text processing.

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Target Column

Specifies the name of the target column in the output to contain the stemmed tokenized text.

Stemmer Type

Specifies the Snowball stemmer algorithm to use.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with stemmed tokens

Text Tokenizer

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextTokenizer Operator to Tokenize Text Strings.

The Text Tokenizer node tokenizes a String field into an object that can then be used for a variety of text processing tasks.

The two main options of the Text Tokenizer node specify the source column and the target column. The source must be of type String, while the target will be a tokenized text object. The contents of the string field are parsed for each row and tokenized to create a TextToken cell that is saved in the target column. This object can then be used for other text processing tasks.

Dialog Options

Source Column

Specifies the name of the string source column in the input.

Target Column

Specifies the name of the target column in the output that will contain the tokenized text.

Custom Word Patterns

Specifies the list of Java regular expressions to be recognized as words when tokenizing the text. You can use a predefined pattern or define a custom one.
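To show why a custom word pattern can matter, here is a minimal Python sketch of pattern-driven tokenization. Everything in it is an assumption made for illustration: the default pattern, the rule that custom patterns take priority over the default, and the `tokenize` helper itself. The node accepts Java regular expressions; Python's `re` is used here only as a stand-in.

```python
import re

# Hypothetical default: runs of word characters, or single non-space symbols.
DEFAULT_PATTERN = r"\w+|[^\w\s]"


def tokenize(text, custom_patterns=()):
    """Split text into tokens, trying custom word patterns before the default."""
    alternatives = [f"(?:{p})" for p in custom_patterns] + [DEFAULT_PATTERN]
    combined = re.compile("|".join(alternatives))
    return [m.group(0) for m in combined.finditer(text)]


text = "Mail a.b@example.com today."

plain = tokenize(text)
with_email = tokenize(text, [r"[\w.]+@[\w.]+"])
```

Without the custom pattern, the e-mail address is broken into several tokens; with it, the address is recognized as a single word.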

Ports

Input Ports

0 - Input data to tokenize

Output Ports

0 - Output data with tokenized text column

Word List Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.

The Word List Filter node can be used to filter the tokens in a tokenized text column based on a list of words. All tokens that match any of the strings in the provided list are filtered from the TextToken.

The filter can also be inverted to filter all tokens that do not match the word list and keep the ones that do.

If the output column is not set, the original column in the input is overwritten.

This feature can be useful when filtering a few named entities or specific words from the original text.

When filtering a larger list of words, we recommend that the Dictionary Filter be used instead.

Dialog Options

Source Column

Specifies the name of the source column in the input.

Target Column

Specifies the name of the target column in the output.

Word List

Specifies the list of all words to filter against.

Invert

Inverts the filter.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with tokens filtered

Statistics Nodes

• Calculate N-grams

• Calculate Word Frequencies

• Count Tokens

• Expand Frequency

• Expand Text Tokens

• Frequency Filter

Calculate N-grams

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CalculateNGramFrequency Operator to Calculate N-gram Frequencies.

The Calculate N-grams node can be used to create a list of n-grams using the specified n from a TextToken input column. It also determines the absolute frequency of each n-gram.

The node creates a new output column containing the calculated n-grams and their frequencies.

The source column must be a TextToken cell, while the output column is encoded as an NgramFrequency cell. The n-gram frequency object can then be used by various other nodes that operate on frequency cells.
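The idea of n-grams with absolute frequencies can be sketched in a few lines of Python. This is an illustration only: tokens are modeled as a list of strings, the result is a `collections.Counter` rather than an NgramFrequency cell, and `ngram_frequencies` is a hypothetical name.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count each run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of size n over the token list.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


tokens = ["to", "be", "or", "not", "to", "be"]
bigrams = ngram_frequencies(tokens, 2)
```

For this input with n = 2, the bigram ("to", "be") occurs twice and every other bigram occurs once.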

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Target Column

Specifies the name of the target column in the output that contains the n-grams and their frequencies.

n

Specifies n to use when populating the n-gram list.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with n-grams and frequencies

Calculate Word Frequencies

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CalculateWordFrequency Operator to Calculate Word Frequencies.

The Calculate Word Frequencies node calculates the frequency of each unique word token in a tokenized text input column. The node creates a new output column containing the calculated word frequencies. The source column must be a TextToken cell, while the output column is encoded as a WordFrequency cell. The word frequency object can then be used by other nodes that operate on frequency cells.

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Target Column

Specifies the name of the target column in the output that contains the word frequencies.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with word frequencies

Count Tokens

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CountTokens Operator to Count Tokens.

The Count Tokens node counts the tokens of a particular type in each TextToken. The types of tokens that can be counted are words, sentences, paragraphs, and documents. By default, this node counts the number of word tokens in each TextToken. The total counts are put in the specified count column.

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Count Column

Specifies the name of the column containing the counts in the output.

Token Type

Specifies the token type to be counted: Word, Sentence, Paragraph, or Document.

Ports

Input Ports

0 - Input data with tokenized text

Output Ports

0 - Output data with token counts

Expand Frequency

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ExpandTextFrequency Operator to Expand Text Frequencies.

The Expand Frequency node can be used to expand a word frequency or n-gram frequency field. It outputs a field containing the expanded text and a field containing the associated frequencies. If either output field is unspecified, that field is not included in the output. Expanding enlarges the original data, since any field not actively expanded is simply copied. The output for the frequencies can also be customized by selecting whether absolute or relative frequencies are written to the frequency output field.
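The expansion can be pictured with the Python sketch below. It rests on several assumptions: a row is modeled as a dictionary, the frequency cell as a `collections.Counter`, the expanded source field is left out of the output, and a relative frequency is the count divided by the sum of all counts in the cell. The name `expand_frequency` is hypothetical.

```python
from collections import Counter


def expand_frequency(row, source, text_col=None, freq_col=None, relative=False):
    """Return one output row per entry in the frequency cell."""
    freqs = row[source]
    total = sum(freqs.values())
    # Fields that are not being expanded are copied into every output row.
    copied = {k: v for k, v in row.items() if k != source}
    expanded = []
    for text, count in freqs.items():
        out = dict(copied)
        if text_col is not None:
            out[text_col] = text
        if freq_col is not None:
            out[freq_col] = count / total if relative else count
        expanded.append(out)
    return expanded


row = {"id": 1, "freqs": Counter({"data": 3, "flow": 1})}

absolute = expand_frequency(row, "freqs", "word", "freq")
relative = expand_frequency(row, "freqs", "word", "freq", relative=True)
text_only = expand_frequency(row, "freqs", text_col="word")
```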

Dialog Options

Source Column

Specifies the name of the source column in the input containing the word or n-gram frequencies.

Text Column

Specifies the name of the text column in the output.

Frequency Column

Specifies the name of the frequency column in the output.

Relative Frequency

If set, outputs relative frequencies. If not set, outputs absolute frequencies.

Ports

Input Ports

0 - Input data with text frequencies

Output Ports

0 - Output data with expanded columns

Expand Text Tokens

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ExpandTextTokens Operator to Expand Text Tokens.

The Expand Text Tokens node can be used to expand a TextToken input column by token type. It creates a new string field specified by the target column name and populates it with elements from the original TextToken field, with a single element per copied row. This action causes an expansion of the original data, since any field not actively expanded is simply copied.
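A short Python sketch may help show how the row count grows. It is an illustration under assumptions: rows are dictionaries, the TextToken cell is modeled as a flat list of word strings (token types are not modeled), and the expanded source field is not carried into the output. The name `expand_text_tokens` is hypothetical.

```python
def expand_text_tokens(rows, source, target):
    """Emit one output row per token, copying all other fields."""
    expanded = []
    for row in rows:
        copied = {k: v for k, v in row.items() if k != source}
        for token in row[source]:
            expanded.append({**copied, target: token})
    return expanded


rows = [
    {"id": 1, "tokens": ["big", "data"]},
    {"id": 2, "tokens": ["text"]},
]

result = expand_text_tokens(rows, "tokens", "word")
```

Two input rows holding three tokens in total become three output rows, each repeating the `id` of the row it came from.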

Dialog Options

Source Column

Specifies the name of the source column in the input containing tokenized text.

Target Column

Specifies the name of the target column in the output.

Token Type

Specifies the token type to expand in the output.

Ports

Input Ports

0 - Input data with text tokens

Output Ports

0 - Output data with expanded columns

Frequency Filter

KNIME This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextFrequencyFilter Operator to Filter Frequencies.

The Frequency Filter node can be used to filter the frequencies in a frequency cell. It can filter n-gram frequencies or word frequencies and outputs the filtered frequencies in the specified field. The frequencies can be filtered based on a minimum and maximum threshold value or filtered so that only the top frequencies are kept.

If the output field is not set, then the original frequency column is overwritten in the output. The new filtered frequency object can be used by other nodes that operate on frequency cells.
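The two filtering modes can be sketched in Python as follows. This is an illustration, not the node's implementation: the frequency cell is modeled as a `collections.Counter`, the thresholds are treated as inclusive, and the order in which the filters are applied and the handling of ties among top frequencies are assumptions. The name `frequency_filter` is hypothetical.

```python
from collections import Counter


def frequency_filter(freqs, min_freq=None, max_freq=None, top=None):
    """Keep entries within [min_freq, max_freq], then at most the top entries."""
    items = list(freqs.items())
    if min_freq is not None:
        items = [(k, v) for k, v in items if v >= min_freq]
    if max_freq is not None:
        items = [(k, v) for k, v in items if v <= max_freq]
    if top is not None:
        # Assumption: ties keep their original order (stable sort).
        items = sorted(items, key=lambda kv: kv[1], reverse=True)[:top]
    return Counter(dict(items))


freqs = Counter({"the": 10, "data": 4, "flow": 3, "rare": 1})

mid_range = frequency_filter(freqs, min_freq=2, max_freq=5)
top_two = frequency_filter(freqs, top=2)
```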

Dialog Options

Source Column

Specifies the name of the source column in the input containing the word or n-gram frequencies.

Target Column

Specifies the name of the target column in the output that will contain the filtered frequencies.

Enable Threshold Filtering

Enables filtering based on minimum and maximum frequencies inclusively.

min

Specifies the minimum frequency to keep.

max

Specifies the maximum frequency to keep.

Enable Top Frequency Filtering

Enables filtering all but the top frequencies.

Max Top Frequencies

Specifies the maximum number of top frequencies to keep.

Ports

Input Ports

0 - Input data with text frequencies

Output Ports

0 - Output data with filtered frequencies