Text Processing Nodes
Preprocessing Nodes
Convert Case
Dictionary Filter
Length Filter
Punctuation Filter
Regex Filter
Text Stemmer
Text Tokenizer
Word List Filter
Statistics Nodes
Calculate N-grams
Calculate Word Frequencies
Count Tokens
Expand Frequency
Expand Text Tokens
Frequency Filter
Preprocessing Nodes
Convert Case
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ConvertTextCase Operator to Convert Case.
The Convert Case node performs case conversion on tokenized text. It converts the case of every token in a TextToken input column.
All tokens are converted to either lowercase or uppercase, depending on the option selected. If the output column is unspecified, the original input column is overwritten. The node produces a new tokenized text object that can be used for further text processing.
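The conversion can be pictured with a minimal Python sketch. It models a TextToken value as a plain list of strings, which is an assumption made for illustration only; `convert_case` and its `to_upper` parameter are hypothetical names, not part of the node or the DataFlow API.

```python
def convert_case(tokens, to_upper=False):
    """Return a new token list with every token lowercased or uppercased."""
    return [t.upper() if to_upper else t.lower() for t in tokens]


tokens = ["The", "Quick", "brown", "FOX"]
lowered = convert_case(tokens)
uppered = convert_case(tokens, to_upper=True)
```

The input list is left unchanged, which mirrors the node producing a new tokenized text object rather than modifying tokens in place.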
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Target Column
Specifies the name of the target column in the output that will contain the converted tokenized text.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with token cases converted
Dictionary Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DictionaryFilter Operator to Filter Based on Dictionaries.
The Dictionary Filter node removes from a tokenized text column all words that appear in the dictionary input.
The dictionary field must be a string column provided through the dictionary input port. All tokens that match any of the strings in the dictionary field are filtered from the TextToken. The filter can also be inverted, so that the matching tokens are kept and all other tokens are filtered. If the output column is unspecified, the original column in the input is overwritten.
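As a rough illustration of the filtering rule, the following Python sketch treats the tokenized text as a list of strings and the dictionary column as an iterable of strings. The function name, the exact-match comparison, and the case-sensitive behavior are assumptions of this sketch; the node's actual matching rules are not specified here.

```python
def dictionary_filter(tokens, dictionary, invert=False):
    """Drop tokens found in the dictionary, or keep only those when inverted."""
    words = set(dictionary)  # set lookup keeps large dictionaries cheap
    if invert:
        return [t for t in tokens if t in words]
    return [t for t in tokens if t not in words]


tokens = ["the", "node", "filters", "the", "stop", "words"]
stop_words = ["the", "a", "an"]
filtered = dictionary_filter(tokens, stop_words)
inverted = dictionary_filter(tokens, stop_words, invert=True)
```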
Dialog Options
Source Column
Specifies the name of the source column in the input.
Dictionary Column
Specifies the name of the dictionary column in the dictionary input.
Target Column
Specifies the name of the target column in the output.
Invert
Inverts the filter.
Ports
Input Ports
0 - Input data with tokenized text
1 - Input data with dictionary strings
Output Ports
0 - Output data with tokens filtered
Length Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.
The Length Filter node can be used to filter the tokens in a tokenized text column based on length. All tokens with a character length less than or equal to the specified length will be filtered from the TextToken. The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
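The length rule can be sketched in a few lines of Python. Tokens are modeled as a list of strings, and `length_filter` is a hypothetical helper written for this example rather than anything exposed by the node.

```python
def length_filter(tokens, length, invert=False):
    """Drop tokens whose length is <= `length`; when inverted, keep only those."""
    if invert:
        return [t for t in tokens if len(t) <= length]
    return [t for t in tokens if len(t) > length]


tokens = ["a", "an", "the", "text", "tokens"]
long_only = length_filter(tokens, 2)
short_only = length_filter(tokens, 2, invert=True)
```

With a length of 2, the tokens "a" and "an" are removed; inverting the filter keeps only those two.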
Dialog Options
Source Column
Specifies the name of the source column in the input.
Target Column
Specifies the name of the target column in the output.
Length
Specifies the maximum length of tokens to filter.
Invert
Inverts the filter.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with tokens filtered
Punctuation Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.
The Punctuation Filter node can be used to filter the punctuation tokens in a tokenized text column. All tokens consisting of punctuation symbols will be filtered from the TextToken.
The filter can also be inverted, so that only the punctuation tokens are kept and all other tokens are filtered. If the output column is unspecified, the original column in the input is overwritten.
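A minimal Python sketch of this behavior follows. It assumes that a punctuation token is a non-empty token made up entirely of ASCII punctuation characters from `string.punctuation`; the node may recognize a different or larger set of symbols. The function names are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """True when the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in PUNCTUATION for ch in token)


def punctuation_filter(tokens, invert=False):
    """Drop punctuation tokens; when inverted, keep only punctuation tokens."""
    return [t for t in tokens if is_punctuation(t) == invert]


tokens = ["Hello", ",", "world", "!", "...", "don't"]
words_only = punctuation_filter(tokens)
punctuation_only = punctuation_filter(tokens, invert=True)
```

A token such as "don't" contains letters, so this sketch does not treat it as punctuation.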
Dialog Options
Source Column
Specifies the name of the source column in the input.
Target Column
Specifies the name of the target column in the output.
Invert
Inverts the filter.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with tokens filtered
Regex Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.
The Regex Filter node filters the tokens in a tokenized text column based on a regular expression. All tokens that match the provided regular expression will be filtered from the TextToken.
The filter can also be inverted, which will filter all the tokens that would have previously been kept. If the output column is unspecified, the original column in the input will be overwritten instead.
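The sketch below illustrates the idea in Python. Note the assumptions: the node takes a Java regular expression, while this example uses Python's `re` module as a stand-in, and it assumes the pattern must match the whole token (`fullmatch`). Whether the node requires a full or partial match is not stated here, and `regex_filter` is a hypothetical name.

```python
import re


def regex_filter(tokens, pattern, invert=False):
    """Drop tokens that fully match `pattern`; when inverted, keep only those."""
    compiled = re.compile(pattern)
    if invert:
        return [t for t in tokens if compiled.fullmatch(t)]
    return [t for t in tokens if not compiled.fullmatch(t)]


tokens = ["order", "42", "shipped", "2024", "a1"]
without_numbers = regex_filter(tokens, r"\d+")
numbers_only = regex_filter(tokens, r"\d+", invert=True)
```

Simple patterns such as `\d+` behave the same way in Java and Python, but the two dialects differ in some advanced constructs.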
Dialog Options
Source Column
Specifies the name of the source column in the input.
Target Column
Specifies the name of the target column in the output.
Pattern
Specifies the Java regular expression used as the filter.
Invert
Inverts the filter.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with tokens filtered
Text Stemmer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextStemmer Operator to Stem Text.
The Text Stemmer node can be used to stem tokenized text. Stemming is the process of removing the more common morphological and inflexional endings from words. The node stems all tokens in a TextToken input column.
This node uses the Snowball stemmer algorithms to stem the tokens using the specified language algorithm.
If the output column is not set, then the original input column is overwritten. This node produces a new tokenized text object that can be used for further text processing.
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Target Column
Specifies the name of the target column in the output to contain the stemmed tokenized text.
Stemmer Type
Specifies the Snowball stemmer algorithm to use.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with stemmed tokens
Text Tokenizer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextTokenizer Operator to Tokenize Text Strings.
The Text Tokenizer node tokenizes a String field into an object that can then be used for a variety of text processing tasks.
The source column must be of type String, while the target column holds a tokenized text object. For each row, the contents of the string field are parsed and tokenized to create a TextToken cell that is saved in the target column.
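To give a feel for how custom word patterns can change the result, here is a deliberately simplified Python sketch. It is not the node's tokenizer: the default rules used here (runs of word characters, plus single punctuation characters) are assumptions, Python's `re` stands in for Java regular expressions, and `tokenize` is a hypothetical function. Custom patterns are tried first so that they take priority over the defaults.

```python
import re


def tokenize(text, custom_word_patterns=()):
    """Split text into tokens, letting custom patterns win over the defaults."""
    # Assumed defaults: a run of word characters, or one punctuation character.
    alternatives = list(custom_word_patterns) + [r"\w+", r"[^\w\s]"]
    combined = re.compile("|".join("(?:%s)" % p for p in alternatives))
    return [m.group(0) for m in combined.finditer(text)]


text = "Email bob@example.com now!"
default_tokens = tokenize(text)
custom_tokens = tokenize(text, [r"[\w.]+@[\w.]+"])
```

Without the custom pattern the e-mail address is split into five tokens; with it, the address is kept as a single word.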
Dialog Options
Source Column
Specifies the name of the string source column in the input.
Target Column
Specifies the name of the target column in the output that will contain the tokenized text.
Custom Word Patterns
Specifies the list of Java regular expressions to be recognized as words when tokenizing the text. You can use a predefined pattern or define a custom one.
Ports
Input Ports
0 - Input data to tokenize
Output Ports
0 - Output data with tokenized text column
Word List Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FilterText Operator to Filter Tokenized Text.
The Word List Filter node can be used to filter the tokens in a tokenized text column based on a list of words. All tokens that match any of the strings in the provided list are filtered from the TextToken.
The filter can also be inverted to filter all tokens that do not match the word list and keep the ones that do.
If the output column is not set, the original column in the input is overwritten.
This feature can be useful when filtering a few named entities or specific words from the original text.
When filtering a larger list of words, we recommend that the Dictionary Filter be used instead.
Dialog Options
Source Column
Specifies the name of the source column in the input.
Target Column
Specifies the name of the target column in the output.
Word List
Specifies the list of all words to filter against.
Invert
Inverts the filter.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with tokens filtered
Statistics Nodes
Calculate N-grams
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CalculateNGramFrequency Operator to Calculate N-gram Frequencies.
The Calculate N-grams node can be used to create a list of n-grams using the specified n from a TextToken input column. It also determines the absolute frequency of each n-gram.
The node creates a new output column containing the calculated n-grams and their frequencies.
The source column must be a TextToken cell, while the output column is encoded as an NgramFrequency cell. The n-gram frequency object can then be used by various other nodes that operate on frequency cells.
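The calculation can be sketched in Python as a sliding window over the tokens. The list-of-strings token model, the tuple representation of an n-gram, and the `ngram_frequencies` name are assumptions made for this example; the NgramFrequency cell has its own internal encoding.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count the absolute frequency of each run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


tokens = ["to", "be", "or", "not", "to", "be"]
bigrams = ngram_frequencies(tokens, 2)
```

Six tokens yield five bigrams, of which ("to", "be") occurs twice. A token list shorter than n yields no n-grams.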
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Target Column
Specifies the name of the target column in the output that contains the n-grams and their frequencies.
N
Specifies n to use when populating the n-gram list.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with n-grams and frequencies
Calculate Word Frequencies
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CalculateWordFrequency Operator to Calculate Word Frequencies.
The Calculate Word Frequencies node calculates the frequency of each unique word token in a tokenized text input column. The node creates a new output column containing the calculated word frequencies. The source column must be a TextToken cell, while the output column is encoded as a WordFrequency cell. The word frequency object can then be used by other nodes that operate on frequency cells.
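In Python terms, the result is comparable to a counter over the word tokens. This sketch assumes a list-of-strings token model and case-sensitive counting; `word_frequencies` is a hypothetical name and the WordFrequency cell has its own encoding.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count how often each unique word token occurs."""
    return Counter(tokens)


tokens = ["to", "be", "or", "not", "to", "be"]
frequencies = word_frequencies(tokens)
```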
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Target Column
Specifies the name of the target column in the output that contains the word frequencies.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with word frequencies
Count Tokens
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the CountTokens Operator to Count Tokens.
The Count Tokens node counts the tokens of a particular text element type in each TextToken. The types that can be counted are words, sentences, paragraphs, and documents. By default, this node counts the word tokens in each TextToken. The counts are written to the specified count column.
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Count Column
Specifies the name of the column containing the counts in the output.
Token Type
Specifies the token type to be counted: Word, Sentence, Paragraph, or Document.
Ports
Input Ports
0 - Input data with tokenized text
Output Ports
0 - Output data with token counts
Expand Frequency
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ExpandTextFrequency Operator to Expand Text Frequencies.
The Expand Frequency node expands a word frequency or n-gram frequency field. It outputs a field containing the expanded text and a field containing the associated frequencies. If either output field is unspecified, that field is not included in the output. Expanding the frequency field causes an expansion of the original data, since any field not actively expanded is simply copied. You can also choose whether absolute or relative frequencies are written to the frequency output field.
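The following Python sketch shows the shape of the expansion for a single row. Several details are assumptions of the sketch rather than documented behavior: a row is modeled as a dictionary, the frequency cell as a mapping from text to count, the source column is omitted from the output, and a relative frequency is computed as the count divided by the total count in that cell. All names are hypothetical.

```python
def expand_frequency(row, source, text_col=None, freq_col=None, relative=False):
    """Emit one output row per entry in the frequency cell of `row`."""
    freqs = row[source]
    total = sum(freqs.values())
    # Fields that are not being expanded are copied to every output row.
    passthrough = {k: v for k, v in row.items() if k != source}
    expanded = []
    for text, count in freqs.items():
        new_row = dict(passthrough)
        if text_col is not None:
            new_row[text_col] = text
        if freq_col is not None:
            new_row[freq_col] = count / total if relative else count
        expanded.append(new_row)
    return expanded


row = {"id": 7, "freq": {"data": 3, "flow": 1}}
absolute_rows = expand_frequency(row, "freq", text_col="word", freq_col="count")
relative_rows = expand_frequency(row, "freq", text_col="word",
                                 freq_col="share", relative=True)
text_only_rows = expand_frequency(row, "freq", text_col="word")
```

One input row with two frequency entries becomes two output rows, each carrying a copy of the `id` field.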
Dialog Options
Source Column
Specifies the name of the source column in the input containing the word or n-gram frequencies.
Text Column
Specifies the name of the text column in the output.
Frequency Column
Specifies the name of the frequency column in the output.
Relative Frequency
If set, outputs relative frequencies. If not set, outputs absolute frequencies.
Ports
Input Ports
0 - Input data with text frequencies
Output Ports
0 - Output data with expanded columns
Expand Text Tokens
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ExpandTextTokens Operator to Expand Text Tokens.
The Expand Text Tokens node can be used to expand a TextToken input column by token type. It creates a new string field specified by the target column name and populates it with elements from the original TextToken field, with a single element per copied row. This action causes an expansion of the original data, since any field not actively expanded is simply copied.
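A short Python sketch makes the row expansion concrete. It assumes rows are dictionaries, the tokenized text is a list of word strings (other token types are not modeled), and the source column is left out of the output; `expand_text_tokens` is a hypothetical name.

```python
def expand_text_tokens(rows, source, target):
    """Emit one output row per token, copying all other fields unchanged."""
    expanded = []
    for row in rows:
        passthrough = {k: v for k, v in row.items() if k != source}
        for token in row[source]:
            new_row = dict(passthrough)
            new_row[target] = token
            expanded.append(new_row)
    return expanded


rows = [
    {"id": 1, "text": ["big", "data"]},
    {"id": 2, "text": ["flow"]},
]
expanded = expand_text_tokens(rows, "text", "word")
```

Two input rows holding three tokens in total produce three output rows.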
Dialog Options
Source Column
Specifies the name of the source column in the input containing tokenized text.
Target Column
Specifies the name of the target column in the output.
Token Type
Specifies the token type to expand in the output.
Ports
Input Ports
0 - Input data with text tokens
Output Ports
0 - Output data with expanded columns
Frequency Filter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the TextFrequencyFilter Operator to Filter Frequencies.
The Frequency Filter node can be used to filter the frequencies in a frequency cell. It can filter n-gram frequencies or word frequencies and outputs the filtered frequencies in the specified field. The frequencies can be filtered based on a minimum and maximum threshold value or filtered so that only the top frequencies are kept.
If the output field is not set, then the original frequency column is overwritten in the output. The new filtered frequency object can be used by other nodes that operate on frequency cells.
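Both filtering modes can be sketched in Python. The sketch assumes the frequency cell is a mapping from text to absolute count, that thresholds are applied before the top-frequency cut when both are enabled, and that ties keep their original order; none of these details is confirmed here, and the function and parameter names are hypothetical.

```python
def frequency_filter(freqs, minimum=None, maximum=None, top=None):
    """Keep entries within [minimum, maximum], then at most the `top` largest."""
    kept = {
        text: count
        for text, count in freqs.items()
        if (minimum is None or count >= minimum)
        and (maximum is None or count <= maximum)
    }
    if top is not None:
        # sorted() is stable, so entries with equal counts keep their order.
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:top])
    return kept


freqs = {"the": 12, "data": 5, "flow": 5, "node": 2, "knime": 1}
in_range = frequency_filter(freqs, minimum=2, maximum=5)
top_two = frequency_filter(freqs, top=2)
```

The thresholds are inclusive, so entries with a count of exactly 2 or 5 are kept by the range filter.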
Dialog Options
Source Column
Specifies the name of the source column in the input containing the word or n-gram frequencies.
Target Column
Specifies the name of the target column in the output that will contain the filtered frequencies.
Enable Threshold Filtering
Enables filtering based on minimum and maximum frequencies inclusively.
min
Specifies the minimum frequency to keep.
max
Specifies the maximum frequency to keep.
Enable Top Frequency Filtering
Enables filtering all but the top frequencies.
Max Top Frequencies
Specifies the maximum number of top frequencies to keep.
Ports
Input Ports
0 - Input data with text frequencies
Output Ports
0 - Output data with filtered frequencies