Text Processing Operators
The DataFlow operator library includes several operators for text processing, which involves extracting information from unstructured text. This is achieved by first structuring the text in a format suitable for analysis, followed by applying various transformations and statistical techniques.
The DataFlow text processing library provides operators that can perform basic text mining and processing tasks on unstructured text. The primary operator is the TextTokenizer operator, which analyzes the text within a string field of a record and creates an object that represents a structured form of the original text. This TokenizedText object can then be used by a variety of other operators within the library to perform various transformations and statistical analysis on the text.
For more information, refer to the following topics:
TextTokenizer Operator
The
TextTokenizer operator tokenizes a string field in the source and produces a field containing a
TokenizedText object. The TextTokenizer operator has two main properties that determine the string field in the input that should be tokenized and the object field in the output that will store the encoded TokenizedText object. The contents of the string field will be parsed and tokenized, creating a TokenizedText object that will be encoded into the output field. This TokenizedText object can then be used by downstream operators for further text processing tasks.
Code Example
This example demonstrates using the
TextTokenizer operator to tokenize a message field in a record.
Using the TextTokenizer operator in Java
//Create a TextTokenizer operator
TextTokenizer tokenizer = graph.add(new TextTokenizer("messageField"));
tokenizer.setOutputField("messageTokens");
Using the TextTokenizer operator in JavaScript
//Create a TextTokenizer operator
var results = dr.textTokenizer(data, {inputField:"messageField", outputField:"messageTokens"});
Properties
The
TextTokenizer operator has the following properties.
Ports
The
TextTokenizer operator provides a single input port.
The
TextTokenizer operator provides a single output port.
CountTokens Operator
The
CountTokens operator counts the number of a particular type of token in a
TokenizedText field. The CountTokens operator has two main properties that define the input field that contains the tokenized text with tokens to count and the name of the output field that should contain the counts. By default the operator will count the number of word tokens; however, this property can be modified to count any valid
TextElementType.
Code Example
This example demonstrates using the
CountTokens operator to count the number of sentence tokens in the tokenized text field.
Using the CountTokens operator in Java
//Create a CountTokens operator
CountTokens counter = graph.add(new CountTokens("messageTokens");
counter.setOutputField("sentenceCount");
counter.setTokenType(TextElementType.SENTENCE);
Using the CountTokens operator in JavaScript
//Create a CountTokens operator
var results = dr.countTokens(data, {inputField:"messageTokens", outputField:"wordCount"});
Properties
The
CountTokens operator has the following properties.
Ports
The
CountTokens operator provides a single input port.
The
CountTokens operator provides a single output port.
FilterText Operator
The
FilterText operator filters a tokenized text field in the source and produces a field containing a filtered
TokenizedText object.
The FilterText operator has three properties: input field, output field, and the list of text filters that will be applied to the input. The input field must be a tokenized text object. The tokenized text object will be filtered of all tokens that are specified by the text filters. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.
Available Filters
LengthFilter
Filters all words with a length less than or equal to the specified length.
PunctuationFilter
Filters all standalone punctuation tokens. Will not remove punctuation that is part of a word such as an apostrophe or hyphen.
RegexFilter
Filters all words that match against the supplied regular expression.
TextElementFilter
Filters all text elements in the hierarchy that are higher than the specified element. The default hierarchy is Document, Paragraph, Sentence, Word.
WordFilter
Removes any words in a provided list.
All the available filters have the option of inverting the filter. This has the effect of keeping all the words that pass the filter instead of those that fail, and effectively inverts the output.
Code Example
This example demonstrates using the
FilterText operator to filter out XML/HTML tags and punctuation.
Using the FilterText operator in Java
//Create a FilterText operator
FilterText filter = graph.add(new FilterText("messageTokens");
filter.setOutputField("filteredTokens");
filter.setTextFilters( new PunctuationFilter(),
new RegexFilter("<(\"[^\"]*\"|'[^']*'|[^'\">])*>"));
Using the FilterText operator in JavaScript
//Create a FilterText operator
var results = dr.filterText(data, {inputField:"messageTokens", outputField:"filteredTokens",
textFilters:[new PunctuationFilter()]});
Properties
The
FilterText operator has the following properties.
Ports
The
FilterText operator provides a single input port.
The
FilterText operator provides a single output port.
DictionaryFilter Operator
The
DictionaryFilter operator filters a tokenized text field in the source based on a dictionary. This will produce a field containing a filtered
TokenizedText object in the output.
The DictionaryFilter operator has four properties: input field, output field, dictionary input field, and whether the filter should be inverted. The input field must be a tokenized text object.
The tokenized text object will be filtered of all words that are specified in the dictionary input. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.
Code Example
This example demonstrates using the
DictionaryFilter operator to filter out stop words.
Using the DictionaryFilter operator in Java
//Create a DictionaryFilter operator
DictionaryFilter filter = graph.add(new DictionaryFilter("messageTokens");
filter.setOutputField("filteredTokens");
filter.setDictionaryField("dictionary");
Using the DictionaryFilter operator in JavaScript
//Create a DictionaryFilter operator
var results = dr.dictionaryFilter(data, {inputField:"messageTokens",outputField:"filteredTokens",dictionaryField:"dictionary"});
Properties
The
DictionaryFilter operator has the following properties.
Ports
The
DictionaryFilter operator provides a single input port.
The
DictionaryFilter operator provides a single output port.
ConvertTextCase Operator
The
ConvertTextCase operator performs case conversions on a tokenized text object and produces a field containing the modified tokenized text object. The operator will convert all the characters in the individual text tokens into upper- or lowercase depending on the settings and will produce a new tokenized text object with the specified case conversions applied to each token.
The ConvertTextCase operator has three properties: input field, output field, and the case used for the conversion. The input field must be a tokenized text object, and the output field will similarly be a tokenized text object. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. The new tokenized text object can then be used for further text processing tasks.
Code Example
This example demonstrates using the
ConvertTextCase operator to convert all tokens into lowercase.
Using the ConvertTextCase operator in Java
//Create a ConvertTextCase operator
ConvertTextCase converter = graph.add(new ConvertTextCase("messageTokens");
converter.setOutputField("convertedTokens");
converter.setCaseFormat(Case.LOWER);
Using the ConvertTextCase Operator in JavaScript
//Create a ConvertTextCase operator
var results = dr.convertTextCase(data, {inputField:"messageTokens", outputField:"convertedTokens", caseFormat:"LOWER"});
Properties
The
ConvertTextCase operator has the following properties.
Ports
The
ConvertTextCase operator provides a single input port.
The
ConvertTextCase operator provides a single output port.
TextStemmer Operator
Stemming is the process for removing the commoner morphological and inflexional endings from words.
The
TextStemmer operator stems a tokenized text field in the source and produces a field containing a stemmed
TokenizedText object.
The TextStemmer operator has three properties: input field, output field, and the stemmer to use. The input field must be a tokenized text object. Each of the words in the tokenized text object will be stemmed using the rules defined by the specified stemmer algorithm. This will produce a new tokenized text object with the original words replaced by their stemmed form, which will then be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.
Available Stemmers
The stemmers use the snowball stemmer algorithms to perform stemming. For more information, visit the snowball website at
http://snowball.tartarus.org/. The available stemmers are:
• Armenian
• Basque
• Catalan
• Danish
• Dutch
• English
• Finnish
• French
• German
• Hungarian
• Irish
• Italian
• Lovins
• Norwegian
• Porter
• Portuguese
• Romanian
• Russian
• Spanish
• Swedish
• Turkish
Code Example
This example demonstrates using the
TextStemmer operator to stem a text field with the Porter stemmer algorithm.
Using the TextStemmer operator in Java
//Create a TextStemmer operator
TextStemmer stemmer = graph.add(new TextStemmer("messageTokens");
stemmer.setOutputField("stemmedTokens");
stemmer.setStemmerType(StemmerType.PORTER);
Using the TextStemmer operator in JavaScript
//Create a TextStemmer operator
var results = dr.textStemmer(data, {inputField:"messageTokens", outputField:"stemmedTokens", stemmerType:"PORTER"});
Properties
The
TextStemmer operator has the following properties.
Ports
The
TextStemmer operator provides a single input port.
The
TextStemmer operator provides a single output port.
ExpandTextTokens Operator
The
ExpandTextTokens operator can be used to expand a tokenized text field. The operator will create a new string field in the output, which it will then expand the tokenized text object into based on the token type specified, with one token per copied row. This will cause an expansion of the original input data since the rows associated with the original tokenized text object will be duplicated for every token in the output.
The ExpandTextTokens operator has three properties: input field, output field, and the
TextElementType to expand in the output. The input field must be a tokenized text object. If there are no tokens of the specified type contained in the tokenized text object, the string output field will contain null for that row.
Code Example
This example demonstrates using the
ExpandTextTokens to expand the individual words of the original text into a string field.
Using the ExpandTextTokens operator in Java
//Create an ExpandTextTokens operator
ExpandTextTokens expander = graph.add(new ExpandTextTokens("messageTokens");
expander.setOutputField("words");
expander.setTokenType(TextElementType.WORD);
Using the ExpandTextTokens operator in JavaScript
//Create an ExpandTextTokens operator
var results = dr.expandTextTokens(data, {inputField:"messageTokens", outputField:"sentences", tokenType:"SENTENCE"});
Properties
The
ExpandTextTokens operator has the following properties.
Ports
The
ExpandTextTokens operator provides a single input port.
The
ExpandTextTokens operator provides a single output port.
CalculateWordFrequency Operator
The
CalculateWordFrequency operator determines the frequencies of each word in a
TokenizedText field. The
CalculateWordFrequency operator has two main properties that define the input field that contains the tokenized text, and the output field for the frequency map
The operator will output a
WordMap object that contains the words and their associated frequencies. This object can then be used by other operators such as
TextFrequencyFilter Operator or
ExpandTextFrequency Operator.
Code Example
This example demonstrates using the
CalculateWordFrequency operator to determine the frequency of each word in the tokenized text field.
Using the CalculateWordFrequency operator in Java
//Create a CalculateWordFrequency operator
CalculateWordFrequency freqCalc = graph.add(new CalculateWordFrequency("messageTokens");
freqCalc.setOutputField("wordsFrequencies");
Using the CalculateWordFrequency Operator in JavaScript
//Create a CalculateWordFrequency operator
var results = dr.calculateWordFrequency(data, {inputField:"messageTokens", outputField:"wordFrequencies"});
Properties
The
CalculateWordFrequency operator has the following properties.
Ports
The
CalculateWordFrequency operator provides a single input port.
The
CalculateWordFrequency operator provides a single output port.
CalculateNGramFrequency Operator
The
CalculateNGramFrequency operator determines the frequencies of each n-gram in a
TokenizedText field. The CalculateNGramFrequency operator has three main properties which define the input field that contains the tokenized text, the output field for the frequency map, and the n that will be used by the calculation. The operator will output an
NGramMap object that contains the n-grams and their associated frequencies.
This object can then be used by other operators such as
TextFrequencyFilter Operator or
ExpandTextFrequency Operator.
Code Example
This example demonstrates using the
CalculateNGramFrequency operator to determine the frequency of each bigram in the tokenized text field.
Using the CalculateNGramFrequency operator in Java
//Create a CalculateNGramFrequency operator
CalculateNGramFrequency freqCalc = graph.add(new CalculateNGramFrequency("messageTokens");
freqCalc.setOutputField("ngramFrequencies");
freqCalc.setN(2);
Using the CalculateNGramFrequency operator in JavaScript
//Create a CalculateNGramFrequency operator
var results = dr.calculateNGramFrequency(data, {inputField:"messageTokens", outputField:"ngramFrequencies"});
Properties
The
CalculateNGramFrequency operator has the following properties.
Ports
The
CalculateNGramFrequency operator provides a single input port.
The
CalculateNGramFrequency operator provides a single output port.
TextFrequencyFilter Operator
The
TextFrequencyFilter operator filters a list of frequencies produced by the
CalculateWordFrequency or
CalculateNGramFrequency operators.
The operator has several properties that must be set to determine the behavior of the filter. The input field for the frequency maps must be set. Additionally a minimum or maximum threshold may be set for the frequencies. This will cause the filter to only keep those absolute frequencies between the minimum or maximum threshold inclusively. Also the total number of top frequencies to keep may be set. This will be applied after determining if the frequencies are within the threshold. Therefore if fewer than the specified number of total frequencies are available, they will all be included in the output.
Any combination of these filtering methods may be applied to the frequencies. If an output field for the filtered frequency maps is unspecified, the original field will be overwritten in the output.
Code Example
This example demonstrates using the
TextFrequencyFilter operator to filter all the absolute frequencies below two and keeps the top ten frequencies.
Using the TextFrequencyFilter operator in Java
//Create a TextFrequencyFilter operator
TextFrequencyFilter filter = graph.add(new TextFrequencyFilter("FrequencyMap");
filter.setOutputField("filteredFrequencies");
filter.setMinThreshold(2);
filter.setTotalNumber(10);
Using the TextFrequencyFilter operator in JavaScript
//Create a TextFrequencyFilter operator
var results = dr.textFrequencyFilter (data,{inputField:"FrequencyMap", outputField:"filteredFrequencies", minThreshold:2, totalNumber:10});
Properties
The
TextFrequencyFilter has the following properties.
Ports
The
TextFrequencyFilter provides a single input port.
The
TextFrequencyFilter provides a single output port
ExpandTextFrequency Operator
The
ExpandTextFrequency operator expands a frequency map produced by the
CalculateWordFrequency or
CalculateNGramFrequency operators. If either output field is unspecified the operator will only include the specified output fields in the output. This can be useful if only the text elements or frequencies are needed.
As expected, this operator will cause an expansion of the original input data, and any fields that are not directly expanded will simply be copied. If the row does not have a frequency map to expand, the row will simply be copied into the output with a null indicator inserted into the new fields.
The output of the frequencies can additionally be controlled by setting whether the operator should output relative or absolute frequencies.
Code Example
This example demonstrates using the
ExpandTextFrequency operator to expand the word frequencies in the record.
Using the ExpandTextFrequency operator in Java
//Create an ExpandTextFrequency operator
ExpandTextFrequency expander = graph.add(new ExpandTextFrequency("frequencyMap");
expander.setTextOutputField("words");
expander.setFreqOutputField("frequencies");
Using the ExpandTextFrequency operator in JavaScript
//Create an ExpandTextFrequency operator
var results = dr.expandTextFrequency(data, {inputField:"frequencyMap", textOutputField:"words", freqOutputField:"frequencies"});
Properties
The
ExpandTextFrequency operator has the following properties.
Ports
The
ExpandTextFrequency operator provides a single input port.
The
ExpandTextFrequency operator provides a single output port.
GenerateBagOfWords Operator
The
GenerateBagOfWords operator can be used to determine all of the distinct words in a
TokenizedText field. The GenerateBagOfWords operator has two main properties that define the input field containing the tokenized text and the output field for the words.
Code Example
This example demonstrates using the
GenerateBagOfWords operator to determine the frequency of each word in the tokenized text field.
Using the CalculateWordFrequency operator in Java
//Create a GenerateBagOfWords operator
GenerateBagOfWords bow = graph.add(new GenerateBagOfWords("messageTokens");
bow.setOutputField("words");
Using the CalculateWordFrequency operator in JavaScript
//Create a GenerateBagOfWords operator
var results = dr.generateBagOfWords(data, {inputField:"messageTokens", outputField:"words"});
Properties
The
GenerateBagOfWords operator has the following properties.
Ports
The
GenerateBagOfWords operator provides a single input port.
The
GenerateBagOfWords operator provides a single output port.
DrawDiagnosticsChart Operator
To build diagnostic charts, the
DrawDiagnosticsChart operator uses confidence values along with the actual target values (true class).
Note: These values are obtained using the input of this operator and output of one or multiple predictors.
The supported chart types are:
ROC chart
Provides a comparison between the true positive rate (y-axis) and false positive rate (x-axis) when the confidence threshold decreases.
Gains chart (CRC, Cumulative Response Chart)
Provides the change in true positive rate (y-axis) when the percentage of the targeted population (x-axis) decreases.
Lift chart
Provides the change in lift (y-axis) which is the ratio between predictor result and baseline when the percentage of the targeted population (x-axis) increases.
The true positive rate is the ratio of all correct positive classifications and total number of positive samples in the test set. The false positive rate is the ratio of all incorrect positive classifications and total number of negative samples in the test set.
The
DrawDiagnosticsChart operator can accept one to five predictor operators as input sources. Each predictor output must contain a column for the actual target values and confidence values (probability, score) assigned by the given predictor.
Note: This operator is not parallelizable and runs on a single node in cluster mode.
Output Chart
The following images are examples for the chart types.
Code Example
Using the DrawDiagnosticsChart operator in Java
// Create a diagnostics chart drawer with two input ports
DrawDiagnosticsChart drawer = new DrawDiagnosticsChart(2);
drawer.setConfidenceFieldNames(Arrays.asList("NaiveBayesConfidence", "BaseConfidence"));
drawer.setTargetFieldNames(Arrays.asList("target", "target"));
drawer.setChartNames(Arrays.asList("NAIVE BAYES", "BASE"));
drawer.setChartType(ChartType.GAINS);
drawer.setResultSize(10);
drawer.setTargetValue("E");
drawer.setOutputPath("chart.png");
graph.connect(bayesPredictor.getOutput(), drawer.getInput(0));
graph.connect(basePredictor.getOutput(), drawer.getInput(1));
Using the DrawDiagnosticsChart operator in RushScript
// Using default settings: five (optional) input ports, values for disconnected ports as nulls.
var drawer = dr.drawDiagnosticsChart("chart", bayesPredictor, null, basePredictor, null, null, {
confidenceFieldNames: ["NaiveBayesConfidence", null, "BaseConfidence", null, null],
targetFieldNames: ["target", null, "target", null, null],
chartNames: ["NAIVE BAYES", null, "BASE", null, null],
outputPath: "chart.png",
chartType: com.pervasive.datarush.analytics.viz.ChartType.GAINS,
resultSize: 10,
targetValue: "E"
});
Properties
The
DrawDiagnosticsChart operator provides the following properties.
Ports
The
DrawDiagnosticsChart operator provides an arbitrary number of input ports. It is configured using constructor arguments. The default setting is five input ports. Only port 0 is mandatory.
The
DrawDiagnosticsChart operator provides a single output port.