DF 8.2 | Text Processing Operators

Text Processing Operators

The DataFlow operator library includes several operators for text processing, which involves extracting information from unstructured text. This is achieved by first structuring the text in a format suitable for analysis, followed by applying various transformations and statistical techniques.

The DataFlow text processing library provides operators that can perform basic text mining and processing tasks on unstructured text. The primary operator is the TextTokenizer operator, which analyzes the text within a string field of a record and creates an object that represents a structured form of the original text. This TokenizedText object can then be used by a variety of other operators within the library to perform various transformations and statistical analysis on the text.

For more information, refer to the following topics:

• TextTokenizer Operator

• CountTokens Operator

• FilterText Operator

• DictionaryFilter Operator

• ConvertTextCase Operator

• TextStemmer Operator

• ExpandTextTokens Operator

• CalculateWordFrequency Operator

• CalculateNGramFrequency Operator

• TextFrequencyFilter Operator

• ExpandTextFrequency Operator

• GenerateBagOfWords Operator

TextTokenizer Operator

The TextTokenizer operator tokenizes a string field in the source and produces a field containing a TokenizedText object. The TextTokenizer operator has two main properties that determine the string field in the input that should be tokenized and the object field in the output that will store the encoded TokenizedText object. The contents of the string field will be parsed and tokenized, creating a TokenizedText object that will be encoded into the output field. This TokenizedText object can then be used by downstream operators for further text processing tasks.

Code Example

This example demonstrates using the TextTokenizer operator to tokenize a message field in a record.

Using the TextTokenizer operator in Java

//Create a TextTokenizer operator
TextTokenizer tokenizer = graph.add(new TextTokenizer("messageField"));
tokenizer.setOutputField("messageTokens");

Using the TextTokenizer operator in JavaScript

//Create a TextTokenizer operator
var results = dr.textTokenizer(data, {inputField:"messageField", outputField:"messageTokens"});

Properties

The TextTokenizer operator has the following properties.

Name	Type	Description
inputField	String	The name of the String field to tokenize. If this field does not exist in the input, or is not of type String, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the tokenized text object. Defaults to TokenizedTextField if unspecified.
wordPatterns	List<String>	A list of regular expressions that will be used to find custom word patterns while tokenizing the text.

Ports

The TextTokenizer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the string data that will be tokenized.

The TextTokenizer operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the tokenized text object field produced from the input data.

CountTokens Operator

The CountTokens operator counts the number of a particular type of token in a TokenizedText field. The CountTokens operator has two main properties that define the input field that contains the tokenized text with tokens to count and the name of the output field that should contain the counts. By default the operator will count the number of word tokens; however, this property can be modified to count any valid TextElementType.

Code Example

This example demonstrates using the CountTokens operator to count the number of sentence tokens in the tokenized text field.

Using the CountTokens operator in Java

//Create a CountTokens operator
CountTokens counter = graph.add(new CountTokens("messageTokens"));
counter.setOutputField("sentenceCount");
counter.setTokenType(TextElementType.SENTENCE);

Using the CountTokens operator in JavaScript

//Create a CountTokens operator
var results = dr.countTokens(data, {inputField:"messageTokens", outputField:"sentenceCount", tokenType:"SENTENCE"});

Properties

The CountTokens operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field with tokens to count. If this field does not exist in the input, or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the count. Defaults to TokenCount if unspecified.
tokenType	TextElementType	The specific type of token to count. Defaults to TextElementType.WORD.

Ports

The CountTokens operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be counted.

The CountTokens operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the token count field produced from the input data.

FilterText Operator

The FilterText operator filters a tokenized text field in the source and produces a field containing a filtered TokenizedText object.

The FilterText operator has three properties: input field, output field, and the list of text filters that will be applied to the input. The input field must be a tokenized text object. The tokenized text object will be filtered of all tokens that are specified by the text filters. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.

Available Filters

LengthFilter

Filters all words with a length less than or equal to the specified length.

PunctuationFilter

Filters all standalone punctuation tokens. Will not remove punctuation that is part of a word such as an apostrophe or hyphen.

RegexFilter

Filters all words that match against the supplied regular expression.

TextElementFilter

Filters all text elements in the hierarchy that are higher than the specified element. The default hierarchy is Document, Paragraph, Sentence, Word.

WordFilter

Removes any words in a provided list.

All the available filters have the option of inverting the filter. This has the effect of keeping all the words that pass the filter instead of those that fail, and effectively inverts the output.
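The filter and inversion semantics described above can be sketched in plain Java. This is only an illustration and does not use the DataFlow API: the class TokenFilterSketch and its methods are hypothetical, tokens are modeled as plain strings, and the sketch assumes that a regular expression must match the whole word and that an inverted filter keeps only the words the filter would otherwise remove.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of the DataFlow operator library.
final class TokenFilterSketch {
    // Selects words with a length less than or equal to maxLength.
    static Predicate<String> lengthFilter(int maxLength) {
        return word -> word.length() <= maxLength;
    }

    // Selects words that match the regular expression (assumed: full match).
    static Predicate<String> regexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return word -> pattern.matcher(word).matches();
    }

    // Removes the selected words; when inverted, keeps only the selected words.
    static List<String> apply(List<String> tokens, Predicate<String> filter, boolean inverted) {
        Predicate<String> keep = inverted ? filter : filter.negate();
        return tokens.stream().filter(keep).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "to", "<b>", "data", "flow");
        System.out.println(apply(tokens, lengthFilter(2), false));      // [<b>, data, flow]
        System.out.println(apply(tokens, regexFilter("<[^>]*>"), true)); // [<b>]
    }
}
```

Modeling each filter as a predicate keeps inversion a one-line change, which mirrors the way every filter listed above exposes the same invert option.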

Code Example

This example demonstrates using the FilterText operator to filter out XML/HTML tags and punctuation.

Using the FilterText operator in Java

//Create a FilterText operator
FilterText filter = graph.add(new FilterText("messageTokens"));
filter.setOutputField("filteredTokens");
filter.setTextFilters(new PunctuationFilter(),
    new RegexFilter("<(\"[^\"]*\"|'[^']*'|[^'\">])*>"));

Using the FilterText operator in JavaScript

//Create a FilterText operator
var results = dr.filterText(data, {inputField:"messageTokens", outputField:"filteredTokens",
textFilters:[new PunctuationFilter()]});

Properties

The FilterText operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to filter. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the filtered tokenized text object. If unspecified, it will overwrite the original input field.
textFilters	TextFilter[]	The list of text filters to apply to the input data. The filters will be applied in the order they are present in the list.

Ports

The FilterText operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be filtered.

The FilterText operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered tokenized text object field produced from the input data.

DictionaryFilter Operator

The DictionaryFilter operator filters a tokenized text field in the source based on a dictionary. This will produce a field containing a filtered TokenizedText object in the output.

The DictionaryFilter operator has four properties: input field, output field, dictionary input field, and whether the filter should be inverted. The input field must be a tokenized text object.

The tokenized text object will be filtered of all words that are specified in the dictionary input. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.

Code Example

This example demonstrates using the DictionaryFilter operator to filter out stop words.

Using the DictionaryFilter operator in Java

//Create a DictionaryFilter operator
DictionaryFilter filter = graph.add(new DictionaryFilter("messageTokens"));
filter.setOutputField("filteredTokens");
filter.setDictionaryField("dictionary");

Using the DictionaryFilter operator in JavaScript

//Create a DictionaryFilter operator
var results = dr.dictionaryFilter(data, {inputField:"messageTokens",outputField:"filteredTokens",dictionaryField:"dictionary"});

Properties

The DictionaryFilter operator has the following properties.

Name	Type	Description
dictionaryField	String	The name of the dictionary field in the dictionary input.
inputField	String	The name of the tokenized text field to filter. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
inverted	Boolean	Specifies whether the filter must be inverted.
outputField	String	The name of the output field that will contain the filtered tokenized text object. If unspecified, it will overwrite the original input field.

Ports

The DictionaryFilter operator provides two input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be filtered.
dictInput	RecordPort	getInput()	The dictionary input data containing the words used to filter the tokenized text.

The DictionaryFilter operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered tokenized text object field produced from the input data.

ConvertTextCase Operator

The ConvertTextCase operator performs case conversions on a tokenized text object and produces a field containing the modified tokenized text object. The operator will convert all the characters in the individual text tokens into upper- or lowercase depending on the settings and will produce a new tokenized text object with the specified case conversions applied to each token.

The ConvertTextCase operator has three properties: input field, output field, and the case used for the conversion. The input field must be a tokenized text object, and the output field will similarly be a tokenized text object. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. The new tokenized text object can then be used for further text processing tasks.

Code Example

This example demonstrates using the ConvertTextCase operator to convert all tokens into lowercase.

Using the ConvertTextCase operator in Java

//Create a ConvertTextCase operator
ConvertTextCase converter = graph.add(new ConvertTextCase("messageTokens"));
converter.setOutputField("convertedTokens");
converter.setCaseFormat(Case.LOWER);

Using the ConvertTextCase operator in JavaScript

//Create a ConvertTextCase operator
var results = dr.convertTextCase(data, {inputField:"messageTokens", outputField:"convertedTokens", caseFormat:"LOWER"});

Properties

The ConvertTextCase operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to convert. If this field does not exist in the input, or is not of type TokenizedText, an exception will be thrown at composition time.
outputField	String	The name of the output field that will contain the converted tokenized text object. If unspecified, it will overwrite the original input field.
caseFormat	Case	The case format to use. Can be set to LOWER or UPPER. Default: LOWER.

Ports

The ConvertTextCase operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be converted.

The ConvertTextCase operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the converted tokenized text object field produced from the input data.

TextStemmer Operator

Stemming is the process of removing the more common morphological and inflectional endings from words.

The TextStemmer operator stems a tokenized text field in the source and produces a field containing a stemmed TokenizedText object.

The TextStemmer operator has three properties: input field, output field, and the stemmer to use. The input field must be a tokenized text object. Each of the words in the tokenized text object will be stemmed using the rules defined by the specified stemmer algorithm. This will produce a new tokenized text object with the original words replaced by their stemmed form, which will then be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.

Available Stemmers

The stemmers use the Snowball stemming algorithms. For more information, visit the Snowball website at http://snowball.tartarus.org/. The available stemmers are:

• Armenian

• Basque

• Catalan

• Danish

• Dutch

• English

• Finnish

• French

• German

• Hungarian

• Irish

• Italian

• Lovins

• Norwegian

• Porter

• Portuguese

• Romanian

• Russian

• Spanish

• Swedish

• Turkish

Code Example

This example demonstrates using the TextStemmer operator to stem a text field with the Porter stemmer algorithm.

Using the TextStemmer operator in Java

//Create a TextStemmer operator
TextStemmer stemmer = graph.add(new TextStemmer("messageTokens"));
stemmer.setOutputField("stemmedTokens");
stemmer.setStemmerType(StemmerType.PORTER);

Using the TextStemmer operator in JavaScript

//Create a TextStemmer operator
var results = dr.textStemmer(data, {inputField:"messageTokens", outputField:"stemmedTokens", stemmerType:"PORTER"});

Properties

The TextStemmer operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to stem. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the stemmed tokenized text object. If unspecified, it will overwrite the original input field.
stemmerType	StemmerType	The stemmer algorithm to apply to the input.

Ports

The TextStemmer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data to be stemmed.

The TextStemmer operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the stemmed tokenized text object field produced from the input data.

ExpandTextTokens Operator

The ExpandTextTokens operator expands a tokenized text field. The operator creates a new string field in the output and expands the tokenized text object into it based on the specified token type, producing one output row per token. This expands the original input data, since the row associated with the original tokenized text object is duplicated for every token in the output.

The ExpandTextTokens operator has three properties: input field, output field, and the TextElementType to expand in the output. The input field must be a tokenized text object. If there are no tokens of the specified type contained in the tokenized text object, the string output field will contain null for that row.
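The row expansion described above can be sketched in plain Java. This is a hedged illustration rather than the operator's implementation: the class ExpandRowsSketch is hypothetical, a record is reduced to a single id value plus its list of tokens, and each output row is modeled as a two-element array holding the copied id and one token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not part of the DataFlow operator library.
final class ExpandRowsSketch {
    // Produces one {id, token} row per token, or a single {id, null} row
    // when the record has no tokens of the requested type.
    static List<String[]> expand(String id, List<String> tokens) {
        List<String[]> rows = new ArrayList<>();
        if (tokens.isEmpty()) {
            rows.add(new String[] {id, null});
            return rows;
        }
        for (String token : tokens) {
            rows.add(new String[] {id, token});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : expand("r1", Arrays.asList("big", "data"))) {
            System.out.println(Arrays.toString(row));
        }
        for (String[] row : expand("r2", Collections.<String>emptyList())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```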

Code Example

This example demonstrates using the ExpandTextTokens operator to expand the individual words of the original text into a string field.

Using the ExpandTextTokens operator in Java

//Create an ExpandTextTokens operator
ExpandTextTokens expander = graph.add(new ExpandTextTokens("messageTokens"));
expander.setOutputField("words");
expander.setTokenType(TextElementType.WORD);

Using the ExpandTextTokens operator in JavaScript

//Create an ExpandTextTokens operator
var results = dr.expandTextTokens(data, {inputField:"messageTokens", outputField:"sentences", tokenType:"SENTENCE"});

Properties

The ExpandTextTokens operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to expand. If this field does not exist in the input, or is not of type TokenizedText, an exception will be thrown at composition time.
outputField	String	The name of the output field that will contain the strings from the expanded tokenized text object. Defaults to TextElementStrings if unspecified.
tokenType	TextElementType	The type of text token to expand from the original tokenized text. Default: TextElementType.WORD.

Ports

The ExpandTextTokens operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be expanded.

The ExpandTextTokens operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the expanded text produced from the input data.

CalculateWordFrequency Operator

The CalculateWordFrequency operator determines the frequencies of each word in a TokenizedText field. The CalculateWordFrequency operator has two main properties that define the input field that contains the tokenized text and the output field for the frequency map.

The operator will output a WordMap object that contains the words and their associated frequencies. This object can then be used by other operators such as TextFrequencyFilter Operator or ExpandTextFrequency Operator.

Code Example

This example demonstrates using the CalculateWordFrequency operator to determine the frequency of each word in the tokenized text field.

Using the CalculateWordFrequency operator in Java

//Create a CalculateWordFrequency operator
CalculateWordFrequency freqCalc = graph.add(new CalculateWordFrequency("messageTokens"));
freqCalc.setOutputField("wordFrequencies");

Using the CalculateWordFrequency operator in JavaScript

//Create a CalculateWordFrequency operator
var results = dr.calculateWordFrequency(data, {inputField:"messageTokens", outputField:"wordFrequencies"});

Properties

The CalculateWordFrequency operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to calculate the word frequencies for. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the word lists. Defaults to WordFrequency if unspecified.

Ports

The CalculateWordFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to calculate the frequencies.

The CalculateWordFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the word frequency map field produced from the input data.

CalculateNGramFrequency Operator

The CalculateNGramFrequency operator determines the frequencies of each n-gram in a TokenizedText field. The CalculateNGramFrequency operator has three main properties which define the input field that contains the tokenized text, the output field for the frequency map, and the n that will be used by the calculation. The operator will output an NGramMap object that contains the n-grams and their associated frequencies.

This object can then be used by other operators such as TextFrequencyFilter Operator or ExpandTextFrequency Operator.
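The n-gram frequency calculation can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the operator's implementation: the class NGramSketch is hypothetical, the input is a flat list of word strings, n-grams are formed from consecutive words joined by a space, and sentence boundaries are ignored. With n set to 1, the same method yields plain word frequencies.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the DataFlow operator library.
final class NGramSketch {
    // Counts every run of n consecutive words, in order of first appearance.
    static Map<String, Integer> frequencies(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String gram = String.join(" ", words.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(frequencies(words, 2)); // {to be=2, be or=1, or not=1, not to=1}
    }
}
```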

Code Example

This example demonstrates using the CalculateNGramFrequency operator to determine the frequency of each bigram in the tokenized text field.

Using the CalculateNGramFrequency operator in Java

//Create a CalculateNGramFrequency operator
CalculateNGramFrequency freqCalc = graph.add(new CalculateNGramFrequency("messageTokens"));
freqCalc.setOutputField("ngramFrequencies");
freqCalc.setN(2);

Using the CalculateNGramFrequency operator in JavaScript

//Create a CalculateNGramFrequency operator
var results = dr.calculateNGramFrequency(data, {inputField:"messageTokens", outputField:"ngramFrequencies", n:2});

Properties

The CalculateNGramFrequency operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to calculate the n-gram frequencies for. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the frequency map. Defaults to NgramFrequency if unspecified.
n	int	The degree of the n-grams that will be used.

Ports

The CalculateNGramFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to calculate the frequencies.

The CalculateNGramFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the n-gram frequency map fields produced from the input data.

TextFrequencyFilter Operator

The TextFrequencyFilter operator filters a list of frequencies produced by the CalculateWordFrequency or CalculateNGramFrequency operators.

The operator has several properties that determine the behavior of the filter. The input field containing the frequency maps must be set. Additionally, a minimum threshold, a maximum threshold, or both may be set for the frequencies; the filter then keeps only those absolute frequencies that fall within the thresholds, inclusive. The total number of top frequencies to keep may also be set. This limit is applied after the threshold check, so if fewer than the specified number of frequencies remain, all of them are included in the output.

Any combination of these filtering methods may be applied to the frequencies. If an output field for the filtered frequency maps is unspecified, the original field will be overwritten in the output.
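The order of operations described above (threshold check first, then the top-frequency limit) can be sketched in plain Java. This is an illustration only: the class FrequencyFilterSketch is hypothetical, the frequency map is modeled as a Map from text to absolute count, and breaking ties alphabetically is an assumption made here so that the result is deterministic.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of the DataFlow operator library.
final class FrequencyFilterSketch {
    // Keeps entries with minThreshold <= count <= maxThreshold, then the
    // totalNumber highest counts among those that remain.
    static Map<String, Integer> filter(Map<String, Integer> freqs,
                                       int minThreshold, int maxThreshold, int totalNumber) {
        return freqs.entrySet().stream()
            .filter(e -> e.getValue() >= minThreshold && e.getValue() <= maxThreshold)
            .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
            .limit(totalNumber)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("the", 5);
        freqs.put("cat", 2);
        freqs.put("sat", 1);
        freqs.put("mat", 2);
        freqs.put("on", 3);
        // Drop counts below 2, then keep the top 3 of what is left.
        System.out.println(filter(freqs, 2, Integer.MAX_VALUE, 3)); // {the=5, on=3, cat=2}
    }
}
```

When fewer entries survive the thresholds than the requested total, the limit has no effect and every surviving entry is returned.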

Code Example

This example demonstrates using the TextFrequencyFilter operator to filter out all absolute frequencies below two and keep the top ten frequencies.

Using the TextFrequencyFilter operator in Java

//Create a TextFrequencyFilter operator
TextFrequencyFilter filter = graph.add(new TextFrequencyFilter("FrequencyMap"));
filter.setOutputField("filteredFrequencies");
filter.setMinThreshold(2);
filter.setTotalNumber(10);

Using the TextFrequencyFilter operator in JavaScript

//Create a TextFrequencyFilter operator
var results = dr.textFrequencyFilter(data, {inputField:"FrequencyMap", outputField:"filteredFrequencies", minThreshold:2, totalNumber:10});

Properties

The TextFrequencyFilter has the following properties.

Name	Type	Description
inputField	String	The name of the field containing the word or n-gram frequency map in the input that will be filtered. If this field does not exist in the input or is not a WordMap or NGramMap, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the filtered frequency map. If unspecified, it will overwrite the original inputField.
minThreshold	int	The minimum inclusive threshold for frequencies when filtering.
maxThreshold	int	The maximum inclusive threshold for frequencies when filtering.
totalNumber	int	The total number of top frequencies to keep when filtering.

Ports

The TextFrequencyFilter provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the frequency maps that will be filtered.

The TextFrequencyFilter provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered frequency maps produced from the input data.

ExpandTextFrequency Operator

The ExpandTextFrequency operator expands a frequency map produced by the CalculateWordFrequency or CalculateNGramFrequency operators. If either output field is unspecified, only the specified output field is included in the output. This can be useful if only the text elements or only the frequencies are needed.

As expected, this operator will cause an expansion of the original input data, and any fields that are not directly expanded will simply be copied. If the row does not have a frequency map to expand, the row will simply be copied into the output with a null indicator inserted into the new fields.

The output of the frequencies can additionally be controlled by setting whether the operator should output relative or absolute frequencies.
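The difference between absolute and relative output can be sketched in plain Java. This is a hedged illustration, not the operator's implementation: the class ExpandFrequencySketch is hypothetical, and it assumes that a relative frequency is an entry's count divided by the sum of all counts in the same map, which the text above does not state explicitly.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the DataFlow operator library.
final class ExpandFrequencySketch {
    // Emits one (text, frequency) pair per map entry. A missing or empty map
    // yields a single pair of nulls, mirroring the null indicators above.
    static List<Map.Entry<String, Double>> expand(Map<String, Integer> freqs, boolean relative) {
        List<Map.Entry<String, Double>> rows = new ArrayList<>();
        if (freqs == null || freqs.isEmpty()) {
            rows.add(new AbstractMap.SimpleEntry<String, Double>(null, null));
            return rows;
        }
        double total = 0.0;
        for (int count : freqs.values()) {
            total += count;
        }
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            // Assumed definition: relative frequency = count / sum of counts.
            double value = relative ? e.getValue() / total : e.getValue();
            rows.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), value));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("data", 3);
        freqs.put("flow", 1);
        System.out.println(expand(freqs, false));
        System.out.println(expand(freqs, true));
    }
}
```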

Code Example

This example demonstrates using the ExpandTextFrequency operator to expand the word frequencies in the record.

Using the ExpandTextFrequency operator in Java

//Create an ExpandTextFrequency operator
ExpandTextFrequency expander = graph.add(new ExpandTextFrequency("frequencyMap"));
expander.setTextOutputField("words");
expander.setFreqOutputField("frequencies");

Using the ExpandTextFrequency operator in JavaScript

//Create an ExpandTextFrequency operator
var results = dr.expandTextFrequency(data, {inputField:"frequencyMap", textOutputField:"words", freqOutputField:"frequencies"});

Properties

The ExpandTextFrequency operator has the following properties.

Name	Type	Description
textInputField	String	The name of the field containing the word or n-gram list in the input that will be expanded. If this field does not exist in the input or is not a WordMap or NGramMap, an exception will be issued at composition time.
textOutputField	String	The name of the output field that will contain the expanded word list. If unspecified, it will not be included in the output.
freqOutputField	String	The name of the output field that will contain the expanded frequency list. If unspecified, it will not be included in the output.
relative	boolean	Whether absolute or relative frequencies will be output. Default: false for absolute frequencies.

Ports

The ExpandTextFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the frequency maps that will be expanded.

The ExpandTextFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the expanded text and frequencies produced from the input data.

GenerateBagOfWords Operator

The GenerateBagOfWords operator can be used to determine all of the distinct words in a TokenizedText field. The GenerateBagOfWords operator has two main properties that define the input field containing the tokenized text and the output field for the words.

Code Example

This example demonstrates using the GenerateBagOfWords operator to determine the distinct words in the tokenized text field.

Using the GenerateBagOfWords operator in Java

//Create a GenerateBagOfWords operator
GenerateBagOfWords bow = graph.add(new GenerateBagOfWords("messageTokens"));
bow.setOutputField("words");

Using the GenerateBagOfWords operator in JavaScript

//Create a GenerateBagOfWords operator
var results = dr.generateBagOfWords(data, {inputField:"messageTokens", outputField:"words"});

Properties

The GenerateBagOfWords operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field for which to generate the bag of words. If this field does not exist in the input or it is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the word set. Defaults to Word if unspecified.

Ports

The GenerateBagOfWords operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to generate the bag of words.

The GenerateBagOfWords operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the word set field produced from the input data.

DrawDiagnosticsChart Operator

To build diagnostic charts, the DrawDiagnosticsChart operator uses confidence values along with the actual target values (true class).

Note: These values are obtained using the input of this operator and output of one or multiple predictors.

The supported chart types are:

ROC chart

Provides a comparison between the true positive rate (y-axis) and false positive rate (x-axis) when the confidence threshold decreases.

Gains chart (CRC, Cumulative Response Chart)

Provides the change in true positive rate (y-axis) when the percentage of the targeted population (x-axis) decreases.

Lift chart

Provides the change in lift (y-axis) which is the ratio between predictor result and baseline when the percentage of the targeted population (x-axis) increases.

The true positive rate is the ratio of all correct positive classifications and total number of positive samples in the test set. The false positive rate is the ratio of all incorrect positive classifications and total number of negative samples in the test set.
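The two rates defined above can be computed for a single confidence threshold with a short plain-Java sketch. This is an illustration only and is not how the operator is implemented: the class RocPointSketch is hypothetical, and it assumes that a sample is classified as positive when its confidence is greater than or equal to the threshold. Evaluating the method over a decreasing sequence of thresholds gives the points of a ROC curve.

```java
// Hypothetical illustration; not part of the DataFlow operator library.
final class RocPointSketch {
    // Returns {falsePositiveRate, truePositiveRate} for one threshold.
    static double[] rates(double[] confidence, String[] actual,
                          String targetValue, double threshold) {
        int truePositives = 0;
        int falsePositives = 0;
        int positives = 0;
        int negatives = 0;
        for (int i = 0; i < confidence.length; i++) {
            boolean isPositive = targetValue.equals(actual[i]);
            boolean predictedPositive = confidence[i] >= threshold; // assumed rule
            if (isPositive) {
                positives++;
                if (predictedPositive) truePositives++;
            } else {
                negatives++;
                if (predictedPositive) falsePositives++;
            }
        }
        // Guard against division by zero when a class is absent from the test set.
        double tpr = positives == 0 ? 0.0 : (double) truePositives / positives;
        double fpr = negatives == 0 ? 0.0 : (double) falsePositives / negatives;
        return new double[] {fpr, tpr};
    }

    public static void main(String[] args) {
        double[] confidence = {0.9, 0.8, 0.4, 0.3};
        String[] actual = {"E", "P", "E", "P"};
        double[] point = rates(confidence, actual, "E", 0.5);
        System.out.println("FPR=" + point[0] + " TPR=" + point[1]); // FPR=0.5 TPR=0.5
    }
}
```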

The DrawDiagnosticsChart operator can accept one to five predictor operators as input sources. Each predictor output must contain a column for the actual target values and confidence values (probability, score) assigned by the given predictor.

Note: This operator is not parallelizable and runs on a single node in cluster mode.

Output Chart

The following images are examples for the chart types.

Gains chart example (image: gains.png)

Lift chart example (image: lift.png)

ROC chart example (image: roc.png)

Code Example

Using the DrawDiagnosticsChart operator in Java

// Create a diagnostics chart drawer with two input ports
DrawDiagnosticsChart drawer = graph.add(new DrawDiagnosticsChart(2));
drawer.setConfidenceFieldNames(Arrays.asList("NaiveBayesConfidence", "BaseConfidence"));
drawer.setTargetFieldNames(Arrays.asList("target", "target"));
drawer.setChartNames(Arrays.asList("NAIVE BAYES", "BASE"));
drawer.setChartType(ChartType.GAINS);
drawer.setResultSize(10);
drawer.setTargetValue("E");
drawer.setOutputPath("chart.png");

graph.connect(bayesPredictor.getOutput(), drawer.getInput(0));
graph.connect(basePredictor.getOutput(), drawer.getInput(1));

Using the DrawDiagnosticsChart operator in RushScript

// Using default settings: five (optional) input ports, values for disconnected ports as nulls.
var drawer = dr.drawDiagnosticsChart("chart", bayesPredictor, null, basePredictor, null, null, {
  confidenceFieldNames: ["NaiveBayesConfidence", null, "BaseConfidence", null, null],
  targetFieldNames: ["target", null, "target", null, null],
  chartNames: ["NAIVE BAYES", null, "BASE", null, null],
  outputPath: "chart.png",
  chartType: com.pervasive.datarush.analytics.viz.ChartType.GAINS,
  resultSize: 10,
  targetValue: "E"
});

Properties

The DrawDiagnosticsChart operator provides the following properties.

Name	Type	Description
resultSize	int	The number of points used to create the chart. The sequence of confusion matrices (one per input data point) is split into the given number of equal-sized slices. The result provides the reduced set that is used for creating the chart.
targetValue	String	A value of the target domain that defines the 'true class'.
chartType	ChartType	ROC, Gains, or Lift.
outputPath	String	The path (local file system or HDFS) of the output file (PNG). This property is optional. If the path is not defined, then the data is not written to any file.
chartNames	List<String>	The predictor names for each input port that is displayed as the chart legend. Note: For disconnected input ports, a null value is entered.
confidenceFieldNames	List<String>	The confidence field names for each input port. For any given field name, a field of numeric data type must exist in the corresponding input port schema. Note: For disconnected input ports, a null value is entered.
targetFieldNames	List<String>	The actual target field names for all input ports. For each given field name, a field of string data type must exist in the corresponding input port schema. Note: For disconnected input ports, a null value is entered.

Ports

The DrawDiagnosticsChart operator provides an arbitrary number of input ports. It is configured using constructor arguments. The default setting is five input ports. Only port 0 is mandatory.

Name	Type	Get Method	Description
input_x	RecordPort	getInput(int)	The predictor data containing a confidence field and target field.

The DrawDiagnosticsChart operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The confusion matrix data and the x-axis and y-axis values for the diagnostic chart. Generally, this port can be dismissed and the operator can be a sink operator.

Last modified date: 03/10/2025