Statistics

The DataFlow operator library contains several prebuilt statistics and data summarizer operators. This topic covers each of those operators and provides details on how to use them.

Statistics Operators

• Using the DataQualityAnalyzer Operator to Analyze Data Quality

• Using the SummaryStatistics Operator to Calculate Data Statistics

• Using the DistinctValues Operator to Find Distinct Values

• Using the NormalizeValues Operator to Normalize Values

• Using the Rank Operator to Rank Data

• Using the SumOfSquares Operator to Compute a Sum of Squares

• Using the CountRanges Operator to Count Ranges

• Using the MostFrequentValues Operator to Find Common Values

Using the DataQualityAnalyzer Operator to Analyze Data Quality

The DataQualityAnalyzer operator is used to evaluate a set of quality tests on an input data set. Those records for which all tests pass are considered "clean" and are thus sent to the clean output. Those records for which any tests fail are considered "dirty" and are thus sent to the dirty output.

This operator also produces a PMML summary model that includes the following statistics:

• totalFrequency: Total number of rows.

• invalidFrequency: Total number of rows for which at least one test involving the given field failed.

• testFailureCounts: Per-test failure counts for each test involving the given field.
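The clean/dirty routing and the per-test failure counts described above can be pictured with a small plain-Java model. This is only a sketch of the semantics, not DataFlow code: the class name QualitySplitSketch, the use of maps for records, and the predicate representation are all assumptions made for illustration, and the per-field invalidFrequency statistic is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative model of the clean/dirty split; this is not DataFlow code.
class QualitySplitSketch {
    final List<Map<String, Object>> clean = new ArrayList<>();
    final List<Map<String, Object>> dirty = new ArrayList<>();
    final Map<String, Integer> testFailureCounts = new LinkedHashMap<>();
    int totalFrequency = 0;

    // Applies every named test to every row and routes the row accordingly.
    void analyze(List<Map<String, Object>> rows,
                 Map<String, Predicate<Map<String, Object>>> tests) {
        for (String name : tests.keySet()) {
            testFailureCounts.putIfAbsent(name, 0);
        }
        for (Map<String, Object> row : rows) {
            totalFrequency++;
            boolean allPassed = true;
            for (Map.Entry<String, Predicate<Map<String, Object>>> test : tests.entrySet()) {
                if (!test.getValue().test(row)) {
                    allPassed = false;
                    testFailureCounts.merge(test.getKey(), 1, Integer::sum);
                }
            }
            // A row is clean only if every test passed.
            (allPassed ? clean : dirty).add(row);
        }
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, Object>>> tests = new LinkedHashMap<>();
        tests.put("class_not_null", r -> r.get("class") != null);
        tests.put("length_gt_zero", r -> ((Double) r.get("petal length")) > 0);

        Map<String, Object> good = new HashMap<>();
        good.put("class", "setosa");
        good.put("petal length", 1.4);
        Map<String, Object> bad = new HashMap<>();
        bad.put("class", null);
        bad.put("petal length", 0.0);

        QualitySplitSketch sketch = new QualitySplitSketch();
        sketch.analyze(Arrays.asList(good, bad), tests);
        System.out.println(sketch.clean.size() + " clean, " + sketch.dirty.size() + " dirty");
        System.out.println(sketch.testFailureCounts);
    }
}
```

A row that fails several tests increments each failing test's counter but is sent to the dirty list only once.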

Using Expressions to Create Quality Metrics

Quality metrics can be specified by using the expression language. Any number of quality metrics can be specified by passing a single expression directly to the DataQualityAnalyzer operator. The syntax of a quality metric expression is:

<predicate expression 1> as <metric name 1>[, <predicate expression 2> as <metric name 2>, ...]

Each expression must be a predicate expression that returns a boolean value. For example, the following expression can be passed directly to the DataQualityAnalyzer, assuming your input has the specified input fields:

class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero

As with field names used elsewhere within expressions, the metric name can be surrounded by back-ticks if it contains non-alphanumeric characters, such as in the expression:

class is not null as `class-not-null`

For more information about syntax and available functions, see the Expression Language.

Code Example

This example demonstrates using the DataQualityAnalyzer operator to ensure the "class" field is non-null and that the petal measurements are greater than zero. This example uses a quality metric expression to specify the metrics to apply to the input data.

Using the DataQualityAnalyzer operator in Java

// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
String qualityTests =
    "class is not null as class_not_null, " +
    "`petal length` > 0 as length_gt_zero, " +
    "`petal width` > 0 as width_gt_zero";
dqa.setTests(qualityTests);

This example demonstrates configuring the DataQualityAnalyzer operator by creating QualityTest instances directly.

Using the DataQualityAnalyzer operator in Java

// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
QualityTest test1 = new QualityTest(
                "class_not_null",
                Predicates.notNull("class"));
QualityTest test2 = new QualityTest(
                "length_gt_zero",
                Predicates.gt(FieldReference.value("petal length"), ConstantReference.constant(0)));
QualityTest test3 = new QualityTest(
                "width_gt_zero",
                Predicates.gt(FieldReference.value("petal width"), ConstantReference.constant(0)));
dqa.setTests(Arrays.asList(test1, test2, test3));

Using the DataQualityAnalyzer operator in RushScript

var results = dr.dataQualityAnalyzer(data, {tests:'class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero'});

Properties

The DataQualityAnalyzer operator provides the following properties.

Name	Type	Description
tests	List<DataQualityAnalyzer.QualityTest> or String	The set of tests to apply to the input data set. The quality tests can be specified using an expression (String) or as a list of QualityTest instances.

Ports

The DataQualityAnalyzer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data set to be tested.

The DataQualityAnalyzer operator provides the following output ports.

Name	Type	Get Method	Description
clean	RecordPort	getClean()	The output port for the "clean" rows.
dirty	RecordPort	getDirty()	The output port for the "dirty" rows.
model	PMMLPort	getModel()	The output port for the PMML statistics model.

Using the SummaryStatistics Operator to Calculate Data Statistics

The SummaryStatistics operator discovers various metrics of an input data set based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.

If the detail level is SINGLE_PASS_ONLY_SIMPLE, the following statistics are calculated.

Statistic	Description	Field Types
Missing count	Number of missing values per field	all
Min	The minimum value per field	all
Max	The maximum value per field	all
Mean	The mean value per field	int, long, float, double, numeric
Stddev	The standard deviation per field	int, long, float, double, numeric
Variance	The variance per field	int, long, float, double, numeric
Sum	The sum per field	int, long, float, double, numeric
Sum of squares	The sum of squares per field	int, long, float, double, numeric
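To see why these statistics fit in a single pass, note that each of them can be derived from a few running totals. The following plain-Java sketch is an illustration only, not the operator's implementation. The class name SinglePassStatsSketch is hypothetical, and the variance shown is the population form (dividing by n); whether SummaryStatistics divides by n or n-1 is not stated here.

```java
// Illustrative single-pass accumulator; not the SummaryStatistics implementation.
class SinglePassStatsSketch {
    long count = 0;      // number of non-missing values seen
    long missing = 0;    // number of missing (null) values seen
    double sum = 0.0;
    double sumOfSquares = 0.0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    // Updates the running totals with one value; null counts as missing.
    void accept(Double value) {
        if (value == null) {
            missing++;
            return;
        }
        count++;
        sum += value;
        sumOfSquares += value * value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    double mean() {
        return sum / count;
    }

    // Population variance (assumption): E[x^2] - (E[x])^2.
    double variance() {
        double m = mean();
        return sumOfSquares / count - m * m;
    }

    double stddev() {
        return Math.sqrt(variance());
    }

    public static void main(String[] args) {
        SinglePassStatsSketch stats = new SinglePassStatsSketch();
        for (Double v : new Double[] {1.0, 2.0, null, 3.0, 4.0}) {
            stats.accept(v);
        }
        System.out.println("mean=" + stats.mean() + " variance=" + stats.variance()
                + " missing=" + stats.missing);
    }
}
```

The sum-of-squares shortcut is simple but can lose precision when the mean is large relative to the spread of the values.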

If the detail level is SINGLE_PASS_ONLY, all of the statistics that are calculated for SINGLE_PASS_ONLY_SIMPLE are calculated. In addition, the following are also calculated.

Statistic	Description	Field Types
Correlation	A matrix is produced where the elements correspond to correlation of pairs of fields.	int, long, float, double, numeric
Covariance	A matrix is produced where the elements correspond to covariance of pairs of fields.	int, long, float, double, numeric

If the detail level is MULTI_PASS, all of the statistics that are calculated for SINGLE_PASS_ONLY are calculated. In addition, the following are also calculated.

Statistic	Description	Field Types
Intervals	Includes counts, sums, and sum of squares for equal-sized intervals. The number of intervals is configurable through the rangeCount property.	int, long, float, double, numeric
Value Counts	Includes the most frequent values and their counts for each field. The number of values to calculate per field is configurable through the showTopHowMany property.	all
Quantiles	The per-field quantiles (equi-depth histograms). The quantiles to calculate are configurable through the quantilesToCalculate property.	int, long, float, double, numeric
Median	The per-field median value.	int, long, float, double, numeric
Inter-quartile-range	The per-field inter-quartile-range.	int, long, float, double, numeric

IMPORTANT! Select a data type that is wide enough to avoid overflows. If overflows occur, try a larger data type, for example double instead of float, or numeric instead of double.

Code Example

This example calculates summary statistics for the Iris data set. The SummaryStatistics operator produces a PMML model containing summary statistics. This example writes the PMML to a file. It also obtains an in-memory reference to the statistics model and prints selected statistics for the numeric fields to standard output.

Using the SummaryStatistics operator in Java

import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.pmml.PMMLModel;
import com.pervasive.datarush.analytics.pmml.PMMLPort;
import com.pervasive.datarush.analytics.pmml.WritePMML;
import com.pervasive.datarush.analytics.stats.DetailLevel;
import com.pervasive.datarush.analytics.stats.PMMLSummaryStatisticsModel;
import com.pervasive.datarush.analytics.stats.SummaryStatistics;
import com.pervasive.datarush.analytics.stats.UnivariateStats;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.model.GetModel;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

public class IrisSummaryStats {
    public static void main(String[] args) {

        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("SummaryStats");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
                DOUBLE("sepal length"),
                DOUBLE("sepal width"),
                DOUBLE("petal length"),
                DOUBLE("petal width"),
                STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Run summary statistics on the data
        SummaryStatistics summaryStats = graph.add(new SummaryStatistics());
        summaryStats.setDetailLevel(DetailLevel.MULTI_PASS);
        summaryStats.setShowTopHowMany(25);
        graph.connect(reader.getOutput(), summaryStats.getInput());

        // Use the GetModel operator to obtain a reference to the statistics model.
        // This reference is valid after the graph is run and can be used to then
        // access the statistics model outside of the graph.
        GetModel<PMMLModel> modelOp = graph.add(new GetModel<PMMLModel>(PMMLPort.FACTORY));
        graph.connect(summaryStats.getOutput(), modelOp.getInput());

        // Write the PMML generated by summary stats
        WritePMML pmmlWriter = graph.add(new WritePMML("results/iris-summarystats.pmml"));
        graph.connect(summaryStats.getOutput(), pmmlWriter.getModel());

        // Compile and run the graph
        graph.run();

        // Use the model reference to get the actual stats model
        PMMLSummaryStatisticsModel statsModel = (PMMLSummaryStatisticsModel) modelOp.getModel();

        // Print out stats for the numeric fields
        for (String fieldName : new String[] {"sepal length", "sepal width", "petal length", "petal width"}) {
            UnivariateStats fieldStats = statsModel.getFieldStats(fieldName);
            System.out.println("Field: " + fieldName);
            System.out.println("  frequency = " + fieldStats.getTotalFrequency());
            System.out.println("  missing   = " + fieldStats.getMissingFrequency());
            System.out.println("  min       = " + fieldStats.getNumericInfo().getMin());
            System.out.println("  max       = " + fieldStats.getNumericInfo().getMax());
            System.out.println("  mean      = " + fieldStats.getNumericInfo().getMean());
            System.out.println("  stddev    = " + fieldStats.getNumericInfo().getStddev());
        }
    }
}

Using the SummaryStatistics operator in RushScript

var results = dr.summaryStatistics(data, {includedFields:"sepal length", detailLevel:DetailLevel.MULTI_PASS});

Properties

The SummaryStatistics operator has the following properties.

Name	Type	Description
detailLevel	DetailLevel	The detail level that is used to compute statistics. The default value is SINGLE_PASS_ONLY.
showTopHowMany	int	Provides a cap on the number of value counts to calculate. Default: 25. Memory usage is proportional to the number of distinct values; thus only the top showTopHowMany values are calculated in order to avoid excessive memory consumption in the event that the number of distinct values for a given field is large. This setting is ignored if the detail level is SINGLE_PASS_ONLY.
rangeCount	int	The number of interval counts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is SINGLE_PASS_ONLY.
quantilesToCalculate	List<BigDecimal>	The quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles).
includedFields	List<String>	The fields from the input data set for which we are collecting statistics. The default value of "empty list" implies "all fields".
fewDistinctValuesHint	boolean	This should be set to true if a small number of values per column is expected. This will have a large performance benefit, particularly in the cluster, since we can then avoid the overhead of parallelizing computation of quantiles, and so on.

Ports

The SummaryStatistics operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data set that is used to build the summary model.

The SummaryStatistics operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	Returns a PMML model corresponding to the ModelStats element described at http://www.dmg.org/v4-0-1/Statistics.html.

Using the DistinctValues Operator to Find Distinct Values

The DistinctValues operator calculates the distinct values of the given input field. This produces a record consisting of the input field with only the distinct values, and a count field with the number of occurrences of each value.

Code Example

This example calculates the number of distinct types of iris present in the data set.

Using the DistinctValues operator in Java

Using the DistinctValues operator in RushScript

var results = dr.distinctValues(data, {inputField:"class"});

Properties

The DistinctValues operator provides the following properties.

Name	Type	Description
inputField	String	The input field for which we calculate distinct values.
sortByCount	boolean	Whether to sort the output by value count.

Ports

The DistinctValues operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The DistinctValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output consisting of two fields, the distinct values and the counts.

Using the NormalizeValues Operator to Normalize Values

The NormalizeValues operator applies normalization methods to fields within an input data flow. The results of the normalization methods are available in the output flow. All input fields are present in the output with the addition of the calculated normalizations.

Normalization methods require certain statistics about the input data such as the mean, standard deviation, minimum value, maximum value, and so on. These statistics are captured in a PMMLModel. The statistics can be gathered by an upstream operator such as SummaryStatistics and passed into this operator. If not, they will be calculated with a first pass over the data and then applied in a second pass.
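The two-step flow described above (gather statistics, then apply them) can be sketched for the z-score method in plain Java. This is a minimal illustration, not the operator's code: the class name ZScoreSketch is hypothetical, and the standard deviation is the population form (dividing by n), which is an assumption about the convention used.

```java
// Illustrative z-score normalization; not the NormalizeValues implementation.
class ZScoreSketch {
    static double[] zScores(double[] values) {
        int n = values.length;

        // Statistics step: the role played by an upstream SummaryStatistics
        // operator or by the first pass over the data.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        double stddev = Math.sqrt(squaredDeviations / n); // population form (assumption)

        // Normalization step: apply the statistics to every value.
        double[] normalized = new double[n];
        for (int i = 0; i < n; i++) {
            normalized[i] = (values[i] - mean) / stddev;
        }
        return normalized;
    }

    public static void main(String[] args) {
        double[] z = zScores(new double[] {2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(java.util.Arrays.toString(z));
    }
}
```

If every value in a field is identical, the standard deviation is zero and this sketch produces NaN; a real implementation needs a policy for that case.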

Code Example

This example normalizes the Iris data set using the z-score method.

Using the NormalizeValues operator in Java

Using the NormalizeValues operator in RushScript

var scoreFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.normalizeValues(data, {scoreFields:scoreFields, method:NormalizeMethod.ZSCORE});

Properties

The NormalizeValues operator provides the following properties.

Name	Type	Description
includeInputFields	boolean	The indicator of whether to include the input fields in the output data. Setting this property to true causes the input values to be transferred to the output. Otherwise the input values are excluded, leaving only the transformed fields in the output data. Default: true.
method	NormalizeMethod	The normalization method to use.
scoreFields	List<String>	The names of the input fields to normalize. If no field names are provided, all fields will be transformed by default.

Ports

The NormalizeValues operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.
modelInput	PMMLPort	getModelInput()	The optional input port used to provide the PMML model containing field statistics needed by normalization methods.

The NormalizeValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The normalized output data.

Using the Rank Operator to Rank Data

The Rank operator ranks data using the given rank mode. The data is grouped by the given partition fields and is sorted within each group by the ranking fields. For example, to rank employees by salary from highest to lowest within each department, partition by the department and rank by the salary in descending sort order.

Three different rank modes are supported:

STANDARD

Also known as competition ranking. Items with equal ranking values receive the same rank, and a gap is then left in the ranking numbers. For example: 1, 2, 2, 4

DENSE

Items that compare as equal receive the same rank. Items following those receive the next ordinal rank (that is, ranks are not skipped). For example: 1, 2, 2, 3

ORDINAL

Each item receives a distinct rank, starting at one and increasing by one, producing essentially a row number within the partition. For example: 1, 2, 3, 4

A new output field is created to contain the result of the ranking. The field is named "rank" by default.
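The difference between the three modes is easiest to see on a small sorted input with one tie. The following plain-Java sketch models the ranking of a single, already sorted partition. It is an illustration only: the class RankModesSketch and its Mode enum are hypothetical stand-ins and are not the DataFlow Rank operator or its RankMode type.

```java
import java.util.Arrays;

// Illustrative ranking of one sorted partition; not the Rank operator itself.
class RankModesSketch {
    enum Mode { STANDARD, DENSE, ORDINAL }

    // Assigns ranks to values that are already sorted in ranking order.
    static int[] rank(double[] sorted, Mode mode) {
        int[] ranks = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            boolean tie = i > 0 && sorted[i] == sorted[i - 1];
            switch (mode) {
                case ORDINAL:
                    ranks[i] = i + 1;                      // plain row number
                    break;
                case STANDARD:
                    ranks[i] = tie ? ranks[i - 1] : i + 1; // ties share a rank, then a gap
                    break;
                case DENSE:
                    ranks[i] = tie ? ranks[i - 1]
                            : (i == 0 ? 1 : ranks[i - 1] + 1); // ties share a rank, no gap
                    break;
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] values = {10, 20, 20, 30};
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " " + Arrays.toString(rank(values, mode)));
        }
    }
}
```

For the input 10, 20, 20, 30 the sketch yields the three sequences quoted in the mode descriptions above.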

Code Example

In this example we use the Rank operator to order the Iris data set by the "sepal length" field, partitioning by the "class" field.

Using the Rank operator in Java

Using the Rank operator in RushScript

var results = dr.rank(data, {partitionBy:'class', rankBy:'"sepal length" desc', mode:RankMode.STANDARD});

Properties

The Rank operator provides the following properties.

Name	Type	Description
mode	RankMode	The ranking mode. Ordinal ranking is used by default.
outputField	String	The name of the output field containing the result of the ranking. Defaults to "rank".
partitionKeys	List<String>	The fields used to partition the data. At least one field must be specified.
rankKeys	List<SortKey>	The fields used to rank the data, that is, the set of fields used to calculate the ranking within each partitioned group. The data within each partition is sorted in the specified order. A list of Strings can also be used, in which case the sort order defaults to ascending.

Ports

The Rank operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The Rank operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The original data with the additional rank field.

Using the SumOfSquares Operator to Compute a Sum of Squares

The SumOfSquares operator computes the sum of squares for the given fields in the input data. The inner products are calculated in a distributed fashion with a reduction at the end to produce the sum of squares matrix. Note that all the fields must be of type double or be assignable to a double type.
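The "partial results plus a final reduction" pattern mentioned above can be sketched in plain Java. This is an illustration under assumptions, not the operator's implementation: the class name SumOfSquaresSketch is hypothetical, and the matrix layout (entry [i][j] holds the sum over all rows of field i times field j) is one plausible reading of "sum of squares matrix".

```java
import java.util.Arrays;

// Illustrative distributed sum of squares; not the SumOfSquares implementation.
class SumOfSquaresSketch {
    // Computes the inner products for one partition of the data.
    static double[][] partial(double[][] rows, int fieldCount) {
        double[][] matrix = new double[fieldCount][fieldCount];
        for (double[] row : rows) {
            for (int i = 0; i < fieldCount; i++) {
                for (int j = 0; j < fieldCount; j++) {
                    matrix[i][j] += row[i] * row[j];
                }
            }
        }
        return matrix;
    }

    // Reduction step: partial matrices combine by element-wise addition.
    static double[][] combine(double[][] a, double[][] b) {
        double[][] result = new double[a.length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a.length; j++) {
                result[i][j] = a[i][j] + b[i][j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] partitionA = {{1, 2}};
        double[][] partitionB = {{3, 4}};
        double[][] total = combine(partial(partitionA, 2), partial(partitionB, 2));
        System.out.println(Arrays.deepToString(total));
    }
}
```

Because addition is associative, the partitions can be processed independently and combined in any order, which is what makes the computation easy to distribute.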

Code Example

The following example demonstrates computing the sum of squares matrix over three double fields.

Using the SumOfSquares operator in Java

// Calculate the Sum of Squares
SumOfSquares sos = graph.add(new SumOfSquares());
sos.setFieldNames(Arrays.asList("dblfield1", "dblfield2", "dblfield3"));

Using the SumOfSquares operator in RushScript

var results = dr.sumOfSquares(data, {
fieldNames:['dblfield1', 'dblfield2', 'dblfield3']});

Properties

The SumOfSquares operator provides the following properties.

Name	Type	Description
fieldNames	List<String>	The list of fields for which to compute the sum of squares. The field names must be valid names within the schema of the input port. The field types must be compatible with the double type.

Ports

The SumOfSquares operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data used to build the model.

The SumOfSquares operator provides a single output port.

Name	Type	Get Method	Description
output	SimpleModelPort	getOutput()	This port will contain the sum of squares for the specified fields in a matrix. Only a single token will be available on this output port.

Using the CountRanges Operator to Count Ranges

The CountRanges operator determines which range bin each value in the input data set falls into and calculates the total number of values that fall within each range. The operation is defined by a list of breakpoints that are used as the boundaries for the ranges. The range groups are automatically sorted in ascending order, based on the comparable interface of the field, and the entire range of possible values is always considered. A list of n breaks defines n+1 range groups, which are indexed beginning with 1. The first and last groups are each unbounded on one side.

The closure of the range intervals can be adjusted by enabling closed lower or closed upper bounds, which includes values equal to a boundary in the respective interval. Because a value can be included in only one range group, the lower and upper bounds cannot both be closed. Any value that is not included in a range group, such as a null value or a boundary value excluded by open bounds, is assigned to group 0.

A new output field is created to contain the range group of the specified field. The field is named after the original field with "_RangeGroup" appended to the name. The statistics output of this operator outputs the counts of the defined range groups. This output includes two fields, the range group index and the total number of values within that group.
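The group assignment rules can be made concrete with a small plain-Java function. This is a sketch of one reading of the rules above, not the operator's code: the class name RangeGroupSketch is hypothetical, the breaks are assumed to be sorted in ascending order already, and the treatment of a boundary value (it joins the range it ends when the upper bound is closed, or the range it starts when the lower bound is closed) is an interpretation of the description.

```java
// Illustrative range group assignment; not the CountRanges implementation.
class RangeGroupSketch {
    // Returns the 1-based range group of a value, or 0 if it falls in no group.
    static int rangeGroup(Double value, double[] sortedBreaks,
                          boolean lowerBoundClosed, boolean upperBoundClosed) {
        if (lowerBoundClosed && upperBoundClosed) {
            throw new IllegalArgumentException("Both bounds cannot be closed");
        }
        if (value == null) {
            return 0; // nulls belong to no range group
        }
        int breaksBelow = 0; // number of breaks strictly less than the value
        for (double b : sortedBreaks) {
            if (value == b) {
                if (upperBoundClosed) {
                    return breaksBelow + 1; // boundary joins the range it ends
                }
                if (lowerBoundClosed) {
                    return breaksBelow + 2; // boundary joins the range it starts
                }
                return 0; // open bounds exclude the boundary value
            }
            if (b < value) {
                breaksBelow++;
            }
        }
        return breaksBelow + 1;
    }

    public static void main(String[] args) {
        double[] breaks = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        for (Double v : new Double[] {0.5, 1.0, 1.4, 6.9, null}) {
            System.out.println(v + " -> group " + rangeGroup(v, breaks, false, true));
        }
    }
}
```

With six breaks there are seven groups: values below the first break fall in group 1 and values above the last break fall in group 7.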

Code Example

In this example, we use the CountRanges operator to count values of the Iris data set and bin the values of the "petal length" field.

Using the CountRanges operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("CountRangesIris");

// Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
RecordTokenType irisType = record(
        DOUBLE("sepal length"),
        DOUBLE("sepal width"),
        DOUBLE("petal length"),
        DOUBLE("petal width"),
        STRING("class"));
reader.setSchema(TextRecord.convert(irisType));

// To ensure sort order is preserved
AssertSorted assertSort = graph.add(new AssertSorted());
assertSort.setOrdering("petal length");
graph.connect(reader.getOutput(), assertSort.getInput());

// Initialize the CountRanges operator. The settings below mirror the
// RushScript example for this operator; adjust them for your own data.
CountRanges countRanges = graph.add(new CountRanges());
countRanges.setFieldName("petal length");
countRanges.setBreaks(Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0, 6.0));
countRanges.setLowerBoundClosed(false);
countRanges.setUpperBoundClosed(true);

// Connect the sorted data to CountRanges
graph.connect(assertSort.getOutput(), countRanges.getInput());

// Write the data with the added range group field
String outputPath = "results/iris-CountRanges.txt";
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();

// Write the per-group counts to a second file next to the first
String statsOutputPath = outputPath.replace(".txt", "-stats.txt");
WriteDelimitedText statswriter = graph.add(new WriteDelimitedText(statsOutputPath, WriteMode.OVERWRITE));
statswriter.setFieldDelimiter("");
statswriter.setHeader(true);
statswriter.disableParallelism();

// Connect CountRanges to the writers
graph.connect(countRanges.getOutput(), writer.getInput());
graph.connect(countRanges.getStatsOutput(), statswriter.getInput());

// Compile and run the graph
graph.run();

Using the CountRanges operator in RushScript

var breaks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

var results = dr.countRanges(data, {fieldName:"petal length", upperBoundClosed:true, breaks:breaks})

Properties

The CountRanges operator provides the following properties.

Name	Type	Description
fieldName	String	The name of the field whose values will be divided into ranges.
breaks	List	The values that will be used as the boundaries for the ranges. These should be of the same type as the selected field.
lowerBoundClosed	boolean	Whether the lower boundary of a range is included in the group. Default: false.
upperBoundClosed	boolean	Whether the upper boundary of a range is included in the group. Default: false.

Ports

The CountRanges operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The CountRanges operator provides two output ports.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The original data with the additional range group field.
statsOutput	RecordPort	getStatsOutput()	The count data for the defined ranges.

Using the MostFrequentValues Operator to Find Common Values

The MostFrequentValues operator is used to determine which values are the most frequent within the selected fields of the input data. A maximum must be specified to indicate how many of the most common values should be output.

The output contains two fields for each field selected from the input: a value field that holds the most frequent values of that input field, and an associated field that holds the frequency count of each value.
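The idea of keeping only the top values by frequency can be sketched in plain Java for a single field. This is an illustration only, not the operator's implementation: the class name TopValuesSketch is hypothetical, and breaking ties by ascending value is an assumption, since the tie-breaking rule is not described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative top-N frequency count for one field; not the MostFrequentValues operator.
class TopValuesSketch {
    static List<Map.Entry<Double, Long>> topValues(double[] values, int showTopHowMany) {
        // Count the occurrences of each distinct value.
        Map<Double, Long> counts = new HashMap<>();
        for (double v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        // Sort by count descending; ties are broken by ascending value (assumption).
        List<Map.Entry<Double, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<Double, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<Double, Long>comparingByKey()));
        // Keep at most showTopHowMany entries.
        return new ArrayList<>(entries.subList(0, Math.min(showTopHowMany, entries.size())));
    }

    public static void main(String[] args) {
        double[] petalLengths = {1.5, 1.4, 1.5, 1.3, 1.4, 1.5};
        for (Map.Entry<Double, Long> e : topValues(petalLengths, 2)) {
            System.out.println(e.getKey() + " occurs " + e.getValue() + " times");
        }
    }
}
```

This sketch counts every distinct value before truncating, so its memory use grows with the number of distinct values in the field.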

Code Example

In this example we use the MostFrequentValues operator to find the top 5 values in each of the numeric fields in the Iris data set.

Using the MostFrequentValues operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("mostFreqValuesIris");

// Create a delimited text reader for the Iris data.
// getResourcePath is a helper defined elsewhere that resolves the data file location.
ReadDelimitedText reader = graph.add(new ReadDelimitedText(getResourcePath("iris.txt")));
reader.setFieldSeparator(" ");
reader.setHeader(true);
RecordTokenType irisType = record(
        DOUBLE("sepal length"),
        DOUBLE("sepal width"),
        DOUBLE("petal length"),
        DOUBLE("petal width"),
        STRING("class"));
reader.setSchema(TextRecord.convert(irisType));

// Initialize the MostFrequentValues operator.
// fieldNames, topNum, and outputPath are supplied by the caller. For this
// example, fieldNames holds the four numeric Iris fields and topNum is 5.
MostFrequentValues mostFreq = graph.add(new MostFrequentValues());
if (fieldNames.length > 0) {
    mostFreq.setFieldNames(fieldNames);
}
if (topNum >= 0) {
    mostFreq.setShowTopHowMany(topNum);
}

// Connect the reader to MostFrequentValues
graph.connect(reader.getOutput(), mostFreq.getInput());

// Write the most frequent values and their counts
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
writer.setSaveMetadata(false);

// Connect MostFrequentValues to the writer
graph.connect(mostFreq.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();

Using the MostFrequentValues operator in RushScript

var freqFields = ["sepal length", "sepal width", "petal length", "petal width"];

var results = dr.mostFrequentValues(data, {fieldNames:freqFields, showTopHowMany:5});

Properties

The MostFrequentValues operator provides the following properties.

Name	Type	Description
fieldNames	List<String>	The names of the input fields for which to calculate value frequencies.
showTopHowMany	int	The max number of value frequencies to calculate. The default is 25.
fewDistinctValuesHint	boolean	A hint as to whether there are expected to be a small number of distinct values.

Ports

The MostFrequentValues operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The MostFrequentValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output consisting of two fields, the frequent values and the counts.

Support Vector Machine

Support Vector Machines within DataFlow

A support vector machine (SVM) is a supervised learning model that analyzes data and recognizes patterns. It is often used for data classification.

A basic SVM is a nonprobabilistic binary linear classifier, which means it takes a set of input data and predicts, for each given input, which of two possible classes forms the output. In addition, support vector machines can efficiently perform nonlinear classification using what is called the kernel trick, which implicitly maps their inputs into high-dimensional feature space.

DataFlow provides operators to produce and utilize SVM models. The learner is used to determine the classification rules for a particular data set, while the predictor can apply these rules to a data set.

Covered SVM Operators

• Using the SVMLearner Operator to Build a PMML Support Vector Machine Model

• Using the SVMPredictor Operator to Apply a Support Vector Machine Model

Using the SVMLearner Operator to Build a PMML Support Vector Machine Model

The SVMLearner operator is responsible for building a PMML Support Vector Machine model from input data. It is implemented as a wrapper for the LIBSVM library found at http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm/.

Code Example

This example uses the SVMLearner operator to train a predictive model based on the Iris data set. It uses the "class" field within the Iris data as the target column.

Using the SVMLearner operator in Java

Using the SVMLearner operator in RushScript

var learningColumns = ["sepal length", "sepal width", "petal length", "petal width"];
var kernel = new PolynomialKernelType().setGamma(3);
var type = new SVMTypeCSvc("class", 1);
var learner = dr.svmLearner(data, {includedColumns:learningColumns, kernelType:kernel, type:type});

Properties

The SVMLearner operator provides the following properties.

Name	Type	Description
epsilon	double	The tolerance for termination criteria. Default: 0.001. Larger values will terminate early but provide less precise results. Directly maps to the LIBSVM "-e" command line flag.
includedColumns	List<String>	The list of columns to include for the purpose of building the model. An empty list means all columns that are of type double.
kernelType	KernelType	The kernel and associated parameters to use.
quiet	boolean	Our default SVM library will send regular status reports to System.out. However, it has a static property that can suppress console output. If console output is desired, set to false. This method has side effects in the static variables of libsvm.svm.
svmCacheSizeMB	double	The cache size. Directly maps to the LIBSVM "-m" command line flag.
type	SVMType	The type of SVM model to build. Defaults to SVMTypeOneClass.

Ports

The SVMLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data. This contains both the independent variables and the target variable (if applicable). The target variable only applies to SVMs of type SVMPredictorType.

The SVMLearner operator provides a single output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	Outputs the Support Vector Machine PMML model.

Using the SVMPredictor Operator to Apply a Support Vector Machine Model

The SVMPredictor operator applies a previously built Support Vector Machine model to the input data. This supports either CSVC SVMs or one-class SVMs.

The two cases are distinguished by the target columns returned by PMMLModelSpec.getTargetCols(). If there are zero target columns, the model is assumed to be a one-class SVM. Otherwise, there must be exactly one target column of type TokenTypeConstant.STRING, in which case the model is a CSVC SVM.

For CSVC SVMs, the PMML is expected to contain Support Vector Machines with SupportVectorMachine.getTargetCategory() and SupportVectorMachine.getAlternateTargetCategory() populated. Each SVM is evaluated, adding a vote to either the target category or the alternate target category. The predicted value is the category that receives the most votes.

For one-class SVMs, the target category and alternate target category are ignored. The result is -1 if the SVM evaluates to a number less than zero, or 1 if it evaluates to a number greater than zero.
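The voting rule and the one-class sign rule described above can be sketched in plain Java. This is only an illustration and does not use the DataFlow or PMML classes. The BinarySvm class, the mapping of a positive decision value to the target category, the tie-breaking order, and the handling of a decision value of exactly zero are all assumptions made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SvmVoteSketch {

    /** Hypothetical stand-in for one binary SVM and its already computed decision value. */
    static final class BinarySvm {
        final String targetCategory;
        final String alternateTargetCategory;
        final double decisionValue;

        BinarySvm(String targetCategory, String alternateTargetCategory, double decisionValue) {
            this.targetCategory = targetCategory;
            this.alternateTargetCategory = alternateTargetCategory;
            this.decisionValue = decisionValue;
        }
    }

    /** One-class case: -1 for a negative decision value, 1 otherwise. */
    static int predictOneClass(double decisionValue) {
        // Assumption: the text does not cover exactly zero; this sketch returns 1.
        return decisionValue < 0 ? -1 : 1;
    }

    /** CSVC case: every binary SVM casts one vote; the category with the most votes wins. */
    static String predictByVote(List<BinarySvm> svms) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (BinarySvm svm : svms) {
            // Assumption: a positive decision value votes for the target category.
            String winner = svm.decisionValue > 0 ? svm.targetCategory : svm.alternateTargetCategory;
            votes.merge(winner, 1, Integer::sum);
        }
        String best = null;
        int bestVotes = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            // Assumption: ties go to the category that received its first vote earliest.
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<BinarySvm> svms = List.of(
                new BinarySvm("A", "B", 0.7),
                new BinarySvm("A", "C", 0.2),
                new BinarySvm("B", "C", -0.4));
        System.out.println(predictByVote(svms));
        System.out.println(predictOneClass(-0.3));
    }
}
```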

Note: This operator is non-parallel.

Code Examples

Example Usage of the SVMPredictor Operator in Java

// Create the SVM predictor operator and add it to a graph
SVMPredictor predictor = graph.add(new SVMPredictor());

// Connect the predictor to an input port and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the SVMPredictor operator in RushScript

var results = dr.svmPredictor(learner, data);

Properties

The SVMPredictor operator has no properties.

Ports

The SVMPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data to which the model is applied.
model	PMMLPort	getModel()	SVM model in PMML to apply.

The SVMPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the model to the input data.

Text Processing

Text Processing within DataFlow

Text processing is the process of deriving information from unstructured text. This is accomplished by first structuring the text in a form that can be analyzed and then performing various transformations and statistical techniques on the text.

The DataFlow text processing library provides operators that can perform basic text mining and processing tasks on unstructured text. The primary operator is the TextTokenizer operator, which analyzes the text within a string field of a record and creates an object that represents a structured form of the original text. This TokenizedText object can then be used by a variety of other operators within the library to perform various transformations and statistical analysis on the text.

This section covers each of those operators and provides details on how to use them.

Text Processing Operators

• Using the TextTokenizer Operator to Tokenize Text Strings

• Using the CountTokens Operator to Count Tokens

• Using the FilterText Operator to Filter Tokenized Text

• Using the DictionaryFilter Operator to Filter Based on Dictionaries

• Using the ConvertTextCase Operator to Convert Case

• Using the TextStemmer Operator to Stem Text

• Using the ExpandTextTokens Operator to Expand Text Tokens

• Using the CalculateWordFrequency Operator to Calculate Word Frequencies

• Using the CalculateNGramFrequency Operator to Calculate N-gram Frequencies

• Using the TextFrequencyFilter Operator to Filter Frequencies

• Using the ExpandTextFrequency Operator to Expand Text Frequencies

• Using the GenerateBagOfWords Operator to Expand Text Frequencies

Using the TextTokenizer Operator to Tokenize Text Strings

The TextTokenizer operator tokenizes a string field in the source and produces a field containing a TokenizedText object. The TextTokenizer operator has two main properties that determine the string field in the input that should be tokenized and the object field in the output that will store the encoded TokenizedText object. The contents of the string field will be parsed and tokenized, creating a TokenizedText object that will be encoded into the output field. This TokenizedText object can then be used by downstream operators for further text processing tasks.
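To make the idea of a structured form of the text concrete, the following plain-Java sketch splits a string into sentences and each sentence into words. It is not the TextTokenizer implementation and does not produce a TokenizedText object. The sentence and word patterns are simplifying assumptions chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenizeSketch {

    // Assumption: a word is a run of letters, digits, or apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}']+");

    /** Splits text into sentences, then each sentence into its words. */
    static List<List<String>> tokenize(String text) {
        List<List<String>> sentences = new ArrayList<>();
        // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            List<String> words = new ArrayList<>();
            Matcher m = WORD.matcher(sentence);
            while (m.find()) {
                words.add(m.group());
            }
            if (!words.isEmpty()) {
                sentences.add(words);
            }
        }
        return sentences;
    }

    /** Counts word tokens across all sentences. */
    static int countWords(List<List<String>> sentences) {
        int total = 0;
        for (List<String> sentence : sentences) {
            total += sentence.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> doc = tokenize("Hello world. How are you?");
        System.out.println(doc.size() + " sentences, " + countWords(doc) + " words");
    }
}
```

Keeping the sentence boundaries in the structure is what later allows tokens of different types, such as sentences or words, to be counted or expanded separately.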

Code Example

This example demonstrates using the TextTokenizer operator to tokenize a message field in a record.

Using the TextTokenizer operator in Java

//Create a TextTokenizer operator
TextTokenizer tokenizer = graph.add(new TextTokenizer("messageField"));
tokenizer.setOutputField("messageTokens");

Using the TextTokenizer operator in JavaScript

//Create a TextTokenizer operator
var results = dr.textTokenizer(data, {inputField:"messageField", outputField:"messageTokens"});

Properties

The TextTokenizer operator has the following properties.

Name	Type	Description
inputField	String	The name of the String field to tokenize. If this field does not exist in the input, or is not of type String, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the tokenized text object. Defaults to TokenizedTextField if unspecified.
wordPatterns	List<String>	A list of regular expressions that will be used to find custom word patterns while tokenizing the text.

Ports

The TextTokenizer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the string data that will be tokenized.

The TextTokenizer operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the tokenized text object field produced from the input data.

Using the CountTokens Operator to Count Tokens

The CountTokens operator counts the number of a particular type of token in a TokenizedText field. The CountTokens operator has two main properties that define the input field that contains the tokenized text with tokens to count and the name of the output field that should contain the counts. By default the operator will count the number of word tokens; however, this property can be modified to count any valid TextElementType.

Code Example

This example demonstrates using the CountTokens operator to count the number of sentence tokens in the tokenized text field.

Using the CountTokens operator in Java

//Create a CountTokens operator
CountTokens counter = graph.add(new CountTokens("messageTokens"));
counter.setOutputField("sentenceCount");
counter.setTokenType(TextElementType.SENTENCE);

Using the CountTokens operator in JavaScript

//Create a CountTokens operator
var results = dr.countTokens(data, {inputField:"messageTokens", outputField:"sentenceCount", tokenType:"SENTENCE"});

Properties

The CountTokens operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field with tokens to count. If this field does not exist in the input, or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the count. Defaults to TokenCount if unspecified.
tokenType	TextElementType	The specific type of token to count. Defaults to TextElementType.WORD.

Ports

The CountTokens operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be counted.

The CountTokens operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the token count field produced from the input data.

Using the FilterText Operator to Filter Tokenized Text

The FilterText operator filters a tokenized text field in the source and produces a field containing a filtered TokenizedText object.

The FilterText operator has three properties: input field, output field, and the list of text filters that will be applied to the input. The input field must be a tokenized text object. The tokenized text object will be filtered of all tokens that are specified by the text filters. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.

Available Filters

LengthFilter

Filters all words with a length less than or equal to the specified length.

PunctuationFilter

Filters all standalone punctuation tokens. Will not remove punctuation that is part of a word such as an apostrophe or hyphen.

RegexFilter

Filters all words that match against the supplied regular expression.

TextElementFilter

Filters all text elements in the hierarchy that are higher than the specified element. The default hierarchy is Document, Paragraph, Sentence, Word.

WordFilter

Removes any words in a provided list.

All the available filters have the option of inverting the filter. This has the effect of keeping all the words that pass the filter instead of those that fail, and effectively inverts the output.
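The behavior of the length, punctuation, regular expression, and word filters can be approximated with ordinary Java predicates, applied in list order. This sketch works on a flat list of strings rather than a TokenizedText object, and the class and method names are hypothetical. It only mirrors the filter descriptions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class TextFilterSketch {

    // Each predicate returns true for a token that the filter removes.

    /** Removes words whose length is less than or equal to maxLength. */
    static Predicate<String> lengthFilter(int maxLength) {
        return token -> token.length() <= maxLength;
    }

    /** Removes tokens made only of punctuation; "don't" and "well-known" are kept. */
    static Predicate<String> punctuationFilter() {
        return token -> token.matches("\\p{Punct}+");
    }

    /** Removes tokens that fully match the regular expression. */
    static Predicate<String> regexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return token -> pattern.matcher(token).matches();
    }

    /** Removes tokens that appear in the given word list. */
    static Predicate<String> wordFilter(List<String> words) {
        return words::contains;
    }

    /** Applies the filters one after another, in the order given. */
    static List<String> apply(List<String> tokens, List<Predicate<String>> filters) {
        List<String> current = tokens;
        for (Predicate<String> filter : filters) {
            List<String> next = new ArrayList<>();
            for (String token : current) {
                if (!filter.test(token)) {
                    next.add(token);
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Hello", ",", "a", "<b>", "world", "don't");
        System.out.println(apply(tokens,
                List.of(punctuationFilter(), lengthFilter(1), regexFilter("<[^>]*>"))));
    }
}
```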

Code Example

This example demonstrates using the FilterText operator to filter out XML/HTML tags and punctuation.

Using the FilterText operator in Java

//Create a FilterText operator
FilterText filter = graph.add(new FilterText("messageTokens"));
filter.setOutputField("filteredTokens");
filter.setTextFilters(new PunctuationFilter(),
    new RegexFilter("<(\"[^\"]*\"|'[^']*'|[^'\">])*>"));

Using the FilterText operator in JavaScript

//Create a FilterText operator
var results = dr.filterText(data, {inputField:"messageTokens", outputField:"filteredTokens",
textFilters:[new PunctuationFilter()]});

Properties

The FilterText operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to filter. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the filtered tokenized text object. If unspecified, it will overwrite the original input field.
textFilters	TextFilter[]	The list of text filters to apply to the input data. The filters will be applied in the order they are present in the list.

Ports

The FilterText operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be filtered.

The FilterText operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered tokenized text object field produced from the input data.

Using the DictionaryFilter Operator to Filter Based on Dictionaries

The DictionaryFilter operator filters a tokenized text field in the source based on a dictionary. This will produce a field containing a filtered TokenizedText object in the output.

The DictionaryFilter operator has four properties: input field, output field, dictionary input field, and whether the filter should be inverted. The input field must be a tokenized text object.

The tokenized text object will be filtered of all words that are specified in the dictionary input. This will produce a new tokenized text object that will be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.
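The effect of a dictionary filter on a list of words can be sketched as follows. The sketch uses plain Java collections instead of DataFlow ports and fields. Exact, case-sensitive matching and the reading of the inverted flag as "keep only dictionary words" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class DictionaryFilterSketch {

    /**
     * Removes every word found in the dictionary.
     * With inverted set to true (assumption), only dictionary words are kept instead.
     */
    static List<String> filter(List<String> words, Set<String> dictionary, boolean inverted) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            // Assumption: matching is exact and case-sensitive.
            boolean inDictionary = dictionary.contains(word);
            if (inDictionary == inverted) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "cat", "sat", "on", "the", "mat");
        Set<String> stopWords = Set.of("the", "on");
        System.out.println(filter(words, stopWords, false));
        System.out.println(filter(words, stopWords, true));
    }
}
```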

Code Example

This example demonstrates using the DictionaryFilter operator to filter out stop words.

Using the DictionaryFilter operator in Java

//Create a DictionaryFilter operator
DictionaryFilter filter = graph.add(new DictionaryFilter("messageTokens"));
filter.setOutputField("filteredTokens");
filter.setDictionaryField("dictionary");

Using the DictionaryFilter operator in JavaScript

//Create a DictionaryFilter operator
var results = dr.dictionaryFilter(data, {inputField:"messageTokens",outputField:"filteredTokens",dictionaryField:"dictionary"});

Properties

The DictionaryFilter operator has the following properties.

Name	Type	Description
dictionaryField	String	The name of the dictionary field in the dictionary input.
inputField	String	The name of the tokenized text field to filter. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
inverted	Boolean	Specifies whether the filter must be inverted.
outputField	String	The name of the output field that will contain the filtered tokenized text object. If unspecified, it will overwrite the original input field.

Ports

The DictionaryFilter operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be filtered.
dictInput	RecordPort	getInput()	The dictionary input data containing the words to filter from the tokenized text.

The DictionaryFilter operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered tokenized text object field produced from the input data.

Using the ConvertTextCase Operator to Convert Case

The ConvertTextCase operator performs case conversions on a tokenized text object and produces a field containing the modified tokenized text object. The operator will convert all the characters in the individual text tokens into upper- or lowercase depending on the settings and will produce a new tokenized text object with the specified case conversions applied to each token.

The ConvertTextCase operator has three properties: input field, output field, and the case used for the conversion. The input field must be a tokenized text object, and the output field will similarly be a tokenized text object. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. The new tokenized text object can then be used for further text processing tasks.

Code Example

This example demonstrates using the ConvertTextCase operator to convert all tokens into lowercase.

Using the ConvertTextCase operator in Java

//Create a ConvertTextCase operator
ConvertTextCase converter = graph.add(new ConvertTextCase("messageTokens"));
converter.setOutputField("convertedTokens");
converter.setCaseFormat(Case.LOWER);

Using the ConvertTextCase operator in JavaScript

//Create a ConvertTextCase operator
var results = dr.convertTextCase(data, {inputField:"messageTokens", outputField:"convertedTokens", caseFormat:"LOWER"});

Properties

The ConvertTextCase operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to convert. If this field does not exist in the input, or is not of type TokenizedText, an exception will be thrown at composition time.
outputField	String	The name of the output field that will contain the converted tokenized text object. If unspecified, it will overwrite the original input field.
caseFormat	Case	The case format to use. Can be set to LOWER or UPPER. Default: LOWER.

Ports

The ConvertTextCase operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be converted.

The ConvertTextCase operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the converted tokenized text object field produced from the input data.

Using the TextStemmer Operator to Stem Text

Stemming is the process of removing the more common morphological and inflectional endings from words.

The TextStemmer operator stems a tokenized text field in the source and produces a field containing a stemmed TokenizedText object.

The TextStemmer operator has three properties: input field, output field, and the stemmer to use. The input field must be a tokenized text object. Each of the words in the tokenized text object will be stemmed using the rules defined by the specified stemmer algorithm. This will produce a new tokenized text object with the original words replaced by their stemmed form, which will then be encoded into the output field. If the output field is unspecified, the original input field will be overwritten with the new tokenized text object. This object can then be used for further text processing tasks.

Available Stemmers

The stemmers use the Snowball stemmer algorithms to perform stemming. For more information, visit the Snowball website at http://snowball.tartarus.org/. The available stemmers are:

• Armenian

• Basque

• Catalan

• Danish

• Dutch

• English

• Finnish

• French

• German

• Hungarian

• Irish

• Italian

• Lovins

• Norwegian

• Porter

• Portuguese

• Romanian

• Russian

• Spanish

• Swedish

• Turkish

Code Example

This example demonstrates using the TextStemmer operator to stem a text field with the Porter stemmer algorithm.

Using the TextStemmer operator in Java

//Create a TextStemmer operator
TextStemmer stemmer = graph.add(new TextStemmer("messageTokens"));
stemmer.setOutputField("stemmedTokens");
stemmer.setStemmerType(StemmerType.PORTER);

Using the TextStemmer operator in JavaScript

//Create a TextStemmer operator
var results = dr.textStemmer(data, {inputField:"messageTokens", outputField:"stemmedTokens", stemmerType:"PORTER"});

Properties

The TextStemmer operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to stem. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the stemmed tokenized text object. If unspecified, it will overwrite the original input field.
stemmerType	StemmerType	The stemmer algorithm to apply to the input.

Ports

The TextStemmer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data to be stemmed.

The TextStemmer operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the stemmed tokenized text object field produced from the input data.

Using the ExpandTextTokens Operator to Expand Text Tokens

The ExpandTextTokens operator can be used to expand a tokenized text field. The operator creates a new string field in the output and expands the tokenized text object into it based on the specified token type, producing one output row per token. This expands the original input data, because the row associated with the original tokenized text object is duplicated for every token in the output.

The ExpandTextTokens operator has three properties: input field, output field, and the TextElementType to expand in the output. The input field must be a tokenized text object. If there are no tokens of the specified type contained in the tokenized text object, the string output field will contain null for that row.
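The row expansion described above can be illustrated with a small plain-Java sketch: each input row is repeated once per token, and a row without tokens yields a single output row with a null token. The map-based row representation (row id mapped to its tokens) is a hypothetical simplification and not how DataFlow represents records.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExpandTokensSketch {

    /**
     * Input: row id mapped to the tokens of the requested type for that row.
     * Output: one (row id, token) pair per token; (row id, null) when a row has no tokens.
     */
    static List<Map.Entry<String, String>> expand(Map<String, List<String>> rows) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().isEmpty()) {
                // No tokens of the requested type: keep the row with a null string field.
                out.add(new SimpleEntry<String, String>(row.getKey(), null));
                continue;
            }
            for (String token : row.getValue()) {
                // The remaining row data (here only the id) is copied for every token.
                out.add(new SimpleEntry<String, String>(row.getKey(), token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("row1", List.of("big", "data"));
        rows.put("row2", List.of());
        for (Map.Entry<String, String> e : expand(rows)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```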

Code Example

This example demonstrates using the ExpandTextTokens operator to expand the individual words of the original text into a string field.

Using the ExpandTextTokens operator in Java

//Create an ExpandTextTokens operator
ExpandTextTokens expander = graph.add(new ExpandTextTokens("messageTokens"));
expander.setOutputField("words");
expander.setTokenType(TextElementType.WORD);

Using the ExpandTextTokens operator in JavaScript

//Create an ExpandTextTokens operator
var results = dr.expandTextTokens(data, {inputField:"messageTokens", outputField:"words", tokenType:"WORD"});

Properties

The ExpandTextTokens operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to expand. If this field does not exist in the input, or is not of type TokenizedText, an exception will be thrown at composition time.
outputField	String	The name of the output field that will contain the strings from the expanded tokenized text object. Defaults to TextElementStrings if unspecified.
tokenType	TextElementType	The type of text token to expand from the original tokenized text. Default: TextElementType.WORD.

Ports

The ExpandTextTokens operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be expanded.

The ExpandTextTokens operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the expanded text produced from the input data.

Using the CalculateWordFrequency Operator to Calculate Word Frequencies

The CalculateWordFrequency operator determines the frequency of each word in a TokenizedText field. The CalculateWordFrequency operator has two main properties that define the input field that contains the tokenized text and the output field for the frequency map.

The operator will output a WordMap object that contains the words and their associated frequencies. This object can then be used by other operators such as TextFrequencyFilter or ExpandTextFrequency.
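Conceptually, the word frequency map is a count per distinct word. The following plain-Java sketch shows that counting step on a flat word list. It uses an ordinary Map rather than the WordMap type, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WordFrequencySketch {

    /** Counts how often each word occurs, keeping first-seen order. */
    static Map<String, Integer> wordFrequencies(List<String> words) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String word : words) {
            frequencies.merge(word, 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(wordFrequencies(List.of("to", "be", "or", "not", "to", "be")));
    }
}
```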

Code Example

This example demonstrates using the CalculateWordFrequency operator to determine the frequency of each word in the tokenized text field.

Using the CalculateWordFrequency operator in Java

//Create a CalculateWordFrequency operator
CalculateWordFrequency freqCalc = graph.add(new CalculateWordFrequency("messageTokens"));
freqCalc.setOutputField("wordFrequencies");

Using the CalculateWordFrequency operator in JavaScript

//Create a CalculateWordFrequency operator
var results = dr.calculateWordFrequency(data, {inputField:"messageTokens", outputField:"wordFrequencies"});

Properties

The CalculateWordFrequency operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to calculate the word frequencies for. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the word lists. Defaults to WordFrequency if unspecified.

Ports

The CalculateWordFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to calculate the frequencies.

The CalculateWordFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the word frequency map field produced from the input data.

Using the CalculateNGramFrequency Operator to Calculate N-gram Frequencies

The CalculateNGramFrequency operator determines the frequency of each n-gram in a TokenizedText field. The CalculateNGramFrequency operator has three main properties that define the input field that contains the tokenized text, the output field for the frequency map, and the n that will be used by the calculation. The operator will output an NGramMap object that contains the n-grams and their associated frequencies.

This object can then be used by other operators such as TextFrequencyFilter or ExpandTextFrequency.
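An n-gram frequency count slides a window of n consecutive words over the text and counts each distinct window. The sketch below shows this for a flat word list in plain Java. Representing an n-gram as a space-joined string and ignoring sentence boundaries are assumptions made for the illustration, and the result is an ordinary Map rather than an NGramMap.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NGramSketch {

    /** Counts every run of n consecutive words. */
    static Map<String, Integer> ngramFrequencies(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            // Assumption: an n-gram is written as its words joined by single spaces.
            String gram = String.join(" ", words.subList(i, i + n));
            frequencies.merge(gram, 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(ngramFrequencies(List.of("a", "b", "a", "b", "c"), 2));
    }
}
```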

Code Example

This example demonstrates using the CalculateNGramFrequency operator to determine the frequency of each bigram in the tokenized text field.

Using the CalculateNGramFrequency operator in Java

//Create a CalculateNGramFrequency operator
CalculateNGramFrequency freqCalc = graph.add(new CalculateNGramFrequency("messageTokens"));
freqCalc.setOutputField("ngramFrequencies");
freqCalc.setN(2);

Using the CalculateNGramFrequency operator in JavaScript

//Create a CalculateNGramFrequency operator
var results = dr.calculateNGramFrequency(data, {inputField:"messageTokens", outputField:"ngramFrequencies", n:2});

Properties

The CalculateNGramFrequency operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field to calculate the n-gram frequencies for. If this field does not exist in the input or is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the frequency map. Defaults to NgramFrequency if unspecified.
n	int	The degree of the n-grams that will be used.

Ports

The CalculateNGramFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to calculate the frequencies.

The CalculateNGramFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the n-gram frequency map fields produced from the input data.

Using the TextFrequencyFilter Operator to Filter Frequencies

The TextFrequencyFilter operator filters a list of frequencies produced by the CalculateWordFrequency or CalculateNGramFrequency operators.

The operator has several properties that determine the behavior of the filter. The input field for the frequency maps must be set. Additionally, a minimum threshold, a maximum threshold, or both may be set for the frequencies. The filter then keeps only those absolute frequencies that fall between the minimum and maximum thresholds, inclusive. The total number of top frequencies to keep may also be set. This limit is applied after the threshold check, so if fewer than the specified number of frequencies remain, they are all included in the output.

Any combination of these filtering methods may be applied to the frequencies. If an output field for the filtered frequency maps is unspecified, the original field will be overwritten in the output.
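The order of operations, thresholds first and the top-N limit second, can be sketched in plain Java as shown below. The sketch works on an ordinary Map, passes null for a threshold or limit that is not set, and breaks ties by input order. Those three choices are assumptions and may not match the operator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FrequencyFilterSketch {

    /** Applies the inclusive thresholds first, then keeps the totalNumber highest frequencies. */
    static Map<String, Integer> filter(Map<String, Integer> frequencies,
                                       Integer minThreshold, Integer maxThreshold, Integer totalNumber) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            int count = e.getValue();
            boolean aboveMin = minThreshold == null || count >= minThreshold;
            boolean belowMax = maxThreshold == null || count <= maxThreshold;
            if (aboveMin && belowMax) {
                kept.add(e);
            }
        }
        // Highest frequency first; the sort is stable, so ties keep their input order (assumption).
        kept.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        int limit = totalNumber == null ? kept.size() : Math.min(totalNumber, kept.size());
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : kept.subList(0, limit)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        frequencies.put("the", 5);
        frequencies.put("cat", 2);
        frequencies.put("sat", 1);
        frequencies.put("mat", 3);
        System.out.println(filter(frequencies, 2, null, 2));
    }
}
```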

Code Example

This example demonstrates using the TextFrequencyFilter operator to remove all absolute frequencies below two and keep the top ten remaining frequencies.

Using the TextFrequencyFilter operator in Java

//Create a TextFrequencyFilter operator
TextFrequencyFilter filter = graph.add(new TextFrequencyFilter("FrequencyMap"));
filter.setOutputField("filteredFrequencies");
filter.setMinThreshold(2);
filter.setTotalNumber(10);

Using the TextFrequencyFilter operator in JavaScript

//Create a TextFrequencyFilter operator
var results = dr.textFrequencyFilter(data, {inputField:"FrequencyMap", outputField:"filteredFrequencies", minThreshold:2, totalNumber:10});

Properties

The TextFrequencyFilter operator has the following properties.

Name	Type	Description
inputField	String	The name of the field containing the word or n-gram frequency map in the input that will be filtered. If this field does not exist in the input or is not a WordMap or NGramMap, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the filtered frequency map. If unspecified, it will overwrite the original inputField.
minThreshold	int	The minimum inclusive threshold for frequencies when filtering.
maxThreshold	int	The maximum inclusive threshold for frequencies when filtering.
totalNumber	int	The total number of top frequencies to keep when filtering.

Ports

The TextFrequencyFilter operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the frequency maps that will be filtered.

The TextFrequencyFilter operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the filtered frequency maps produced from the input data.

Using the ExpandTextFrequency Operator to Expand Text Frequencies

The ExpandTextFrequency operator expands a frequency map produced by the CalculateWordFrequency or CalculateNGramFrequency operators. The operator has two optional output fields, one for the text elements and one for the frequencies. If either output field is unspecified, the operator includes only the specified output field in the output. This can be useful if only the text elements or only the frequencies are needed.

As expected, this operator will cause an expansion of the original input data, and any fields that are not directly expanded will simply be copied. If the row does not have a frequency map to expand, the row will simply be copied into the output with a null indicator inserted into the new fields.

The output of the frequencies can additionally be controlled by setting whether the operator should output relative or absolute frequencies.
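The difference between absolute and relative output can be sketched in plain Java. The sketch assumes that a relative frequency is the count divided by the sum of all counts in the same frequency map. The text above does not define this, so treat the formula as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExpandFrequencySketch {

    /** Returns one (text, frequency) pair per map entry, either absolute or relative. */
    static Map<String, Double> expand(Map<String, Integer> frequencies, boolean relative) {
        double total = 0;
        for (int count : frequencies.values()) {
            total += count;
        }
        Map<String, Double> rows = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            // Assumption: relative frequency = count / sum of all counts in this map.
            double value = relative ? e.getValue() / total : e.getValue();
            rows.put(e.getKey(), value);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        frequencies.put("data", 3);
        frequencies.put("flow", 1);
        System.out.println(expand(frequencies, false));
        System.out.println(expand(frequencies, true));
    }
}
```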

Code Example

This example demonstrates using the ExpandTextFrequency operator to expand the word frequencies in the record.

Using the ExpandTextFrequency operator in Java

//Create an ExpandTextFrequency operator
ExpandTextFrequency expander = graph.add(new ExpandTextFrequency("frequencyMap"));
expander.setTextOutputField("words");
expander.setFreqOutputField("frequencies");

Using the ExpandTextFrequency operator in JavaScript

//Create an ExpandTextFrequency operator
var results = dr.expandTextFrequency(data, {inputField:"frequencyMap", textOutputField:"words", freqOutputField:"frequencies"});

Properties

The ExpandTextFrequency operator has the following properties.

Name	Type	Description
textInputField	String	The name of the field containing the word or n-gram list in the input that will be expanded. If this field does not exist in the input or is not a WordMap or NGramMap, an exception will be issued at composition time.
textOutputField	String	The name of the output field that will contain the expanded word list. If unspecified, it will not be included in the output.
freqOutputField	String	The name of the output field that will contain the expanded frequency list. If unspecified, it will not be included in the output.
relative	boolean	Whether absolute or relative frequencies will be output. Default: false for absolute frequencies.

Ports

The ExpandTextFrequency operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the frequency maps that will be expanded.

The ExpandTextFrequency operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the expanded text and frequencies produced from the input data.

Using the GenerateBagOfWords Operator to Expand Text Frequencies

The GenerateBagOfWords operator can be used to determine all of the distinct words in a TokenizedText field. The GenerateBagOfWords operator has two main properties that define the input field containing the tokenized text and the output field for the words.
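Determining the distinct words of a text amounts to collecting the words into a set. The following plain-Java sketch shows this on a flat word list. Keeping first-seen order and comparing words case-sensitively are assumptions, and the class name is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class BagOfWordsSketch {

    /** Returns each distinct word once, in first-seen order. */
    static Set<String> distinctWords(List<String> words) {
        return new LinkedHashSet<>(words);
    }

    public static void main(String[] args) {
        System.out.println(distinctWords(List.of("to", "be", "or", "not", "to", "be")));
    }
}
```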

Code Example

This example demonstrates using the GenerateBagOfWords operator to determine the distinct words in the tokenized text field.

Using the GenerateBagOfWords operator in Java

//Create a GenerateBagOfWords operator
GenerateBagOfWords bow = graph.add(new GenerateBagOfWords("messageTokens"));
bow.setOutputField("words");

Using the GenerateBagOfWords operator in JavaScript

//Create a GenerateBagOfWords operator
var results = dr.generateBagOfWords(data, {inputField:"messageTokens", outputField:"words"});

Properties

The GenerateBagOfWords operator has the following properties.

Name	Type	Description
inputField	String	The name of the tokenized text field for which to generate the bag of words. If this field does not exist in the input or it is not of type TokenizedText, an exception will be issued at composition time.
outputField	String	The name of the output field that will contain the word set. Defaults to Word if unspecified.

Ports

The GenerateBagOfWords operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data with the tokenized text data that will be used to generate the bag of words.

The GenerateBagOfWords operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data including the word set field produced from the input data.

Using the DrawDiagnosticsChart Operator to Draw Diagnostic Charts

To build diagnostic charts, the DrawDiagnosticsChart operator uses confidence values along with the actual target values (true class).

Note: These values are obtained using the input of this operator and output of one or multiple predictors.

The supported chart types are:

ROC chart

Provides a comparison between the true positive rate (y-axis) and false positive rate (x-axis) when the confidence threshold decreases.

Gains chart (CRC, Cumulative Response Chart)

Provides the change in true positive rate (y-axis) when the percentage of the targeted population (x-axis) decreases.

Lift chart

Plots the change in lift (y-axis), which is the ratio between the predictor result and the baseline, as the percentage of the targeted population (x-axis) increases.

The true positive rate is the ratio of the number of correct positive classifications to the total number of positive samples in the test set. The false positive rate is the ratio of the number of incorrect positive classifications to the total number of negative samples in the test set.
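The definitions above can be sketched in plain Java. This is an illustration under stated assumptions, not the operator's implementation: the class DiagnosticMetricsSketch, its Scored type, and the sample data are hypothetical; a record is classified positive when its confidence is at or above the threshold; and the lift baseline is assumed to be random targeting, for which the true positive rate equals the targeted fraction. The data must contain at least one positive and one negative record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of the chart metrics; not the operator's implementation.
class DiagnosticMetricsSketch {

    // One scored record: the predictor confidence and whether the actual
    // target value equals the "true class".
    static final class Scored {
        final double confidence;
        final boolean positive;

        Scored(double confidence, boolean positive) {
            this.confidence = confidence;
            this.positive = positive;
        }
    }

    // Returns {truePositiveRate, falsePositiveRate} when every record with
    // confidence >= threshold is classified as positive (one ROC point).
    static double[] rates(List<Scored> data, double threshold) {
        int tp = 0, fp = 0, pos = 0, neg = 0;
        for (Scored s : data) {
            if (s.positive) pos++; else neg++;
            if (s.confidence >= threshold) {
                if (s.positive) tp++; else fp++;
            }
        }
        return new double[] { (double) tp / pos, (double) fp / neg };
    }

    // Returns {gain, lift} when the top `fraction` (0 < fraction <= 1) of the
    // records, ordered by descending confidence, is targeted.
    static double[] gainAndLift(List<Scored> data, double fraction) {
        List<Scored> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.confidence).reversed());
        int targeted = (int) Math.round(fraction * sorted.size());
        int pos = 0, tp = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).positive) {
                pos++;
                if (i < targeted) tp++;
            }
        }
        double gain = (double) tp / pos; // true positive rate within the targeted group
        // Assumed baseline: random targeting, whose gain equals the fraction.
        return new double[] { gain, gain / fraction };
    }

    public static void main(String[] args) {
        // Hypothetical sample: three positive and three negative records.
        List<Scored> data = Arrays.asList(
            new Scored(0.9, true), new Scored(0.8, true), new Scored(0.7, false),
            new Scored(0.6, true), new Scored(0.4, false), new Scored(0.2, false));
        System.out.println(Arrays.toString(rates(data, 0.65)));
        System.out.println(Arrays.toString(gainAndLift(data, 0.5)));
    }
}
```

Evaluating rates over a decreasing sequence of thresholds gives the points of a ROC curve; evaluating gainAndLift over an increasing sequence of fractions gives the points of the gains and lift curves.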

The DrawDiagnosticsChart operator can accept one to five predictor operators as input sources. Each predictor output must contain a column for the actual target values and confidence values (probability, score) assigned by the given predictor.

Note: This operator is not parallelizable and runs on a single node in cluster mode.

Output Chart

The following images show an example of each chart type.

[Image: example Gains chart (gains.png)]

[Image: example Lift chart (lift.png)]

[Image: example ROC chart (roc.png)]

Code Example

Using the DrawDiagnosticsChart operator in Java

// Create a diagnostics chart drawer with two input ports
DrawDiagnosticsChart drawer = new DrawDiagnosticsChart(2);
drawer.setConfidenceFieldNames(Arrays.asList("NaiveBayesConfidence", "BaseConfidence"));
drawer.setTargetFieldNames(Arrays.asList("target", "target"));
drawer.setChartNames(Arrays.asList("NAIVE BAYES", "BASE"));
drawer.setChartType(ChartType.GAINS);
drawer.setResultSize(10);
drawer.setTargetValue("E");
drawer.setOutputPath("chart.png");

graph.connect(bayesPredictor.getOutput(), drawer.getInput(0));
graph.connect(basePredictor.getOutput(), drawer.getInput(1));

Using the DrawDiagnosticsChart operator in RushScript

// Uses the default of five (optional) input ports; null is passed for each disconnected port.
var drawer = dr.drawDiagnosticsChart("chart", bayesPredictor, null, basePredictor, null, null, {
  confidenceFieldNames: ["NaiveBayesConfidence", null, "BaseConfidence", null, null],
  targetFieldNames: ["target", null, "target", null, null],
  chartNames: ["NAIVE BAYES", null, "BASE", null, null],
  outputPath: "chart.png",
  chartType: com.pervasive.datarush.analytics.viz.ChartType.GAINS,
  resultSize: 10,
  targetValue: "E"
});

Properties

The DrawDiagnosticsChart operator provides the following properties.

Name	Type	Description
resultSize	int	The number of points used to create the chart. The sequence of confusion matrices (one per input data point) is split into the given number of equal-sized slices. The result provides the reduced set that is used for creating the chart.
targetValue	String	A value of the target domain that defines the 'true class'.
chartType	ChartType	The type of chart to draw: ROC, Gains, or Lift.
outputPath	String	The path (local file system or HDFS) of the output file (PNG). This property is optional. If the path is not defined, then the data is not written to any file.
chartNames	List<String>	The predictor names for each input port, which are displayed in the chart legend. Note: For disconnected input ports, enter a null value.
confidenceFieldNames	List<String>	The confidence field names for each input port. For each given field name, a field of numeric data type must exist in the corresponding input port schema. Note: For disconnected input ports, enter a null value.
targetFieldNames	List<String>	The actual target field names for each input port. For each given field name, a field of string data type must exist in the corresponding input port schema. Note: For disconnected input ports, enter a null value.
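The effect of resultSize can be illustrated with a short plain-Java sketch. It is only an approximation of the description above: the class ResultSizeSketch and the method reduce are hypothetical, and keeping the last point of each slice is an assumption, because the text does not state which point represents a slice.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of reducing a sequence of chart points; not the
// operator's implementation.
class ResultSizeSketch {

    // Splits `points` into at most `resultSize` nearly equal-sized slices and
    // keeps the last point of each slice (an assumed choice of representative).
    static <T> List<T> reduce(List<T> points, int resultSize) {
        List<T> reduced = new ArrayList<>();
        int n = points.size();
        int slices = Math.min(resultSize, n);
        for (int i = 1; i <= slices; i++) {
            int end = (int) ((long) i * n / slices); // exclusive end index of slice i
            reduced.add(points.get(end - 1));
        }
        return reduced;
    }
}
```

With ten input points and a resultSize of 5, this sketch keeps every second point, so the chart is drawn from five points instead of ten.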

Ports

The DrawDiagnosticsChart operator provides a configurable number of input ports. The number of ports is set using a constructor argument and defaults to five. Only port 0 is mandatory.

Name	Type	Get Method	Description
input_x	RecordPort	getInput(int)	The predictor data containing a confidence field and target field.

The DrawDiagnosticsChart operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The confusion matrix data and the x-axis and y-axis values for the diagnostic chart. Generally, this port can be left unconnected, in which case the operator acts as a sink operator.

Last modified date: 01/03/2025