Statistical Operators
The DataFlow operator library includes several pre-built operators for statistics and data summarization. For more information, refer to the following topics:
DataQualityAnalyzer Operator
The DataQualityAnalyzer operator is used to evaluate a set of quality tests on an input data set. Those records for which all tests pass are considered "clean" and are thus sent to the clean output. Those records for which any tests fail are considered "dirty" and are thus sent to the dirty output.
This operator also produces a PMML summary model that includes the following statistics:
• totalFrequency: Total number of rows.
• invalidFrequency: Total number of rows for which at least one test involving the given field failed.
• testFailureCounts: Per-test failure counts for each test involving the given field.
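To make the relationship between the clean/dirty split and these statistics concrete, the following is a minimal plain-Java sketch for the tests on a single field. It does not use the DataFlow API: the class QualitySplitSketch, its fields, and the analyze helper are hypothetical names chosen for illustration, and the operator's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-Java illustration only; none of these names belong to the DataFlow API.
class QualitySplitSketch {
    final List<Double> clean = new ArrayList<>();
    final List<Double> dirty = new ArrayList<>();
    int totalFrequency;
    int invalidFrequency;
    final Map<String, Integer> testFailureCounts = new LinkedHashMap<>();

    // Applies every named test to every value of one field.
    static QualitySplitSketch analyze(List<Double> values, Map<String, Predicate<Double>> tests) {
        QualitySplitSketch result = new QualitySplitSketch();
        for (String name : tests.keySet()) {
            result.testFailureCounts.put(name, 0);
        }
        for (Double value : values) {
            result.totalFrequency++;
            boolean failed = false;
            for (Map.Entry<String, Predicate<Double>> test : tests.entrySet()) {
                if (!test.getValue().test(value)) {
                    // Count the failure for this specific test
                    result.testFailureCounts.merge(test.getKey(), 1, Integer::sum);
                    failed = true;
                }
            }
            if (failed) {
                // At least one test failed: the row is "dirty"
                result.invalidFrequency++;
                result.dirty.add(value);
            } else {
                result.clean.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Double>> tests = new LinkedHashMap<>();
        tests.put("length_not_null", v -> v != null);
        tests.put("length_gt_zero", v -> v != null && v > 0);
        QualitySplitSketch r = analyze(Arrays.asList(1.4, null, -0.2, 5.1), tests);
        System.out.println(r.totalFrequency + " rows, " + r.invalidFrequency
                + " invalid, failures per test: " + r.testFailureCounts);
    }
}
```

In this sketch a null value fails both tests, so it adds one to invalidFrequency but one to each of the two per-test counts.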
Using Expressions to Create Quality Metrics
Quality metrics can be specified by using the expression language. Any number of quality metrics can be specified by passing a single expression directly to the DataQualityAnalyzer operator. The syntax of a quality metric expression is:
<predicate expression 1> as <metric name 1>[, <predicate expression 2> as <metric name 2>, ...]
Each expression must be a predicate expression that returns a boolean value. For example, the following expression can be passed directly to the DataQualityAnalyzer, assuming your input has the specified input fields:
class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero
As with field names used elsewhere within expressions, the metric name can be surrounded by back-ticks if it contains non-alphanumeric characters, such as in the expression:
class is not null as `class-not-null`
For more information about syntax and available functions, see the Expression Language.
Code Example
This example demonstrates using the DataQualityAnalyzer operator to ensure the "class" field is non-null and that the petal measurements are greater than zero. This example uses a quality metric expression to specify the metrics to apply to the input data.
Using the DataQualityAnalyzer operator in Java
This example demonstrates using the DataQualityAnalyzer operator by creating QualityTest instances directly.
Using the DataQualityAnalyzer operator in Java
// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
QualityTest test1 = new QualityTest(
    "class_not_null",
    Predicates.notNull("class"));
QualityTest test2 = new QualityTest(
    "length_gt_zero",
    Predicates.gt(FieldReference.value("petal length"), ConstantReference.constant(0)));
QualityTest test3 = new QualityTest(
    "width_gt_zero",
    Predicates.gt(FieldReference.value("petal width"), ConstantReference.constant(0)));
dqa.setTests(Arrays.asList(test1, test2, test3));
Using the DataQualityAnalyzer operator in RushScript
var results = dr.dataQualityAnalyzer(data, {tests:'class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero'});
Properties
The DataQualityAnalyzer operator provides the following properties.
Ports
The DataQualityAnalyzer operator provides a single input port.
The DataQualityAnalyzer operator provides the following output ports.
SummaryStatistics Operator
The SummaryStatistics operator discovers various metrics of an input data set based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.
If the detail level is SINGLE_PASS_ONLY_SIMPLE, the following statistics are calculated.
If the detail level is SINGLE_PASS_ONLY, all of the statistics that are calculated for SINGLE_PASS_ONLY_SIMPLE are calculated. In addition, the following are also calculated.
If the detail level is MULTI_PASS, all of the statistics that are calculated for SINGLE_PASS_ONLY are calculated. In addition, the following are also calculated.
IMPORTANT! The correct data type must be selected to avoid overflows. If overflows occur, try increasing the size of the data type from float to double or double to numeric.
Code Example
This example calculates summary statistics for the Iris data set. The SummaryStatistics operator produces a PMML model containing summary statistics. This example writes the PMML to a file. It also obtains an in-memory reference to the statistics and prints selected values for each numeric field.
Using the SummaryStatistics operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.pmml.PMMLModel;
import com.pervasive.datarush.analytics.pmml.PMMLPort;
import com.pervasive.datarush.analytics.pmml.WritePMML;
import com.pervasive.datarush.analytics.stats.DetailLevel;
import com.pervasive.datarush.analytics.stats.PMMLSummaryStatisticsModel;
import com.pervasive.datarush.analytics.stats.SummaryStatistics;
import com.pervasive.datarush.analytics.stats.UnivariateStats;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.model.GetModel;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;
public class IrisSummaryStats {
    public static void main(String[] args) {
        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("SummaryStats");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
            DOUBLE("sepal length"),
            DOUBLE("sepal width"),
            DOUBLE("petal length"),
            DOUBLE("petal width"),
            STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Run summary statistics on the data
        SummaryStatistics summaryStats = graph.add(new SummaryStatistics());
        summaryStats.setDetailLevel(DetailLevel.MULTI_PASS);
        summaryStats.setShowTopHowMany(25);
        graph.connect(reader.getOutput(), summaryStats.getInput());

        // Use the GetModel operator to obtain a reference to the statistics model.
        // This reference is valid after the graph is run and can be used to then
        // access the statistics model outside of the graph.
        GetModel<PMMLModel> modelOp = graph.add(new GetModel<PMMLModel>(PMMLPort.FACTORY));
        graph.connect(summaryStats.getOutput(), modelOp.getInput());

        // Write the PMML generated by summary stats
        WritePMML pmmlWriter = graph.add(new WritePMML("results/iris-summarystats.pmml"));
        graph.connect(summaryStats.getOutput(), pmmlWriter.getModel());

        // Compile and run the graph
        graph.run();

        // Use the model reference to get the actual stats model
        PMMLSummaryStatisticsModel statsModel = (PMMLSummaryStatisticsModel) modelOp.getModel();

        // Print out stats for the numeric fields
        for (String fieldName : new String[] {"sepal length", "sepal width", "petal length", "petal width"}) {
            UnivariateStats fieldStats = statsModel.getFieldStats(fieldName);
            System.out.println("Field: " + fieldName);
            System.out.println(" frequency = " + fieldStats.getTotalFrequency());
            System.out.println(" missing = " + fieldStats.getMissingFrequency());
            System.out.println(" min = " + fieldStats.getNumericInfo().getMin());
            System.out.println(" max = " + fieldStats.getNumericInfo().getMax());
            System.out.println(" mean = " + fieldStats.getNumericInfo().getMean());
            System.out.println(" stddev = " + fieldStats.getNumericInfo().getStddev());
        }
    }
}
Using the SummaryStatistics operator in RushScript
var results = dr.summaryStatistics(data, {includedFields:"sepal length", detailLevel:DetailLevel.MULTI_PASS});
Properties
The SummaryStatistics operator has the following properties.
Ports
The SummaryStatistics operator provides a single input port.
The SummaryStatistics operator provides a single output port.
DistinctValues Operator
The DistinctValues operator calculates the distinct values of the given input field. This produces a record consisting of the input field with only the distinct values, and a count field with the number of occurrences of each value.
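The shape of this output can be pictured with a small plain-Java sketch that pairs each distinct value with its occurrence count. It does not use the DataFlow API; the class and method names are hypothetical, the sketch assumes non-null values, and the sorted ordering comes from the TreeMap used here rather than from anything the operator guarantees.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not part of the DataFlow API.
class DistinctCountSketch {
    // Returns each distinct non-null value with its number of occurrences.
    static Map<String, Long> distinctCounts(List<String> values) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> classes = Arrays.asList("Iris-setosa", "Iris-virginica", "Iris-setosa");
        // prints {Iris-setosa=2, Iris-virginica=1}
        System.out.println(distinctCounts(classes));
    }
}
```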
Code Example
This example calculates the number of distinct types of iris present in the data set.
Using the DistinctValues operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.stats.DistinctValues;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;
/**
 * Determine all distinct classes of iris.
 */
public class DistinctIris {
    public static void main(String[] args) {
        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("DistinctValues");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
            DOUBLE("sepal length"),
            DOUBLE("sepal width"),
            DOUBLE("petal length"),
            DOUBLE("petal width"),
            STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Initialize the DistinctValues operator
        DistinctValues distinct = graph.add(new DistinctValues());
        distinct.setInputField("class");

        // Connect the reader to distinct
        graph.connect(reader.getOutput(), distinct.getInput());

        // Write the distinct values for the class field
        WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-distinct.txt", WriteMode.OVERWRITE));
        writer.setFieldSeparator(",");
        writer.setFieldDelimiter("");
        writer.setHeader(true);
        writer.setWriteSingleSink(true);

        // Connect distinct to the writer
        graph.connect(distinct.getOutput(), writer.getInput());

        // Compile and run the graph
        graph.run();
    }
}
Using the DistinctValues operator in RushScript
var results = dr.distinctValues(data, {inputField:"class"});
Properties
The DistinctValues operator provides the following properties.
Ports
The DistinctValues operator provides a single input port.
The DistinctValues operator provides a single output port.
NormalizeValues Operator
The NormalizeValues operator applies normalization methods to fields within an input data flow. The results of the normalization methods are available in the output flow. All input fields are present in the output with the addition of the calculated normalizations.
Normalization methods require certain statistics about the input data such as the mean, standard deviation, minimum value, maximum value, and so on. These statistics are captured in a PMMLModel. The statistics can be gathered by an upstream operator such as SummaryStatistics and passed into this operator. If not, they will be calculated with a first pass over the data and then applied in a second pass.
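The two-pass idea can be sketched in plain Java for the z-score method: one pass gathers the mean and standard deviation, and a second pass applies them. This is an illustration outside the DataFlow API with hypothetical names, and it assumes the sample standard deviation (dividing by n - 1); whether the operator uses the sample or the population form is not stated here.

```java
import java.util.Arrays;

// Plain-Java illustration only; not part of the DataFlow API.
class ZScoreSketch {
    // Assumes at least two values and a non-zero standard deviation.
    static double[] zScores(double[] values) {
        int n = values.length;

        // First pass: gather the statistics the method needs
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (n - 1 in the denominator)
        double stddev = Math.sqrt(squaredDeviations / (n - 1));

        // Second pass: apply the normalization to every value
        double[] normalized = new double[n];
        for (int i = 0; i < n; i++) {
            normalized[i] = (values[i] - mean) / stddev;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // prints [-1.0, 0.0, 1.0]
        System.out.println(Arrays.toString(zScores(new double[] {1.0, 2.0, 3.0})));
    }
}
```

When the statistics are supplied by an upstream operator, only the second loop is needed, which is why passing in an existing model avoids the extra pass.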
Code Example
This example normalizes the Iris data set using the z-score method.
Using the NormalizeValues operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import static com.pervasive.datarush.analytics.functions.StatsFunctions.NormalizeMethod.ZSCORE;
import com.pervasive.datarush.analytics.stats.NormalizeValues;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;
/**
 * Compute the normalized z-score values for the iris data set.
 */
public class NormalizeIris {
    public static void main(String[] args) {
        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("Normalize");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
            DOUBLE("sepal length"),
            DOUBLE("sepal width"),
            DOUBLE("petal length"),
            DOUBLE("petal width"),
            STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Initialize the NormalizeValues operator
        NormalizeValues norm = graph.add(new NormalizeValues());
        norm.setScoreFields("sepal length", "sepal width", "petal length", "petal width");
        norm.setMethod(ZSCORE);

        // Connect the reader to normalize
        graph.connect(reader.getOutput(), norm.getInput());

        // Write the normalized data
        WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-zscore.txt", WriteMode.OVERWRITE));
        writer.setFieldSeparator(",");
        writer.setFieldDelimiter("");
        writer.setHeader(true);
        writer.setWriteSingleSink(true);

        // Connect normalize to the writer
        graph.connect(norm.getOutput(), writer.getInput());

        // Compile and run the graph
        graph.run();
    }
}
Using the NormalizeValues operator in RushScript
var scoreFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.normalizeValues(data, {scoreFields:scoreFields, method:NormalizeMethod.ZSCORE});
Properties
The NormalizeValues operator provides the following properties.
Ports
The NormalizeValues operator provides the following input ports.
The NormalizeValues operator provides a single output port.
Rank Operator
The Rank operator is used to rank data using the given rank mode. The data is grouped by the given partition fields and is sorted within the grouping by the ranking fields. An example is to rank employees by salary per department. To rank the highest to lowest salary within department: partition by the department and rank by the salary in descending sort order.
Three different rank modes are supported:
STANDARD
Also known as competition ranking. Items with the same ranking values receive the same rank, and a gap is then left in the ranking numbers. For example: 1, 2, 2, 4
DENSE
Items that compare as equal receive the same ranking. Items following those receive the next ordinal ranking (that is, ranks are not skipped). For example: 1, 2, 2, 3
ORDINAL
Each item receives a distinct ranking, starting at one and increasing by one, producing essentially a row number within the partition. For example: 1, 2, 3, 4
A new output field is created to contain the result of the ranking. The field is named "rank" by default.
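The difference between the three modes can be seen in a short plain-Java sketch that ranks the already sorted values of one partition. It is not the operator's implementation: the class RankModesSketch and its Mode enum are hypothetical stand-ins used only to illustrate the numbering.

```java
import java.util.Arrays;

// Plain-Java illustration only; not part of the DataFlow API.
class RankModesSketch {
    enum Mode { STANDARD, DENSE, ORDINAL }

    // Values must already be sorted in ranking order within one partition.
    static int[] rank(double[] sorted, Mode mode) {
        int[] ranks = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            boolean tie = i > 0 && sorted[i] == sorted[i - 1];
            if (i == 0) {
                ranks[i] = 1;
            } else if (mode == Mode.ORDINAL) {
                ranks[i] = i + 1;            // always a row number
            } else if (tie) {
                ranks[i] = ranks[i - 1];     // equal values share a rank
            } else if (mode == Mode.STANDARD) {
                ranks[i] = i + 1;            // leave a gap after ties
            } else {
                ranks[i] = ranks[i - 1] + 1; // DENSE: no gap after ties
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] values = {10.0, 20.0, 20.0, 30.0};
        System.out.println(Arrays.toString(rank(values, Mode.STANDARD))); // [1, 2, 2, 4]
        System.out.println(Arrays.toString(rank(values, Mode.DENSE)));    // [1, 2, 2, 3]
        System.out.println(Arrays.toString(rank(values, Mode.ORDINAL)));  // [1, 2, 3, 4]
    }
}
```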
Code Example
In this example we use the Rank operator to order the Iris data set by the "sepal length" field, partitioning by the "class" field.
Using the Rank operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.stats.Rank;
import com.pervasive.datarush.analytics.stats.Rank.RankMode;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;
/**
 * Rank the iris data set by sepal length, partition by class
 */
public class RankIris {
    public static void main(String[] args) {
        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("Rank");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
            DOUBLE("sepal length"),
            DOUBLE("sepal width"),
            DOUBLE("petal length"),
            DOUBLE("petal width"),
            STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Initialize the Rank operator
        Rank rank = graph.add(new Rank());
        rank.setPartitionBy("class");
        rank.setRankBy("sepal length");
        rank.setMode(RankMode.STANDARD);

        // Connect the reader to rank
        graph.connect(reader.getOutput(), rank.getInput());

        // Write the data with the additional rank field
        WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-rank.txt", WriteMode.OVERWRITE));
        writer.setFieldSeparator(",");
        writer.setFieldDelimiter("");
        writer.setHeader(true);
        writer.setWriteSingleSink(true);

        // Connect rank to the writer
        graph.connect(rank.getOutput(), writer.getInput());

        // Compile and run the graph
        graph.run();
    }
}
Using the Rank operator in RushScript
var results = dr.rank(data, {partitionBy:'class', rankBy:'"sepal length" desc', mode:RankMode.STANDARD});
Properties
The Rank operator provides the following properties.
Ports
The Rank operator provides a single input port.
The Rank operator provides a single output port.
SumOfSquares Operator
The SumOfSquares operator computes the sum of squares for the given fields in the input data. The inner products are calculated in a distributed fashion with a reduction at the end to produce the sum of squares matrix. Note that all the fields must be of type double or be assignable to a double type.
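As a rough picture of the calculation, the plain-Java sketch below accumulates the products of every pair of fields over a set of rows and shows how two partial matrices can be reduced by element-wise addition. The names are hypothetical, and the sketch assumes raw (uncentered) products; it is an interpretation of the description above, not the operator's implementation.

```java
import java.util.Arrays;

// Plain-Java illustration only; not part of the DataFlow API.
class SumOfSquaresSketch {
    // Entry [i][j] is the sum over all rows of field i times field j.
    static double[][] sumOfSquares(double[][] rows) {
        int k = rows.length == 0 ? 0 : rows[0].length;
        double[][] matrix = new double[k][k];
        for (double[] row : rows) {
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    matrix[i][j] += row[i] * row[j];
                }
            }
        }
        return matrix;
    }

    // Reduction step: partial matrices from separate chunks of data are added.
    static double[][] combine(double[][] a, double[][] b) {
        double[][] sum = new double[a.length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a.length; j++) {
                sum[i][j] = a[i][j] + b[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] partial1 = sumOfSquares(new double[][] {{1, 2}});
        double[][] partial2 = sumOfSquares(new double[][] {{3, 4}});
        // prints [[10.0, 14.0], [14.0, 20.0]]
        System.out.println(Arrays.deepToString(combine(partial1, partial2)));
    }
}
```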
Code Example
The following example demonstrates computing the sum of squares matrix over three double fields.
Using the SumOfSquares operator in Java
// Calculate the Sum of Squares
SumOfSquares sos = graph.add(new SumOfSquares());
sos.setFieldNames(Arrays.asList(new String[]{"dblfield1", "dblfield2", "dblfield3"}));
Using the SumOfSquares operator in RushScript
var results = dr.sumOfSquares(data, {
fieldNames:['dblfield1', 'dblfield2', 'dblfield3']});
Properties
The SumOfSquares operator provides the following properties.
Ports
The SumOfSquares operator provides a single input port.
The SumOfSquares operator provides a single output port.
CountRanges Operator
The CountRanges operator determines which range bin each value in the input data set falls into and calculates the total number of data values that fall within each range.
The operation is defined by a list of breakpoints that are used as the boundaries for the ranges. A list of n breaks defines n+1 range groups, which are indexed beginning with 1. The range groups are automatically sorted in ascending order, based on the comparable interface of the field, and the entire range of possible values is always considered, so the first and last groups are each unbounded on one side.
The closure of the range intervals can be adjusted by enabling a closed lower or upper bound, which includes values equal to a boundary in the respective interval. A value can be included in only one range group, so the lower and upper bounds cannot both be closed. Any value that is not included in any range group, such as a null or an excluded boundary value, is placed in group 0.
A new output field is created to contain the range group of the specified field. The field is named after the original field with "_RangeGroup" appended to the name. The statistics output of this operator outputs the counts of the defined range groups. This output includes two fields, the range group index and the total number of values within that group.
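One way to read the range group rules is the plain-Java sketch below, which maps a single value to its group index. It is an interpretation of the description with hypothetical names, not the operator's code, and it assumes the breaks are sorted in ascending order and distinct.

```java
import java.util.Arrays;

// Plain-Java illustration only; not part of the DataFlow API.
class RangeGroupSketch {
    // Returns the 1-based range group for a value, or 0 if it belongs to no group.
    static int rangeGroup(Double value, double[] breaks, boolean lowerClosed, boolean upperClosed) {
        if (lowerClosed && upperClosed) {
            throw new IllegalArgumentException("Lower and upper bounds cannot both be closed");
        }
        if (value == null) {
            return 0; // nulls fall outside every group
        }
        int pos = Arrays.binarySearch(breaks, value);
        if (pos < 0) {
            // Not on a boundary: binarySearch returns -(insertionPoint) - 1,
            // and the group index is insertionPoint + 1
            return -pos;
        }
        if (lowerClosed) {
            return pos + 2; // the group whose lower bound is breaks[pos]
        }
        if (upperClosed) {
            return pos + 1; // the group whose upper bound is breaks[pos]
        }
        return 0; // boundary value excluded from both neighboring groups
    }

    public static void main(String[] args) {
        double[] breaks = {1.0, 2.0, 3.0};
        System.out.println(rangeGroup(1.5, breaks, false, false)); // 2
        System.out.println(rangeGroup(2.0, breaks, false, true));  // 2
        System.out.println(rangeGroup(2.0, breaks, true, false));  // 3
        System.out.println(rangeGroup(2.0, breaks, false, false)); // 0
    }
}
```

With three breaks the sketch yields four groups, matching the n+1 rule: values below 1.0 land in group 1 and values above 3.0 land in group 4.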
Code Example
In this example, we use the CountRanges operator to count values of the Iris data set and bin the values of the "petal length" field.
Using the CountRanges operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("CountRangesIris");
//Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
RecordTokenType irisType = record(DOUBLE("sepal length"), DOUBLE("sepal width"), DOUBLE("petal length"), DOUBLE("petal width"), STRING("class"));
reader.setSchema(TextRecord.convert(irisType));
//To ensure sort order is preserved
AssertSorted assertSort = graph.add(new AssertSorted());
assertSort.setOrdering("petal length");
graph.connect(reader.getOutput(), assertSort.getInput());
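//Initialize CountRanges Operator
// fieldName, breaks, lowerClosed, and upperClosed are assumed to be defined by the caller;
// for this example fieldName would be "petal length"
CountRanges countRanges = graph.add(new CountRanges());
countRanges.setFieldName(fieldName);
countRanges.setBreaks(breaks);
countRanges.setLowerBoundClosed(lowerClosed);
countRanges.setUpperBoundClosed(upperClosed);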
//Connect the reader to CountRanges
graph.connect(assertSort.getOutput(), countRanges.getInput());
// write the data with CountRanges
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-CountRanges.txt", WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
String statsOutputPath = "results/iris-CountRanges-stats.txt";
WriteDelimitedText statswriter = graph.add(new WriteDelimitedText(statsOutputPath, WriteMode.OVERWRITE));
statswriter.setFieldDelimiter("");
statswriter.setHeader(true);
statswriter.disableParallelism();
//Connect CountRanges to the writer
graph.connect(countRanges.getOutput(), writer.getInput());
graph.connect(countRanges.getStatsOutput(), statswriter.getInput());
// Compile and run the graph
graph.run();
Using the CountRanges operator in RushScript
var breaks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
var results = dr.countRanges(data, {fieldName:"petal length", upperBoundClosed:true, breaks:breaks})
Properties
The CountRanges operator provides the following properties.
Ports
The CountRanges operator provides a single input port.
The CountRanges operator provides two output ports.
EqualRangeBinning Operator
The EqualRangeBinning operator determines which equally ranged bin each numeric value should be placed in within a given total range of values. The output of the operator adds a single integer field containing the bin that the selected field falls within, ranging from 1 to n, where n is the total number of bins defined by the user.
The desired number of bins must be specified, although the lower or upper bound may optionally be omitted. If a bound is not defined, the operator determines an appropriate value for the missing bound based on the minimum and maximum values discovered in the data at run time. Null values and values outside the inclusion range defined by the bounds are considered outliers and are either filtered from the data or, optionally, included in bin 0.
Additionally, the lower and upper bound for each individual bin can optionally be output with the bin values as two additional fields so that the ranges for each bin are available. The ranges for each bin consist of an open lower bound and a closed upper bound for all bins except the maximum bin, which also has an open upper bound. Any outlier values will always be contained in bin 0 if included.
Code Example
In this example we are binning one of the lengths available in the iris data set.
Using the EqualRangeBinning operator in Java
// Create the binning operator and add it to the graph
EqualRangeBinning equalRangeBinner = graph.add(new EqualRangeBinning("sepal length", 4));
equalRangeBinner.setLowerBound(5.0);
equalRangeBinner.setUpperBound(7.0);
Using the EqualRangeBinning operator in RushScript
var equalRangeBinner = dr.equalRangeBinning(data, {fieldName:'sepal length', binCount:4, lowerBound:5.0, upperBound:7.0});
Properties
The EqualRangeBinning operator provides the following properties.
Ports
The EqualRangeBinning operator provides a single input port.
The EqualRangeBinning operator provides a single output port.
MostFrequentValues Operator
The MostFrequentValues operator is used to determine which values are the most frequent within the selected fields of the input data. A maximum must be specified to indicate how many of the most common values should be output.
The output contains two fields for each field selected from the input: a value field holding the most frequent values of that input field, and an associated field that contains the frequency count of each value.
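For a single field, the value and count pairs described above could be derived as in the following plain-Java sketch. It does not use the DataFlow API; the names are hypothetical, and breaking ties by value is a choice made for this illustration rather than documented operator behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; not part of the DataFlow API.
class TopValuesSketch {
    // Returns up to n (value, count) pairs, most frequent first.
    static List<Map.Entry<String, Long>> topValues(List<String> values, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken by value (an assumption of this sketch)
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "a", "c", "b", "a");
        for (Map.Entry<String, Long> entry : topValues(values, 2)) {
            System.out.println(entry.getKey() + " occurs " + entry.getValue() + " times");
        }
    }
}
```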
Code Example
In this example we use the MostFrequentValues operator to find the top 5 values in each of the numeric fields in the Iris data set.
Using the MostFrequentValues operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("mostFreqValuesIris");
//Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText(getResourcePath("iris.txt")));
reader.setFieldSeparator(" ");
reader.setHeader(true);
RecordTokenType irisType = record(DOUBLE("sepal length"), DOUBLE("sepal width"), DOUBLE("petal length"), DOUBLE("petal width"), STRING("class"));
reader.setSchema(TextRecord.convert(irisType));
// Initialize MostFrequentValues operator
// fieldNames, topNum, and outputPath are assumed to be defined by the caller
MostFrequentValues mostFreq = graph.add(new MostFrequentValues());
if (fieldNames.length > 0) {
    mostFreq.setFieldNames(fieldNames);
}
if (topNum >= 0) {
    mostFreq.setShowTopHowMany(topNum);
}
//Connect the reader to MostFrequentValues
graph.connect(reader.getOutput(), mostFreq.getInput());
// write the data with MostFrequentValues
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
writer.setSaveMetadata(false);
//Connect MostFrequentValues to the writer
graph.connect(mostFreq.getOutput(), writer.getInput());
// Compile and run the graph
graph.run();
Using the MostFrequentValues operator in RushScript
var freqFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.mostFrequentValues(data, {fieldNames:freqFields, showTopHowMany:5})
Properties
The MostFrequentValues operator provides the following properties.
Ports
The MostFrequentValues operator provides a single input port.
The MostFrequentValues operator provides a single output port.