Statistical Operators
The DataFlow operator library includes several pre-built operators for statistics and data summarization. For more information, refer to the following topics:
DataQualityAnalyzer Operator
The DataQualityAnalyzer operator is used to evaluate a set of quality tests on an input data set. Those records for which all tests pass are considered "clean" and are thus sent to the clean output. Those records for which any tests fail are considered "dirty" and are thus sent to the dirty output.
This operator also produces a PMML summary model that includes the following statistics:
totalFrequency: Total number of rows.
invalidFrequency: Total number of rows for which at least one test involving the given field failed.
testFailureCounts: Per-test failure counts for each test involving the given field.
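The routing rule and the summary counters described above can be illustrated with a small plain-Java sketch. It does not use the DataFlow API: the class QualitySplitSketch, its fields, and the representation of a record as a Map are hypothetical names chosen for illustration, and the counters are kept for the whole data set rather than per field to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-Java illustration only; this is not the DataFlow operator.
final class QualitySplitSketch {
    final List<Map<String, Object>> clean = new ArrayList<>();
    final List<Map<String, Object>> dirty = new ArrayList<>();
    final Map<String, Integer> testFailureCounts = new LinkedHashMap<>();
    int totalFrequency = 0;    // total number of rows seen
    int invalidFrequency = 0;  // rows for which at least one test failed

    // Applies every named test to every row and routes the row accordingly.
    void analyze(List<Map<String, Object>> rows,
                 Map<String, Predicate<Map<String, Object>>> tests) {
        for (String name : tests.keySet()) {
            testFailureCounts.put(name, 0);
        }
        for (Map<String, Object> row : rows) {
            totalFrequency++;
            boolean allPassed = true;
            for (Map.Entry<String, Predicate<Map<String, Object>>> t : tests.entrySet()) {
                if (!t.getValue().test(row)) {
                    allPassed = false;
                    testFailureCounts.merge(t.getKey(), 1, Integer::sum);
                }
            }
            if (allPassed) {
                clean.add(row);   // all tests passed
            } else {
                dirty.add(row);   // at least one test failed
                invalidFrequency++;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, Object>>> tests = new LinkedHashMap<>();
        tests.put("class_not_null", r -> r.get("class") != null);
        Map<String, Object> row = new HashMap<>();
        row.put("class", null);
        QualitySplitSketch sketch = new QualitySplitSketch();
        sketch.analyze(Arrays.asList(row), tests);
        System.out.println("dirty rows: " + sketch.dirty.size());
    }
}
```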
Using Expressions to Create Quality Metrics
Quality metrics can be specified by using the expression language. Any number of quality metrics can be specified by passing a single expression directly to the DataQualityAnalyzer operator. The syntax of a quality metric expression is:
<predicate expression 1> as <metric name 1>[, <predicate expression 2> as <metric name 2>, ...]
Each expression must be a predicate expression that returns a boolean value. For example, the following expression can be passed directly to the DataQualityAnalyzer, assuming your input has the specified input fields:
class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero
As with field names used elsewhere within expressions, the metric name can be surrounded by back-ticks if it contains non-alphanumeric characters, such as in the expression:
class is not null as `class-not-null`
For more information about syntax and available functions, see the Expression Language.
Code Example
This example demonstrates using the DataQualityAnalyzer operator to ensure the "class" field is non-null and that the petal measurements are greater than zero. This example uses a quality metric expression to specify the metrics to apply to the input data.
Using the DataQualityAnalyzer operator in Java
// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
String qualityTests =
    "class is not null as class_not_null, " +
    "`petal length` > 0 as length_gt_zero, " +
    "`petal width` > 0 as width_gt_zero";
dqa.setTests(qualityTests);
This example demonstrates using the DataQualityAnalyzer operator by creating QualityTest instances directly.
Using the DataQualityAnalyzer operator in Java
// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
QualityTest test1 = new QualityTest(
                "class_not_null", 
                Predicates.notNull("class"));
QualityTest test2 = new QualityTest(
                "length_gt_zero", 
                Predicates.gt(FieldReference.value("petal length"), ConstantReference.constant(0)));
QualityTest test3 = new QualityTest(
                "width_gt_zero", 
                Predicates.gt(FieldReference.value("petal width"), ConstantReference.constant(0)));
dqa.setTests(Arrays.asList(test1, test2, test3));
Using the DataQualityAnalyzer operator in RushScript
var results = dr.dataQualityAnalyzer(data, {tests:'class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero'});
Properties
The DataQualityAnalyzer operator provides the following properties.
Name
Type
Description
tests
The set of tests to apply to the input data set. The quality tests can be specified using an expression (String) or as a list of QualityTest instances.
Ports
The DataQualityAnalyzer operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data set to be tested.
The DataQualityAnalyzer operator provides the following output ports.
Name
Type
Get Method
Description
clean
getClean()
The output port for the "clean" rows.
dirty
getDirty()
The output port for the "dirty" rows.
model
getModel()
The output port for the PMML statistics model.
SummaryStatistics Operator
The SummaryStatistics operator discovers various metrics of an input data set based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.
If the detail level is SINGLE_PASS_ONLY_SIMPLE, the following statistics are calculated.
| Statistic | Description | Field Types |
| --- | --- | --- |
| Missing count | Number of missing values per field | all |
| Min | The minimum value per field | all |
| Max | The maximum value per field | all |
| Mean | The mean value per field | int, long, float, double, numeric |
| Stddev | The standard deviation per field | int, long, float, double, numeric |
| Variance | The variance per field | int, long, float, double, numeric |
| Sum | The sum per field | int, long, float, double, numeric |
| Sum of squares | The sum of squares per field | int, long, float, double, numeric |
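The statistics at this level can all be derived from running totals gathered in one pass over a field. The following plain-Java sketch illustrates that idea; it does not use the DataFlow API, the class name SinglePassStatsSketch is hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the denominator used by the operator is not stated here.

```java
// Plain-Java illustration only; this is not the DataFlow operator.
final class SinglePassStatsSketch {
    long count = 0;      // number of non-missing values
    long missing = 0;    // number of missing (null) values
    double sum = 0.0;
    double sumOfSquares = 0.0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    // Folds one value into the running totals; null counts as missing.
    void accept(Double value) {
        if (value == null) {
            missing++;
            return;
        }
        count++;
        sum += value;
        sumOfSquares += value * value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    double mean() {
        return sum / count;
    }

    // Sample variance derived from the sum and the sum of squares (assumed n - 1 denominator).
    double variance() {
        return (sumOfSquares - sum * sum / count) / (count - 1);
    }

    double stddev() {
        return Math.sqrt(variance());
    }

    public static void main(String[] args) {
        SinglePassStatsSketch stats = new SinglePassStatsSketch();
        for (Double v : new Double[] {5.1, 4.9, null, 4.7}) {
            stats.accept(v);
        }
        System.out.println("mean = " + stats.mean() + ", missing = " + stats.missing);
    }
}
```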
If the detail level is SINGLE_PASS_ONLY, all of the statistics that are calculated for SINGLE_PASS_ONLY_SIMPLE are calculated. In addition, the following are also calculated.
| Statistic | Description | Field Types |
| --- | --- | --- |
| Correlation | A matrix is produced where the elements correspond to correlation of pairs of fields. | int, long, float, double, numeric |
| Covariance | A matrix is produced where the elements correspond to covariance of pairs of fields. | int, long, float, double, numeric |
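To make the shape of these matrices concrete, the following plain-Java sketch computes the sample covariance and Pearson correlation for every pair of columns. It is only an illustration: the class CorrelationSketch is hypothetical, it uses the textbook two-pass formulas for readability rather than a single-pass accumulation, and the n - 1 denominator is an assumption.

```java
// Plain-Java illustration only; this is not the DataFlow operator.
final class CorrelationSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    // Sample covariance of two equally long columns (assumed n - 1 denominator).
    static double covariance(double[] x, double[] y) {
        double meanX = mean(x);
        double meanY = mean(y);
        double acc = 0.0;
        for (int i = 0; i < x.length; i++) {
            acc += (x[i] - meanX) * (y[i] - meanY);
        }
        return acc / (x.length - 1);
    }

    // Pearson correlation: covariance scaled by both standard deviations.
    static double correlation(double[] x, double[] y) {
        return covariance(x, y) / Math.sqrt(covariance(x, x) * covariance(y, y));
    }

    // Element [i][j] holds the correlation of column i with column j.
    static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] matrix = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                matrix[i][j] = correlation(columns[i], columns[j]);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] columns = {{1, 2, 3, 4}, {2, 4, 6, 8}};
        System.out.println(correlationMatrix(columns)[0][1]);
    }
}
```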
If the detail level is MULTI_PASS, all of the statistics that are calculated for SINGLE_PASS_ONLY are calculated. In addition, the following are also calculated.
| Statistic | Description | Field Types |
| --- | --- | --- |
| Intervals | Includes counts, sums, and sum of squares for equal-sized intervals. The number of intervals is configurable through the rangeCount property. | int, long, float, double, numeric |
| Value Counts | Includes the most frequent values and their counts for each field. The number of values to calculate per field is configurable through the showTopHowMany property. | all |
| Quantiles | The per-field quantiles (equi-depth histograms). The quantiles to calculate are configurable through the quantilesToCalculate property. | int, long, float, double, numeric |
| Median | The per-field median value. | int, long, float, double, numeric |
| Inter-quartile-range | The per-field inter-quartile-range. | int, long, float, double, numeric |
IMPORTANT!  The correct data type must be selected to avoid overflows. If overflows occur, try increasing the size of the data type from float to double or double to numeric.
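The kind of overflow this note warns about is easy to reproduce in plain Java. The sketch below accumulates a sum of squares once in a float and once in a double; the class OverflowSketch and its method names are hypothetical, and the input values are chosen only to make the overflow visible.

```java
// Plain-Java illustration only; this is not the DataFlow operator.
final class OverflowSketch {

    // Accumulates in float: squares above roughly 3.4e38 overflow to Infinity.
    static float sumOfSquaresAsFloat(float[] values) {
        float total = 0f;
        for (float v : values) {
            total += v * v;
        }
        return total;
    }

    // Accumulates in double: the same inputs stay well within range.
    static double sumOfSquaresAsDouble(float[] values) {
        double total = 0.0;
        for (float v : values) {
            total += (double) v * v;
        }
        return total;
    }

    public static void main(String[] args) {
        float[] values = {1e20f, 2e20f};
        System.out.println("float:  " + sumOfSquaresAsFloat(values));
        System.out.println("double: " + sumOfSquaresAsDouble(values));
    }
}
```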
Code Example
This example calculates summary statistics for the Iris data set. The SummaryStatistics operator produces a PMML model containing summary statistics. This example writes the PMML to a file. It also obtains an in-memory reference to the statistics model and prints selected statistics for the numeric fields to standard output.
Using the SummaryStatistics operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.pmml.PMMLModel;
import com.pervasive.datarush.analytics.pmml.PMMLPort;
import com.pervasive.datarush.analytics.pmml.WritePMML;
import com.pervasive.datarush.analytics.stats.DetailLevel;
import com.pervasive.datarush.analytics.stats.PMMLSummaryStatisticsModel;
import com.pervasive.datarush.analytics.stats.SummaryStatistics;
import com.pervasive.datarush.analytics.stats.UnivariateStats;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.model.GetModel;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

public class IrisSummaryStats {
    public static void main(String[] args) {
        
        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("SummaryStats");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
                DOUBLE("sepal length"), 
                DOUBLE("sepal width"), 
                DOUBLE("petal length"), 
                DOUBLE("petal width"), 
                STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));
                
        // Run summary statistics on the data
        SummaryStatistics summaryStats = graph.add(new SummaryStatistics());
        summaryStats.setDetailLevel(DetailLevel.MULTI_PASS);
        summaryStats.setShowTopHowMany(25);
        graph.connect(reader.getOutput(), summaryStats.getInput());
        
        // Use the GetModel operator to obtain a reference to the statistics model.
        // This reference is valid after the graph is run and can be used to then
        // access the statistics model outside of the graph.
        GetModel<PMMLModel> modelOp = graph.add(new GetModel<PMMLModel>(PMMLPort.FACTORY));
        graph.connect(summaryStats.getOutput(), modelOp.getInput());
        
        // Write the PMML generated by summary stats
        WritePMML pmmlWriter = graph.add(new WritePMML("results/iris-summarystats.pmml"));
        graph.connect(summaryStats.getOutput(), pmmlWriter.getModel());
                        
        // Compile and run the graph
        graph.run();
        
        // Use the model reference to get the actual stats model
        PMMLSummaryStatisticsModel statsModel = (PMMLSummaryStatisticsModel) modelOp.getModel();

        // Print out stats for the numeric fields
        for (String fieldName : new String[] {"sepal length", "sepal width", "petal length", "petal width"}) {
            UnivariateStats fieldStats = statsModel.getFieldStats(fieldName);
            System.out.println("Field: " + fieldName);
            System.out.println("  frequency = " + fieldStats.getTotalFrequency());
            System.out.println("  missing   = " + fieldStats.getMissingFrequency());
            System.out.println("  min       = " + fieldStats.getNumericInfo().getMin());
            System.out.println("  max       = " + fieldStats.getNumericInfo().getMax());
            System.out.println("  mean      = " + fieldStats.getNumericInfo().getMean());
            System.out.println("  stddev    = " + fieldStats.getNumericInfo().getStddev());
        }
    }
}
Using the SummaryStatistics operator in RushScript
var results = dr.summaryStatistics(data, {includedFields:"sepal length", detailLevel:DetailLevel.MULTI_PASS});
Properties
The SummaryStatistics operator has the following properties.
| Name | Type | Description |
| --- | --- | --- |
| detailLevel | | The detail level that is used to compute statistics. The default value is SINGLE_PASS_ONLY. |
| showTopHowMany | int | Provides a cap on the number of value counts to calculate. Default: 25. Memory usage is proportional to the number of distinct values; thus only the top showTopHowMany values are calculated in order to avoid excessive memory consumption in the event that the number of distinct values for a given field is large. Value counts are calculated only at the MULTI_PASS detail level, so this setting is ignored at any other level. |
| rangeCount | int | The number of interval counts to calculate for each numeric field. The default value is 10. Intervals are calculated only at the MULTI_PASS detail level, so this setting is ignored at any other level. |
| quantilesToCalculate | List<BigDecimal> | The quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles). |
| includedFields | List<String> | The fields from the input data set for which statistics are collected. The default value, an empty list, implies all fields. |
| fewDistinctValuesHint | boolean | This should be set to true if a small number of values per column is expected. This will have a large performance benefit, particularly in the cluster, since the overhead of parallelizing computation of quantiles, and so on, can then be avoided. |
Ports
The SummaryStatistics operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data set that is used to build the summary model.
The SummaryStatistics operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
Returns a PMML model corresponding to the ModelStats element described at http://www.dmg.org/v4-0-1/Statistics.html.
DistinctValues Operator
The DistinctValues operator calculates the distinct values of the given input field. This produces a record consisting of the input field with only the distinct values, and a count field with the number of occurrences of each value.
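The output described above is essentially a table of value and count pairs. A minimal plain-Java sketch of that result, assuming string values and using a hypothetical class name DistinctValuesSketch, might look like this; it does not use the DataFlow API and makes no claim about the order in which the operator emits its rows.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the DataFlow operator.
final class DistinctValuesSketch {

    // Counts occurrences of each distinct value, keeping first-seen order.
    static Map<String, Long> distinctCounts(List<String> values) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> classes = Arrays.asList("Iris-setosa", "Iris-virginica", "Iris-setosa");
        // Each map entry corresponds to one output record: the value and its count.
        distinctCounts(classes).forEach((value, count) -> System.out.println(value + "," + count));
    }
}
```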
Code Example
This example calculates the number of distinct types of iris present in the data set.
Using the DistinctValues operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.stats.DistinctValues;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

/**
 * Determine all distinct classes of iris. 
 */
public class DistinctIris {
    public static void main(String[] args) {

        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("DistinctValues");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
                DOUBLE("sepal length"), 
                DOUBLE("sepal width"), 
                DOUBLE("petal length"), 
                DOUBLE("petal width"), 
                STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));
        
        // Initialize the DistinctValues operator
        DistinctValues distinct = graph.add(new DistinctValues());
        distinct.setInputField("class");
        
        // Connect the reader to distinct
        graph.connect(reader.getOutput(), distinct.getInput());
        
        // Write the distinct values for the class field
        WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-distinct.txt", WriteMode.OVERWRITE));
        writer.setFieldSeparator(",");
        writer.setFieldDelimiter("");
        writer.setHeader(true);
        writer.setWriteSingleSink(true);
        
        // Connect distinct to the writer
        graph.connect(distinct.getOutput(), writer.getInput());
        
        // Compile and run the graph
        graph.run();
    }
}
Using the DistinctValues operator in RushScript
var results = dr.distinctValues(data, {inputField:"class"});
Properties
The DistinctValues operator provides the following properties.
Name
Type
Description
inputField
String
The input field for which we calculate distinct values.
sortByCount
boolean
Whether to sort the output by value count.
Ports
The DistinctValues operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data.
The DistinctValues operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
Output consisting of two fields, the distinct values and the counts.
NormalizeValues Operator
The NormalizeValues operator applies normalization methods to fields within an input data flow. The results of the normalization methods are available in the output flow. All input fields are present in the output with the addition of the calculated normalizations.
Normalization methods require certain statistics about the input data such as the mean, standard deviation, minimum value, maximum value, and so on. These statistics are captured in a PMMLModel. The statistics can be gathered by an upstream operator such as SummaryStatistics and passed into this operator. If not, they will be calculated with a first pass over the data and then applied in a second pass.
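The two-pass behavior described above (gather statistics first, then apply them) can be sketched in plain Java for the z-score method. This is an illustration under stated assumptions, not the operator's implementation: the class ZScoreSketch is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not the DataFlow operator.
final class ZScoreSketch {

    static double[] zScores(double[] values) {
        int n = values.length;

        // First pass: gather the statistics the method needs (mean and standard deviation).
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        double stddev = Math.sqrt(squaredDeviations / (n - 1)); // assumed sample stddev

        // Second pass: apply the normalization to every value.
        double[] normalized = new double[n];
        for (int i = 0; i < n; i++) {
            normalized[i] = (values[i] - mean) / stddev;
        }
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zScores(new double[] {5.1, 4.9, 4.7, 4.6})));
    }
}
```

When the statistics are supplied by an upstream operator, only the second loop would be needed.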
Code Example
This example normalizes the Iris data set using the z-score method.
Using the NormalizeValues operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import static com.pervasive.datarush.analytics.functions.StatsFunctions.NormalizeMethod.ZSCORE;
import com.pervasive.datarush.analytics.stats.NormalizeValues;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

/**
 * Compute the normalized z-score values for the iris data set.
 */
public class NormalizeIris {
    public static void main(String[] args) {

        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("Normalize");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
                DOUBLE("sepal length"), 
                DOUBLE("sepal width"), 
                DOUBLE("petal length"), 
                DOUBLE("petal width"), 
                STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));
        
        // Initialize the NormalizeValues operator
        NormalizeValues norm = graph.add(new NormalizeValues());
        norm.setScoreFields("sepal length", "sepal width", "petal length", "petal width");
        norm.setMethod(ZSCORE);
        
        // Connect the reader to normalize
        graph.connect(reader.getOutput(), norm.getInput());
        
        // Write the normalized data
        WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-zscore.txt", WriteMode.OVERWRITE));
        writer.setFieldSeparator(",");
        writer.setFieldDelimiter("");
        writer.setHeader(true);
        writer.setWriteSingleSink(true);
        
        // Connect normalize to the writer
        graph.connect(norm.getOutput(), writer.getInput());
        
        // Compile and run the graph
        graph.run();
    }
}
Using the NormalizeValues operator in RushScript
var scoreFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.normalizeValues(data, {scoreFields:scoreFields, method:NormalizeMethod.ZSCORE});
Properties
The NormalizeValues operator provides the following properties.
Name
Type
Description
includeInputFields
boolean
The indicator of whether to include the input fields in the output data. Setting this property to true causes the input values to be transferred to the output. Otherwise the input values are excluded, leaving only the transformed fields in the output data. Default: true.
method
The normalization method to use.
scoreFields
List<String>
The names of the input fields to normalize. If no field names are provided, all fields will be transformed by default.
Ports
The NormalizeValues operator provides the following input ports.
Name
Type
Get Method
Description
input
getInput()
The input data.
modelInput
getModelInput()
The optional input port used to provide the PMML model containing field statistics needed by normalization methods.
The NormalizeValues operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
The normalized output data.
Rank Operator
The Rank operator is used to rank data using the given rank mode. The data is grouped by the given partition fields and is sorted within the grouping by the ranking fields. An example is to rank employees by salary per department. To rank the highest to lowest salary within department: partition by the department and rank by the salary in descending sort order.
Three different rank modes are supported:
STANDARD
Also known as competition ranking. Items with the same ranking values receive the same rank, and a gap is then left in the ranking numbers. For example: 1, 2, 2, 4
DENSE
Items that compare as equal receive the same ranking. Items following those receive the next ordinal ranking (that is, ranks are not skipped). For example: 1, 2, 2, 3
ORDINAL
Each item receives a distinct ranking, starting at one and increasing by one, producing essentially a row number within the partition. For example: 1, 2, 3, 4
A new output field is created to contain the result of the ranking. The field is named "rank" by default.
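The difference between the three modes is easiest to see on a partition whose second and third items tie. The following plain-Java sketch assigns ranks to values that are already sorted within one partition; the class RankModesSketch and its Mode enum are hypothetical stand-ins and do not use the DataFlow API.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not the DataFlow operator.
final class RankModesSketch {

    enum Mode { STANDARD, DENSE, ORDINAL }

    // Ranks the values of one partition; the input must already be sorted.
    static int[] rank(double[] sorted, Mode mode) {
        int[] ranks = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            boolean tie = i > 0 && sorted[i] == sorted[i - 1];
            if (i == 0) {
                ranks[i] = 1;
            } else if (mode == Mode.ORDINAL) {
                ranks[i] = i + 1;              // row number, ties ignored
            } else if (tie) {
                ranks[i] = ranks[i - 1];       // equal items share a rank
            } else if (mode == Mode.STANDARD) {
                ranks[i] = i + 1;              // leave a gap after ties
            } else {
                ranks[i] = ranks[i - 1] + 1;   // DENSE: no gap after ties
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] sorted = {10, 20, 20, 30};
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " " + Arrays.toString(rank(sorted, mode)));
        }
    }
}
```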
Code Example
In this example we use the Rank operator to order the Iris data set by the "sepal length" field, partitioning by the "class" field.
Using the Rank operator in Java
import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.stats.Rank;
import com.pervasive.datarush.analytics.stats.Rank.RankMode;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

/**
 * Rank the iris data set by sepal length, partition by class
 */
public class RankIris {
    public static void main(String[] args) {

            // Create an empty logical graph
            LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("Rank");

            // Create a delimited text reader for the Iris data
            ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
            reader.setFieldSeparator(" ");
            reader.setHeader(true);
            RecordTokenType irisType = record(
                    DOUBLE("sepal length"), 
                    DOUBLE("sepal width"), 
                    DOUBLE("petal length"), 
                    DOUBLE("petal width"), 
                    STRING("class"));
            reader.setSchema(TextRecord.convert(irisType));
            
            // Initialize the Rank operator
            Rank rank = graph.add(new Rank());
            rank.setPartitionBy("class");
            rank.setRankBy("sepal length");
            rank.setMode(RankMode.STANDARD);
            
            // Connect the reader to rank
            graph.connect(reader.getOutput(), rank.getInput());
            
            // Write the data with the additional rank field
            WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/iris-rank.txt", WriteMode.OVERWRITE));
            writer.setFieldSeparator(",");
            writer.setFieldDelimiter("");
            writer.setHeader(true);
            writer.setWriteSingleSink(true);
            
            // Connect rank to the writer
            graph.connect(rank.getOutput(), writer.getInput());
            
            // Compile and run the graph
            graph.run();
    }
}
Using the Rank operator in RushScript
var results = dr.rank(data, {partitionBy:'class', rankBy:'"sepal length" desc', mode:RankMode.STANDARD});
Properties
The Rank operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| mode | | The ranking mode. Ordinal ranking is used by default. |
| outputField | String | The name of the output field containing the result of the ranking. Defaults to "rank". |
| partitionKeys | List<String> | The fields used to partition the data. At least one field must be specified. |
| rankKeys | List<SortKey> | The fields used to calculate the ranking within each partitioned group. A list of Strings can also be used, in which case the sort order defaults to ascending. The data within each partition is sorted by the specified order. |
Ports
The Rank operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data.
The Rank operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
The original data with the additional rank field.
SumOfSquares Operator
The SumOfSquares operator computes the sum of squares for the given fields in the input data. The inner products are calculated in a distributed fashion with a reduction at the end to produce the sum of squares matrix. Note that all the fields must be of type double or be assignable to a double type.
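As a rough illustration of the result, the following plain-Java sketch accumulates the inner products of every pair of fields over all records. It assumes that element [i][j] of the matrix is the sum over all records of field i times field j; that layout, the class name SumOfSquaresSketch, and the row-major input are assumptions made for the sketch, which does not use the DataFlow API or model the distributed reduction.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not the DataFlow operator.
final class SumOfSquaresSketch {

    // rows[r][c] is the value of field c in record r.
    static double[][] sumOfSquares(double[][] rows, int fieldCount) {
        double[][] matrix = new double[fieldCount][fieldCount];
        for (double[] row : rows) {
            for (int i = 0; i < fieldCount; i++) {
                for (int j = 0; j < fieldCount; j++) {
                    matrix[i][j] += row[i] * row[j];
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2}, {3, 4}};
        System.out.println(Arrays.deepToString(sumOfSquares(rows, 2)));
    }
}
```

In a distributed setting, each partition could build its own partial matrix in this way, and the partial matrices could then be added element by element.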
Code Example
The following example demonstrates computing the sum of squares matrix over three double fields.
Using the SumOfSquares operator in Java
// Calculate the Sum of Squares
SumOfSquares sos = graph.add(new SumOfSquares());
sos.setFieldNames(Arrays.asList("dblfield1", "dblfield2", "dblfield3"));
Using the SumOfSquares operator in RushScript
var results = dr.sumOfSquares(data, {
    fieldNames:['dblfield1', 'dblfield2', 'dblfield3']});
Properties
The SumOfSquares operator provides the following properties.
Name
Type
Description
fieldNames
List<String>
The list of fields for which to compute the sum of squares. The field names must be valid names within the schema of the input port. The field types must be compatible with the double type.
Ports
The SumOfSquares operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The data used to build the model.
The SumOfSquares operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
This port will contain the sum of squares for the specified fields in a matrix. Only a single token will be available on this output port.
CountRanges Operator
The CountRanges operator determines the range group (bin) that each value in the input data set falls in and calculates the total number of values that fall within each range.
The operation is defined by a list of breakpoints that are used as the boundaries for the ranges. A list of n breaks defines n+1 range groups, which are indexed beginning with 1. The first and last groups are each unbounded on one side, so the entire range of possible values is always considered. The range groups are automatically sorted in ascending order based on the comparable interface of the field.
The closure of the range intervals can be adjusted by enabling a closed lower or upper bound, which includes values equal to the boundary in the respective interval. A value can be included in only one range group, so the lower and upper bounds cannot both be closed. Any value that is not included in any range group, such as null or an excluded boundary value, is placed in group 0.
A new output field is created to contain the range group of the specified field. The field is named after the original field with "_RangeGroup" appended to the name. The statistics output of this operator contains the counts of the defined range groups. This output includes two fields: the range group index and the total number of values within that group.
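The grouping rules above can be sketched in plain Java as follows. The sketch assumes the breaks are already sorted in ascending order and follows the rules as stated: n breaks give groups 1 through n+1, and null values and unclaimed boundary values go to group 0. The class CountRangesSketch and its method names are hypothetical, and nothing here uses the DataFlow API.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not the DataFlow operator.
final class CountRangesSketch {

    // Returns the 1-based range group for a value, or 0 when it belongs to no group.
    static int rangeGroup(Double value, double[] breaks,
                          boolean lowerBoundClosed, boolean upperBoundClosed) {
        if (value == null) {
            return 0;
        }
        for (int i = 0; i < breaks.length; i++) {
            if (value < breaks[i]) {
                return i + 1;                 // strictly below break i
            }
            if (value == breaks[i]) {
                if (upperBoundClosed) {
                    return i + 1;             // boundary belongs to the group below it
                }
                if (lowerBoundClosed) {
                    return i + 2;             // boundary belongs to the group above it
                }
                return 0;                     // boundary excluded from every group
            }
        }
        return breaks.length + 1;             // above the last break
    }

    // Counts the values per group; index 0 holds the values that fall in no group.
    static long[] countGroups(Double[] values, double[] breaks,
                              boolean lowerBoundClosed, boolean upperBoundClosed) {
        long[] counts = new long[breaks.length + 2];
        for (Double value : values) {
            counts[rangeGroup(value, breaks, lowerBoundClosed, upperBoundClosed)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] breaks = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        Double[] petalLengths = {1.4, 1.3, 4.7, 6.0, null};
        System.out.println(Arrays.toString(countGroups(petalLengths, breaks, false, true)));
    }
}
```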
Code Example
In this example, we use the CountRanges operator to count values of the Iris data set and bin the values of the "petal length" field.
Using the CountRanges operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("CountRangesIris");
 
//Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
RecordTokenType irisType = record(DOUBLE("sepal length"), DOUBLE("sepal width"), DOUBLE("petal length"), DOUBLE("petal width"), STRING("class"));
reader.setSchema(TextRecord.convert(irisType));
 
//To ensure sort order is preserved
AssertSorted assertSort = graph.add(new AssertSorted());
assertSort.setOrdering("petal length");
graph.connect(reader.getOutput(), assertSort.getInput());
 
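//Initialize CountRanges Operator
// The values below mirror the RushScript example that follows.
// Assumption: setBreaks accepts a java.util.List of values of the field's type.
String fieldName = "petal length";
List<Double> breaks = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
boolean lowerClosed = false;
boolean upperClosed = true;
CountRanges countRanges = graph.add(new CountRanges());
countRanges.setFieldName(fieldName);
countRanges.setBreaks(breaks);
countRanges.setLowerBoundClosed(lowerClosed);
countRanges.setUpperBoundClosed(upperClosed);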
 
//Connect the reader to CountRanges
graph.connect(assertSort.getOutput(), countRanges.getInput());
 
// Write the data with the additional range group field
String outputPath = "results/iris-CountRanges.txt";
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
 
// Write the per-group counts to a second file next to the main output
String statsOutputPath = outputPath.replace(".txt", "-stats.txt");
WriteDelimitedText statswriter = graph.add(new WriteDelimitedText(statsOutputPath, WriteMode.OVERWRITE));
statswriter.setFieldDelimiter("");
statswriter.setHeader(true);
statswriter.disableParallelism();
 
//Connect CountRanges to the writer
graph.connect(countRanges.getOutput(), writer.getInput());
graph.connect(countRanges.getStatsOutput(), statswriter.getInput());
 
// Compile and run the graph
graph.run();
Using the CountRanges operator in RushScript
var breaks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
var results = dr.countRanges(data, {fieldName:"petal length", upperBoundClosed:true, breaks:breaks})
Properties
The CountRanges operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| fieldName | String | The name of the field whose values will be divided into ranges. |
| breaks | List | The values that will be used as the boundaries for the ranges. These should be of the same type as the selected field. |
| lowerBoundClosed | boolean | Whether the lower boundary defined by a range should be included in the group. Default is false. |
| upperBoundClosed | boolean | Whether the upper boundary defined by a range should be included in the group. Default is false. |
Ports
The CountRanges operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data.
The CountRanges operator provides two output ports.
Name
Type
Get Method
Description
output
getOutput()
The original data with the additional range group field.
statsOutput
getStatsOutput()
The count data for the defined ranges.
EqualRangeBinning Operator
The EqualRangeBinning operator assigns each numeric value to one of a set of equally ranged bins that span a given total range of values. The output adds a single integer field containing the bin that the value of the selected field falls within. Bins are numbered from 1 to n, where n is the total number of bins defined by the user.
 
The desired number of bins must be specified although the lower or upper bound may optionally be omitted. If the bounds are not defined the operator will determine an appropriate value for a missing bound based on the minimum and maximum values discovered in the data during runtime. Any null values or values outside of the inclusion range defined by the bounds will be considered an outlier and will either be filtered from the data or optionally included within bin 0 if desired.
Additionally, the lower and upper bounds of each individual bin can optionally be output with the bin values as two additional fields so that the range of each bin is available. The range of each bin consists of an open lower bound and a closed upper bound for all bins except the maximum bin, which also has an open upper bound. Any outlier values will always be contained in bin 0 if included.
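To show what "equally ranged" means for the bin boundaries, the following plain-Java sketch computes the edges of the bins for a given lower bound, upper bound, and bin count. It only derives the edges and deliberately does not model how values exactly on an edge or outside the bounds are assigned; the class EqualRangeEdgesSketch is hypothetical and does not use the DataFlow API.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not the DataFlow operator.
final class EqualRangeEdgesSketch {

    // Returns binCount + 1 edges; bin i (1-based) spans edges[i - 1] to edges[i].
    static double[] binEdges(double lowerBound, double upperBound, int binCount) {
        double width = (upperBound - lowerBound) / binCount;
        double[] edges = new double[binCount + 1];
        for (int i = 0; i <= binCount; i++) {
            edges[i] = lowerBound + i * width;
        }
        edges[binCount] = upperBound; // avoid rounding drift on the final edge
        return edges;
    }

    public static void main(String[] args) {
        // Four bins between 5.0 and 7.0, each 0.5 wide.
        System.out.println(Arrays.toString(binEdges(5.0, 7.0, 4)));
    }
}
```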
Code Example
In this example we bin the sepal length field of the Iris data set into four bins between 5.0 and 7.0.
Using the EqualRangeBinning operator in Java
// Create the binning operator and add it to the graph
EqualRangeBinning equalRangeBinner = graph.add(new EqualRangeBinning("sepal length", 4));
equalRangeBinner.setLowerBound(5.0);
equalRangeBinner.setUpperBound(7.0);
Using the EqualRangeBinning operator in RushScript
var equalRangeBinner = dr.equalRangeBinning(data, {fieldName:'sepal length', binCount:4, lowerBound:5.0, upperBound:7.0});
Properties
The EqualRangeBinning operator provides the following properties.
Name
Type
Description
fieldName
String
The name of the numeric field to bin.
binCount
int
The number of equally ranged bins to use.
lowerBound
numeric
The lower bound on the first bin.
upperBound
numeric
The upper bound on the last bin.
includeOutlier
boolean
Whether the output includes or filters outliers.
includeRanges
boolean
Whether the output explicitly includes the ranges on each bin.
Ports
The EqualRangeBinning operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The data including the field to bin.
The EqualRangeBinning operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
The data with included bins. Contains the full original input data unless outliers are filtered as well as new fields including the bin and optionally ranges.
MostFrequentValues Operator
The MostFrequentValues operator is used to determine which values are the most frequent within the selected fields of the input data. A maximum must be specified to indicate how many of the most common values should be output.
The output contains two fields for each field selected from the input: a value field holding the most frequent values of that input field, and an associated field holding the frequency count of each value.
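The value and count pairing can be sketched in plain Java for a single field. This is a minimal illustration of counting values and keeping the most frequent ones, not the operator's implementation. The class and method names are hypothetical, and both the tie-break (ascending value) and the skipping of nulls are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a top-N frequency count; not part of the DataFlow API.
class MostFrequentSketch {

    // Returns up to 'limit' (value, count) entries, most frequent first.
    static List<Map.Entry<Double, Long>> topValues(List<Double> values, int limit) {
        Map<Double, Long> counts = new HashMap<>();
        for (Double v : values) {
            if (v == null) {
                continue; // assumption: nulls are ignored in this sketch
            }
            counts.merge(v, 1L, Long::sum);
        }
        List<Map.Entry<Double, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // higher count first
            // Assumption: ties are broken by ascending value for determinism.
            return byCount != 0 ? byCount : Double.compare(a.getKey(), b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        List<Double> sepalLengths = List.of(5.1, 4.9, 5.1, 5.0, 5.0, 5.1, 4.7);
        for (Map.Entry<Double, Long> e : topValues(sepalLengths, 2)) {
            System.out.println(e.getKey() + " occurs " + e.getValue() + " times");
        }
    }
}
```

Each returned entry corresponds to one output row for the field: the value and its frequency count.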
 
Code Example
In this example we use the MostFrequentValues operator to find the top 5 values in each of the numeric fields in the Iris data set.
Using the MostFrequentValues operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("mostFreqValuesIris");
 
// Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText(getResourcePath("iris.txt")));
reader.setFieldSeparator(" ");
reader.setHeader(true);
RecordTokenType irisType = record(DOUBLE("sepal length"), DOUBLE("sepal width"), DOUBLE("petal length"), DOUBLE("petal width"), STRING("class"));
reader.setSchema(TextRecord.convert(irisType));
 
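// Fields to analyze and the number of top values to report
String[] fieldNames = {"sepal length", "sepal width", "petal length", "petal width"};
int topNum = 5;

// Initialize the MostFrequentValues operator
MostFrequentValues mostFreq = graph.add(new MostFrequentValues());
if (fieldNames.length > 0) {
    mostFreq.setFieldNames(fieldNames);
}
if (topNum >= 0) {
    mostFreq.setShowTopHowMany(topNum);
}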
 
// Connect the reader to MostFrequentValues
graph.connect(reader.getOutput(), mostFreq.getInput());
 
// Write the MostFrequentValues results (outputPath is the caller-supplied destination path)
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
writer.setSaveMetadata(false);
 
// Connect MostFrequentValues to the writer
graph.connect(mostFreq.getOutput(), writer.getInput());
 
// Compile and run the graph
graph.run();
Using the MostFrequentValues operator in RushScript
var freqFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.mostFrequentValues(data, {fieldNames:freqFields, showTopHowMany:5});
Properties
The MostFrequentValues operator provides the following properties.
Name
Type
Description
fieldNames
List<String>
The names of the input fields for which to calculate value frequencies.
showTopHowMany
int
The maximum number of value frequencies to calculate. The default is 25.
fewDistinctValuesHint
boolean
A hint as to whether there are expected to be a small number of distinct values.
Ports
The MostFrequentValues operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The input data.
The MostFrequentValues operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
Output consisting of two fields for each selected input field: the most frequent values and their counts.
Last modified date: 03/10/2025