DF 8.2 | Statistical Operators

Statistical Operators

The DataFlow operator library includes several pre-built operators for statistics and data summarization. For more information, refer to the following topics:

• DataQualityAnalyzer Operator

• SummaryStatistics Operator

• DistinctValues Operator

• NormalizeValues Operator

• Rank Operator

• SumOfSquares Operator

• CountRanges Operator

• EqualRangeBinning Operator

• MostFrequentValues Operator

DataQualityAnalyzer Operator

The DataQualityAnalyzer operator is used to evaluate a set of quality tests on an input data set. Those records for which all tests pass are considered "clean" and are thus sent to the clean output. Those records for which any tests fail are considered "dirty" and are thus sent to the dirty output.

This operator also produces a PMML summary model that includes the following statistics:

• totalFrequency: Total number of rows.

• invalidFrequency: Total number of rows for which at least one test involving the given field failed.

• testFailureCounts: Per-test failure counts for each test involving the given field.
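
The clean/dirty split and the failure counts described above can be illustrated with a minimal plain-Java sketch. It does not use the DataFlow API: the class `QualitySplitSketch`, its `Result` type, and the map-based rows are hypothetical names chosen for illustration, and the counts are kept per test rather than per field to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class QualitySplitSketch {

    /** Outcome of applying all tests: the two record sets plus summary counts. */
    public static final class Result {
        public final List<Map<String, Object>> clean = new ArrayList<>();
        public final List<Map<String, Object>> dirty = new ArrayList<>();
        public final Map<String, Integer> testFailureCounts = new LinkedHashMap<>();
        public int totalFrequency;
    }

    /** Applies every named test to every row; a row is clean only if all tests pass. */
    public static Result analyze(List<Map<String, Object>> rows,
                                 Map<String, Predicate<Map<String, Object>>> tests) {
        Result result = new Result();
        for (String name : tests.keySet()) {
            result.testFailureCounts.put(name, 0);
        }
        for (Map<String, Object> row : rows) {
            result.totalFrequency++;
            boolean allPassed = true;
            for (Map.Entry<String, Predicate<Map<String, Object>>> test : tests.entrySet()) {
                if (!test.getValue().test(row)) {
                    allPassed = false;
                    result.testFailureCounts.merge(test.getKey(), 1, Integer::sum);
                }
            }
            (allPassed ? result.clean : result.dirty).add(row);
        }
        return result;
    }

    /** Builds a row; HashMap is used because the class value may be null. */
    private static Map<String, Object> row(String cls, double petalLength) {
        Map<String, Object> r = new HashMap<>();
        r.put("class", cls);
        r.put("petal length", petalLength);
        return r;
    }

    /** Three sample rows: one clean, one with a null class, one with a zero length. */
    public static Result demo() {
        Map<String, Predicate<Map<String, Object>>> tests = new LinkedHashMap<>();
        tests.put("class_not_null", r -> r.get("class") != null);
        tests.put("length_gt_zero", r -> ((Double) r.get("petal length")) > 0);

        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("setosa", 1.4));
        rows.add(row(null, 4.7));
        rows.add(row("virginica", 0.0));
        return analyze(rows, tests);
    }

    public static void main(String[] args) {
        Result r = demo();
        System.out.println("total=" + r.totalFrequency + " clean=" + r.clean.size()
                + " dirty=" + r.dirty.size() + " failures=" + r.testFailureCounts);
    }
}
```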

Using Expressions to Create Quality Metrics

Quality metrics can be specified by using the expression language. Any number of quality metrics can be specified by passing a single expression directly to the DataQualityAnalyzer operator. The syntax of a quality metric expression is:

<predicate expression 1> as <metric name 1>[, <predicate expression 2> as <metric name 2>, ...]

Each expression must be a predicate expression that returns a boolean value. For example, the following expression can be passed directly to the DataQualityAnalyzer, assuming your input has the specified input fields:

class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero

As with field names used elsewhere within expressions, the metric name can be surrounded by back-ticks if it contains non-alphanumeric characters, such as in the expression:

class is not null as `class-not-null`

For more information about syntax and available functions, see the Expression Language.

Code Example

This example demonstrates using the DataQualityAnalyzer operator to ensure the "class" field is non-null and that the petal measurements are greater than zero. This example uses a quality metric expression to specify the metrics to apply to the input data.

Using the DataQualityAnalyzer operator in Java

// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
String qualityTests =
    "class is not null as class_not_null, " +
    "`petal length` > 0 as length_gt_zero, " +
    "`petal width` > 0 as width_gt_zero";
dqa.setTests(qualityTests);

This example demonstrates using the DataQualityAnalyzer operator by creating QualityTest instances directly.

Using the DataQualityAnalyzer operator in Java

// Create the DataQualityAnalyzer operator
DataQualityAnalyzer dqa = graph.add(new DataQualityAnalyzer());
QualityTest test1 = new QualityTest(
                "class_not_null",
                Predicates.notNull("class"));
QualityTest test2 = new QualityTest(
                "length_gt_zero",
                Predicates.gt(FieldReference.value("petal length"), ConstantReference.constant(0)));
QualityTest test3 = new QualityTest(
                "width_gt_zero",
                Predicates.gt(FieldReference.value("petal width"), ConstantReference.constant(0)));
dqa.setTests(Arrays.asList(test1, test2, test3));

Using the DataQualityAnalyzer operator in RushScript

var results = dr.dataQualityAnalyzer(data, {tests:'class is not null as class_not_null, `petal length` > 0 as length_gt_zero, `petal width` > 0 as width_gt_zero'});

Properties

The DataQualityAnalyzer operator provides the following properties.

Name	Type	Description
tests	List<DataQualityAnalyzer.QualityTest> or String	The set of tests to apply to the input data set. The quality tests can be specified using an expression (String) or as a list of QualityTest instances.

Ports

The DataQualityAnalyzer operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data set to be tested.

The DataQualityAnalyzer operator provides the following output ports.

Name	Type	Get Method	Description
clean	RecordPort	getClean()	The output port for the "clean" rows.
dirty	RecordPort	getDirty()	The output port for the "dirty" rows.
model	PMMLPort	getModel()	The output port for the PMML statistics model.

SummaryStatistics Operator

The SummaryStatistics operator discovers various metrics of an input data set based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.

If the detail level is SINGLE_PASS_ONLY_SIMPLE, the following statistics are calculated.

Statistic	Description	Field Types
Missing count	Number of missing values per field	all
Min	The minimum value per field	all
Max	The maximum value per field	all
Mean	The mean value per field	int, long, float, double, numeric
Stddev	The standard deviation per field	int, long, float, double, numeric
Variance	The variance per field	int, long, float, double, numeric
Sum	The sum per field	int, long, float, double, numeric
Sum of squares	The sum of squares per field	int, long, float, double, numeric
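
To show why these statistics need only one pass, the following plain-Java sketch accumulates running totals for a single numeric field. It is not the operator's implementation: `SimpleStatsSketch` and its members are hypothetical names, the variance uses the sample (n - 1) form as an assumption because the text does not say which form the operator reports, and the simple sum-of-squares formula is chosen for brevity rather than numerical robustness.

```java
public class SimpleStatsSketch {

    /** Running totals gathered in a single pass over one numeric field. */
    public static final class Stats {
        public long count;    // number of non-missing values
        public long missing;  // number of null (missing) values
        public double min = Double.POSITIVE_INFINITY;
        public double max = Double.NEGATIVE_INFINITY;
        public double sum;
        public double sumOfSquares;

        public double mean() {
            return sum / count;
        }

        // Sample variance (n - 1 denominator); an assumption of this sketch.
        public double variance() {
            return (sumOfSquares - sum * sum / count) / (count - 1);
        }

        public double stddev() {
            return Math.sqrt(variance());
        }
    }

    /** Visits each value once and updates every running total. */
    public static Stats summarize(Double[] values) {
        Stats s = new Stats();
        for (Double v : values) {
            if (v == null) {
                s.missing++;
                continue;
            }
            s.count++;
            s.min = Math.min(s.min, v);
            s.max = Math.max(s.max, v);
            s.sum += v;
            s.sumOfSquares += v * v;
        }
        return s;
    }

    public static void main(String[] args) {
        Stats s = summarize(new Double[] {2.0, 4.0, null, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0});
        System.out.println("count=" + s.count + " missing=" + s.missing
                + " min=" + s.min + " max=" + s.max
                + " mean=" + s.mean() + " variance=" + s.variance());
    }
}
```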

If the detail level is SINGLE_PASS_ONLY, all of the statistics that are calculated for SINGLE_PASS_ONLY_SIMPLE are calculated. In addition, the following are also calculated.

Statistic	Description	Field Types
Correlation	A matrix is produced where the elements correspond to correlation of pairs of fields.	int, long, float, double, numeric
Covariance	A matrix is produced where the elements correspond to covariance of pairs of fields.	int, long, float, double, numeric
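
As an illustration of the two matrices, the plain-Java sketch below computes a covariance matrix and derives the Pearson correlation matrix from it. The class `CovarianceSketch` is hypothetical, the sample (n - 1) covariance is an assumption, and for clarity the sketch computes the means first rather than working in a single pass as the operator is described to do.

```java
public class CovarianceSketch {

    /** Sample covariance matrix; columns[i] holds all values of field i. */
    public static double[][] covariance(double[][] columns) {
        int fields = columns.length;
        int n = columns[0].length;

        // Mean of each field
        double[] means = new double[fields];
        for (int i = 0; i < fields; i++) {
            double sum = 0;
            for (double v : columns[i]) {
                sum += v;
            }
            means[i] = sum / n;
        }

        // Element (i, j) is the covariance of field i with field j
        double[][] cov = new double[fields][fields];
        for (int i = 0; i < fields; i++) {
            for (int j = 0; j < fields; j++) {
                double acc = 0;
                for (int r = 0; r < n; r++) {
                    acc += (columns[i][r] - means[i]) * (columns[j][r] - means[j]);
                }
                cov[i][j] = acc / (n - 1);
            }
        }
        return cov;
    }

    /** Pearson correlation: covariance scaled by both standard deviations. */
    public static double[][] correlation(double[][] cov) {
        int fields = cov.length;
        double[][] corr = new double[fields][fields];
        for (int i = 0; i < fields; i++) {
            for (int j = 0; j < fields; j++) {
                corr[i][j] = cov[i][j] / Math.sqrt(cov[i][i] * cov[j][j]);
            }
        }
        return corr;
    }

    public static void main(String[] args) {
        double[][] columns = {{1, 2, 3, 4}, {2, 4, 6, 8}};
        double[][] cov = covariance(columns);
        double[][] corr = correlation(cov);
        System.out.println("cov[0][1]=" + cov[0][1] + " corr[0][1]=" + corr[0][1]);
    }
}
```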

If the detail level is MULTI_PASS, all of the statistics that are calculated for SINGLE_PASS_ONLY are calculated. In addition, the following are also calculated.

Statistic	Description	Field Types
Intervals	Includes counts, sums, and sum of squares for equal-sized intervals. The number of intervals is configurable through the rangeCount property.	int, long, float, double, numeric
Value Counts	Includes the most frequent values and their counts for each field. The number of values to calculate per field is configurable through the showTopHowMany property.	all
Quantiles	The per-field quantiles (equi-depth histograms). The quantiles to calculate are configurable through the quantilesToCalculate property.	int, long, float, double, numeric
Median	The per-field median value.	int, long, float, double, numeric
Inter-quartile-range	The per-field inter-quartile-range.	int, long, float, double, numeric
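
The relationship between quantiles, the median, and the inter-quartile range can be sketched in plain Java as follows. Several quantile definitions exist and the text does not say which one the operator uses; this sketch assumes linear interpolation between the closest ranks. The class `QuantileSketch` is a hypothetical name.

```java
import java.util.Arrays;

public class QuantileSketch {

    /** Quantile p (0..1) using linear interpolation between closest ranks. */
    public static double quantile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** The median is the 0.50 quantile. */
    public static double median(double[] values) {
        return quantile(values, 0.50);
    }

    /** The inter-quartile range is the 0.75 quantile minus the 0.25 quantile. */
    public static double interQuartileRange(double[] values) {
        return quantile(values, 0.75) - quantile(values, 0.25);
    }

    public static void main(String[] args) {
        double[] values = {3, 1, 5, 2, 4};
        System.out.println("median=" + median(values)
                + " iqr=" + interQuartileRange(values));
    }
}
```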

IMPORTANT! The correct data type must be selected to avoid overflows. If overflows occur, try increasing the size of the data type from float to double or double to numeric.

Code Example

This example calculates summary statistics for the Iris data set. The SummaryStatistics operator produces a PMML model containing the summary statistics. The example writes the PMML to a file. It also obtains an in-memory reference to the statistics model and prints selected statistics for each numeric field.

Using the SummaryStatistics operator in Java

import static com.pervasive.datarush.types.TokenTypeConstant.DOUBLE;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.analytics.pmml.PMMLModel;
import com.pervasive.datarush.analytics.pmml.PMMLPort;
import com.pervasive.datarush.analytics.pmml.WritePMML;
import com.pervasive.datarush.analytics.stats.DetailLevel;
import com.pervasive.datarush.analytics.stats.PMMLSummaryStatisticsModel;
import com.pervasive.datarush.analytics.stats.SummaryStatistics;
import com.pervasive.datarush.analytics.stats.UnivariateStats;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.model.GetModel;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.types.RecordTokenType;

public class IrisSummaryStats {
    public static void main(String[] args) {

        // Create an empty logical graph
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("SummaryStats");

        // Create a delimited text reader for the Iris data
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
        reader.setFieldSeparator(" ");
        reader.setHeader(true);
        RecordTokenType irisType = record(
                DOUBLE("sepal length"),
                DOUBLE("sepal width"),
                DOUBLE("petal length"),
                DOUBLE("petal width"),
                STRING("class"));
        reader.setSchema(TextRecord.convert(irisType));

        // Run summary statistics on the data
        SummaryStatistics summaryStats = graph.add(new SummaryStatistics());
        summaryStats.setDetailLevel(DetailLevel.MULTI_PASS);
        summaryStats.setShowTopHowMany(25);
        graph.connect(reader.getOutput(), summaryStats.getInput());

        // Use the GetModel operator to obtain a reference to the statistics model.
        // This reference is valid after the graph is run and can be used to then
        // access the statistics model outside of the graph.
        GetModel<PMMLModel> modelOp = graph.add(new GetModel<PMMLModel>(PMMLPort.FACTORY));
        graph.connect(summaryStats.getOutput(), modelOp.getInput());

        // Write the PMML generated by summary stats
        WritePMML pmmlWriter = graph.add(new WritePMML("results/iris-summarystats.pmml"));
        graph.connect(summaryStats.getOutput(), pmmlWriter.getModel());

        // Compile and run the graph
        graph.run();

        // Use the model reference to get the actual stats model
        PMMLSummaryStatisticsModel statsModel = (PMMLSummaryStatisticsModel) modelOp.getModel();

        // Print out stats for the numeric fields
        for (String fieldName : new String[] {"sepal length", "sepal width", "petal length", "petal width"}) {
            UnivariateStats fieldStats = statsModel.getFieldStats(fieldName);
            System.out.println("Field: " + fieldName);
            System.out.println("  frequency = " + fieldStats.getTotalFrequency());
            System.out.println("  missing   = " + fieldStats.getMissingFrequency());
            System.out.println("  min       = " + fieldStats.getNumericInfo().getMin());
            System.out.println("  max       = " + fieldStats.getNumericInfo().getMax());
            System.out.println("  mean      = " + fieldStats.getNumericInfo().getMean());
            System.out.println("  stddev    = " + fieldStats.getNumericInfo().getStddev());
        }
    }
}

Using the SummaryStatistics operator in RushScript

var results = dr.summaryStatistics(data, {includedFields:"sepal length", detailLevel:DetailLevel.MULTI_PASS});

Properties

The SummaryStatistics operator has the following properties.

Name	Type	Description
detailLevel	DetailLevel	The detail level that is used to compute statistics. The default value is SINGLE_PASS_ONLY.
showTopHowMany	int	Provides a cap on the number of value counts to calculate. Default: 25. Memory usage is proportional to the number of distinct values; thus only the top showTopHowMany values are calculated in order to avoid excessive memory consumption in the event that the number of distinct values for a given field is large. This setting is ignored if the detail level is SINGLE_PASS_ONLY.
rangeCount	int	The number of interval counts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is SINGLE_PASS_ONLY.
quantilesToCalculate	List<BigDecimal>	The quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles).
includedFields	List<String>	The fields from the input data set for which we are collecting statistics. The default value of "empty list" implies "all fields".
fewDistinctValuesHint	boolean	This should be set to true if a small number of values per column is expected. This will have a large performance benefit, particularly in the cluster, since we can then avoid the overhead of parallelizing computation of quantiles, and so on.

Ports

The SummaryStatistics operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data set that is used to build the summary model.

The SummaryStatistics operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	Returns a PMML model corresponding to the ModelStats element described at http://www.dmg.org/v4-0-1/Statistics.html.

DistinctValues Operator

The DistinctValues operator calculates the distinct values of the given input field. This produces a record consisting of the input field with only the distinct values, and a count field with the number of occurrences of each value.
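
The shape of the result, one row per distinct value with its occurrence count, can be illustrated with a short plain-Java sketch. It does not use the DataFlow API; `DistinctValuesSketch` is a hypothetical name and the map stands in for the two-field output record.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctValuesSketch {

    /** Maps each distinct value to the number of times it occurs. */
    public static Map<String, Long> distinctCounts(List<String> values) {
        // LinkedHashMap keeps values in order of first appearance
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> classes = Arrays.asList(
                "setosa", "versicolor", "setosa", "virginica", "setosa");
        for (Map.Entry<String, Long> e : distinctCounts(classes).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```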

Code Example

This example calculates the number of distinct types of iris present in the data set.

Using the DistinctValues operator in Java

Using the DistinctValues operator in RushScript

var results = dr.distinctValues(data, {inputField:"class"});

Properties

The DistinctValues operator provides the following properties.

Name	Type	Description
inputField	String	The input field for which we calculate distinct values.
sortByCount	boolean	Whether to sort the output by value count.

Ports

The DistinctValues operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The DistinctValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output consisting of two fields, the distinct values and the counts.

NormalizeValues Operator

The NormalizeValues operator applies normalization methods to fields within an input data flow. The results of the normalization methods are available in the output flow. By default, all input fields are present in the output along with the calculated normalizations.

Normalization methods require certain statistics about the input data such as the mean, standard deviation, minimum value, maximum value, and so on. These statistics are captured in a PMMLModel. The statistics can be gathered by an upstream operator such as SummaryStatistics and passed into this operator. If not, they will be calculated with a first pass over the data and then applied in a second pass.
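
The two-pass behavior can be illustrated for the z-score method with a plain-Java sketch: the first pass gathers the mean and standard deviation, and the second pass applies z = (x - mean) / stddev. `ZScoreSketch` is a hypothetical name, and the sample (n - 1) standard deviation is an assumption; the operator may use a different form.

```java
import java.util.Arrays;

public class ZScoreSketch {

    /** Returns the z-score of each value: (x - mean) / stddev. */
    public static double[] zScores(double[] values) {
        int n = values.length;

        // First pass: gather the statistics the method needs
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Sample standard deviation; an assumption of this sketch
        double stddev = Math.sqrt(squaredDeviations / (n - 1));

        // Second pass: apply the normalization
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            result[i] = (values[i] - mean) / stddev;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zScores(new double[] {2.0, 4.0, 6.0})));
    }
}
```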

Code Example

This example normalizes the Iris data set using the z-score method.

Using the NormalizeValues operator in Java

Using the NormalizeValues operator in RushScript

var scoreFields = ["sepal length", "sepal width", "petal length", "petal width"];
var results = dr.normalizeValues(data, {scoreFields:scoreFields, method:NormalizeMethod.ZSCORE});

Properties

The NormalizeValues operator provides the following properties.

Name	Type	Description
includeInputFields	boolean	The indicator of whether to include the input fields in the output data. Setting this property to true causes the input values to be transferred to the output. Otherwise the input values are excluded, leaving only the transformed fields in the output data. Default: true.
method	NormalizeMethod	The normalization method to use.
scoreFields	List<String>	The names of the input fields to normalize. If no field names are provided, all fields will be transformed by default.

Ports

The NormalizeValues operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.
modelInput	PMMLPort	getModelInput()	The optional input port used to provide the PMML model containing field statistics needed by normalization methods.

The NormalizeValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The normalized output data.

Rank Operator

The Rank operator is used to rank data using the given rank mode. The data is grouped by the given partition fields and is sorted within the grouping by the ranking fields. An example is to rank employees by salary per department. To rank the highest to lowest salary within department: partition by the department and rank by the salary in descending sort order.

Three different rank modes are supported:

STANDARD

Also known as competition ranking, items with the same ranking values have the same rank and then a gap is left in the ranking numbers. For example: 1, 2, 2, 4

DENSE

Items that comparison determines are equal receive the same ranking. Items following those receive the next ordinal ranking (that is, ranks are not skipped). For example: 1, 2, 2, 3

ORDINAL

Each item receives a distinct ranking, starting at one and increasing by one, producing essentially a row number within the partition. For example: 1, 2, 3, 4

A new output field is created to contain the result of the ranking. The field is named "rank" by default.
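
The difference between the three rank modes can be illustrated with a small plain-Java sketch that ranks the already-sorted values of one partition. It does not use the DataFlow API; `RankModesSketch` and its `Mode` enum are hypothetical names that mirror the mode names above.

```java
import java.util.Arrays;

public class RankModesSketch {

    public enum Mode { STANDARD, DENSE, ORDINAL }

    /** Ranks values that are already sorted within a single partition. */
    public static int[] rank(double[] sortedValues, Mode mode) {
        int[] ranks = new int[sortedValues.length];
        for (int i = 0; i < sortedValues.length; i++) {
            boolean tie = i > 0
                    && Double.compare(sortedValues[i], sortedValues[i - 1]) == 0;
            switch (mode) {
                case ORDINAL:
                    // Row number within the partition
                    ranks[i] = i + 1;
                    break;
                case STANDARD:
                    // Ties share a rank; the next distinct value skips ahead
                    ranks[i] = tie ? ranks[i - 1] : i + 1;
                    break;
                case DENSE:
                    // Ties share a rank; the next distinct value takes the next rank
                    ranks[i] = tie ? ranks[i - 1] : (i == 0 ? 1 : ranks[i - 1] + 1);
                    break;
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] salaries = {10, 20, 20, 30};
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + Arrays.toString(rank(salaries, mode)));
        }
    }
}
```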

Code Example

In this example we use the Rank operator to order the Iris data set by the "sepal length" field, partitioning by the "class" field.

Using the Rank operator in Java

Using the Rank operator in RushScript

var results = dr.rank(data, {partitionBy:'class', rankBy:'"sepal length" desc', mode:RankMode.STANDARD});

Properties

The Rank operator provides the following properties.

Name	Type	Description
mode	RankMode	The ranking mode. Ordinal ranking is used by default.
outputField	String	The name of the output field containing the result of the ranking. Defaults to "rank".
partitionKeys	List<String>	The fields used to partition the data. At least one field must be specified.
rankKeys	List<SortKey>	The fields used to rank the data. A list of Strings can also be used, in which case the sort order defaults to 'ascending'. The data within each partition is sorted by the specified order. This specifies the set of fields used to calculate the ranking within each partitioned group.

Ports

The Rank operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The Rank operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The original data with the additional rank field.

SumOfSquares Operator

The SumOfSquares operator computes the sum of squares for the given fields in the input data. The inner products are calculated in a distributed fashion with a reduction at the end to produce the sum of squares matrix. Note that all the fields must be of type double or be assignable to a double type.
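
One way to read the description above is that each partition of the data accumulates a partial matrix of inner products, and the partial matrices are then added together in the final reduction. The plain-Java sketch below illustrates that reading; it is not the operator's implementation, and `SumOfSquaresSketch`, `partial`, and `merge` are hypothetical names.

```java
import java.util.Arrays;

public class SumOfSquaresSketch {

    /** Partial matrix for one chunk of rows: element (i, j) sums x_i * x_j. */
    public static double[][] partial(double[][] rows) {
        int fields = rows[0].length;
        double[][] m = new double[fields][fields];
        for (double[] row : rows) {
            for (int i = 0; i < fields; i++) {
                for (int j = 0; j < fields; j++) {
                    m[i][j] += row[i] * row[j];
                }
            }
        }
        return m;
    }

    /** Reduction step: element-wise sum of two partial matrices. */
    public static double[][] merge(double[][] a, double[][] b) {
        double[][] m = new double[a.length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a.length; j++) {
                m[i][j] = a[i][j] + b[i][j];
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Two chunks stand in for data processed on separate partitions
        double[][] chunk1 = {{1, 2}, {3, 4}};
        double[][] chunk2 = {{5, 6}};
        double[][] total = merge(partial(chunk1), partial(chunk2));
        System.out.println(Arrays.deepToString(total));
    }
}
```

The diagonal of the resulting matrix holds the sum of squares of each field, and the off-diagonal elements hold the sums of cross products.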

Code Example

The following example demonstrates computing the sum of squares matrix over three double fields.

Using the SumOfSquares operator in Java

// Calculate the Sum of Squares
SumOfSquares sos = graph.add(new SumOfSquares());
sos.setFieldNames(Arrays.asList(new String[]{"dblfield1", "dblfield2", "dblfield3"}));

Using the SumOfSquares operator in RushScript

var results = dr.sumOfSquares(data, {
fieldNames:['dblfield1', 'dblfield2', 'dblfield3']});

Properties

The SumOfSquares operator provides the following properties.

Name	Type	Description
fieldNames	List<String>	The list of fields for which to compute the sum of squares. The field names must be valid names within the schema of the input port. The field types must be compatible with the double type.

Ports

The SumOfSquares operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data used to build the model.

The SumOfSquares operator provides a single output port.

Name	Type	Get Method	Description
output	SimpleModelPort	getOutput()	This port will contain the sum of squares for the specified fields in a matrix. Only a single token will be available on this output port.

CountRanges Operator

The CountRanges operator determines which range group each value of a field in the input data set falls into and calculates the total number of values within each group.

The operation is defined by a list of breakpoints that are used as the boundaries of the ranges. The breakpoints are automatically sorted in ascending order, based on the comparable interface of the field, and the entire range of possible values is always considered. A list of n breaks defines n+1 range groups, which are indexed beginning with 1. The first and last groups are each unbounded on one side.

The closure of the range intervals can be adjusted by enabling a closed lower or upper bound, which includes values equal to a boundary in the respective interval. Because a value can be included in only one range group, the lower and upper bounds cannot both be closed. Any value that is not included in a range group, such as a null value or an excluded boundary value, is assigned to group 0.

A new output field is created to contain the range group of the specified field. The field is named after the original field with "_RangeGroup" appended to the name. The statistics output of this operator outputs the counts of the defined range groups. This output includes two fields, the range group index and the total number of values within that group.
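
The group numbering can be illustrated with a plain-Java sketch that assigns a range group to each value and tallies the groups. It reflects one reading of the description above and is not the operator's implementation: `CountRangesSketch` is a hypothetical name, and treating NaN like null is an assumption.

```java
import java.util.Arrays;

public class CountRangesSketch {

    /** Returns the 1-based range group of a value, or 0 if it belongs to none. */
    public static int rangeGroup(Double value, double[] sortedBreaks,
                                 boolean lowerBoundClosed, boolean upperBoundClosed) {
        if (lowerBoundClosed && upperBoundClosed) {
            throw new IllegalArgumentException("Both bounds cannot be closed");
        }
        if (value == null || value.isNaN()) {
            return 0; // NaN handling is an assumption of this sketch
        }
        double v = value;
        for (int i = 0; i < sortedBreaks.length; i++) {
            if (v == sortedBreaks[i]) {
                if (upperBoundClosed) {
                    return i + 1; // boundary belongs to the group below it
                }
                if (lowerBoundClosed) {
                    return i + 2; // boundary belongs to the group above it
                }
                return 0;         // open bounds exclude the boundary value
            }
            if (v < sortedBreaks[i]) {
                return i + 1;
            }
        }
        return sortedBreaks.length + 1; // above the last break
    }

    /** Counts values per group; index 0 holds values outside every group. */
    public static long[] countGroups(Double[] values, double[] breaks,
                                     boolean lowerBoundClosed, boolean upperBoundClosed) {
        double[] sorted = breaks.clone();
        Arrays.sort(sorted);
        long[] counts = new long[sorted.length + 2];
        for (Double v : values) {
            counts[rangeGroup(v, sorted, lowerBoundClosed, upperBoundClosed)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] breaks = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        Double[] petalLengths = {0.5, 1.0, 1.4, null, 6.5};
        System.out.println(Arrays.toString(countGroups(petalLengths, breaks, false, true)));
    }
}
```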

Code Example

In this example, we use the CountRanges operator to count values of the Iris data set and bin the values of the "petal length" field.

Using the CountRanges operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("CountRangesIris");

// Settings for this example (assumed values, matching the RushScript example)
String fieldName = "petal length";
List<Double> breaks = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
boolean lowerClosed = false;
boolean upperClosed = true;
String outputPath = "results/iris-CountRanges.txt";

// Create a delimited text reader for the Iris data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/iris.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
RecordTokenType irisType = record(
        DOUBLE("sepal length"),
        DOUBLE("sepal width"),
        DOUBLE("petal length"),
        DOUBLE("petal width"),
        STRING("class"));
reader.setSchema(TextRecord.convert(irisType));

// To ensure sort order is preserved
AssertSorted assertSort = graph.add(new AssertSorted());
assertSort.setOrdering("petal length");
graph.connect(reader.getOutput(), assertSort.getInput());

// Initialize the CountRanges operator
CountRanges countRanges = graph.add(new CountRanges());
countRanges.setFieldName(fieldName);
countRanges.setBreaks(breaks);
countRanges.setLowerBoundClosed(lowerClosed);
countRanges.setUpperBoundClosed(upperClosed);

// Connect the sorted data to CountRanges
graph.connect(assertSort.getOutput(), countRanges.getInput());

// Write the data with the range group field
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();

// Write the range group counts to a second file
String statsOutputPath = outputPath.replace(".txt", "-stats.txt");
WriteDelimitedText statswriter = graph.add(new WriteDelimitedText(statsOutputPath, WriteMode.OVERWRITE));
statswriter.setFieldDelimiter("");
statswriter.setHeader(true);
statswriter.disableParallelism();

// Connect CountRanges to the writers
graph.connect(countRanges.getOutput(), writer.getInput());
graph.connect(countRanges.getStatsOutput(), statswriter.getInput());

// Compile and run the graph
graph.run();

Using the CountRanges operator in RushScript

var breaks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

var results = dr.countRanges(data, {fieldName:"petal length", upperBoundClosed:true, breaks:breaks})

Properties

The CountRanges operator provides the following properties.

Name	Type	Description
fieldName	String	The name of the field whose values will be divided into ranges.
breaks	List	The values that will be used as the boundaries for the ranges. These should be of the same type as the selected field.
lowerBoundClosed	boolean	Whether the lower boundary of a range is included in the group. Default is false.
upperBoundClosed	boolean	Whether the upper boundary of a range is included in the group. Default is false.

Ports

The CountRanges operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The CountRanges operator provides two output ports.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The original data with the additional range group field.
statsOutput	RecordPort	getStatsOutput()	The count data for the defined ranges.

EqualRangeBinning Operator

The EqualRangeBinning operator assigns each value of a numeric field to one of a set of equally sized bins that span a given total range of values. The output adds a single integer field containing the bin that the value falls within, ranging from 1 to n, where n is the number of bins defined by the user.

The desired number of bins must be specified; the lower bound, the upper bound, or both may be omitted. If a bound is not defined, the operator determines an appropriate value for it at run time based on the minimum and maximum values discovered in the data. Any null value or value outside the inclusion range defined by the bounds is considered an outlier and is either filtered from the data or, optionally, placed in bin 0.

Additionally, the lower and upper bound of each bin can optionally be output with the bin values as two additional fields, so that the range of each bin is available. The range of each bin consists of an open lower bound and a closed upper bound, except for the maximum bin, which also has an open upper bound. Outlier values, if included, are always placed in bin 0.
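
The bin arithmetic can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the operator's implementation: `EqualRangeBinningSketch` is a hypothetical name, both bounds are given explicitly, and the boundary handling follows a literal reading of the description above (open lower bounds, closed upper bounds, open upper bound on the last bin), which the real operator may treat differently.

```java
import java.util.Arrays;

public class EqualRangeBinningSketch {

    /** Returns the 1-based bin of a value, or 0 for nulls and outliers. */
    public static int bin(Double value, double lower, double upper, int binCount) {
        if (value == null || value.isNaN()) {
            return 0;
        }
        double v = value;
        // Outside the inclusion range (bounds excluded per the stated reading)
        if (v <= lower || v >= upper) {
            return 0;
        }
        double width = (upper - lower) / binCount;
        // ceil gives an open lower bound and a closed upper bound per bin
        int bin = (int) Math.ceil((v - lower) / width);
        // Guard against floating-point drift at the edges
        return Math.min(Math.max(bin, 1), binCount);
    }

    /** Returns {lowerEdge, upperEdge} of the given 1-based bin. */
    public static double[] range(int bin, double lower, double upper, int binCount) {
        double width = (upper - lower) / binCount;
        return new double[] {lower + (bin - 1) * width, lower + bin * width};
    }

    public static void main(String[] args) {
        // Four bins between 5.0 and 7.0, so each bin is 0.5 wide
        for (double sepalLength : new double[] {4.0, 5.25, 5.5, 5.75, 6.9}) {
            int b = bin(sepalLength, 5.0, 7.0, 4);
            String range = b == 0 ? "outlier"
                    : Arrays.toString(range(b, 5.0, 7.0, 4));
            System.out.println(sepalLength + " -> bin " + b + " " + range);
        }
    }
}
```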

Code Example

In this example we are binning one of the lengths available in the iris data set.

Using the EqualRangeBinning operator in Java

// Create the binning operator and add it to the graph
EqualRangeBinning equalRangeBinner = graph.add(new EqualRangeBinning("sepal length", 4));
equalRangeBinner.setLowerBound(5.0);
equalRangeBinner.setUpperBound(7.0);

Using the EqualRangeBinning operator in RushScript

var equalRangeBinner = dr.equalRangeBinning(data, {fieldName:'sepal length', binCount:4, lowerBound:5.0, upperBound:7.0});

Properties

The EqualRangeBinning operator provides the following properties.

Name	Type	Description
fieldName	String	The name of the numeric field to bin.
binCount	int	The number of equally ranged bins to use.
lowerBound	numeric	The lower bound on the first bin.
upperBound	numeric	The upper bound on the last bin.
includeOutlier	boolean	Whether the output includes or filters outliers.
includeRanges	boolean	Whether the output explicitly includes the ranges on each bin.

Ports

The EqualRangeBinning operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data including the field to bin.

The EqualRangeBinning operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The data with included bins. Contains the full original input data unless outliers are filtered as well as new fields including the bin and optionally ranges.

MostFrequentValues Operator

The MostFrequentValues operator is used to determine which values are the most frequent within the selected fields of the input data. A maximum must be specified to indicate how many of the most common values should be output.

The output contains two fields for each field selected from the input: a value field holding the most frequent values of the input field, and an associated field holding the frequency count of each value.
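
The selection of the most frequent values for one field can be illustrated with a plain-Java sketch. It does not use the DataFlow API: `MostFrequentSketch` is a hypothetical name, null values are skipped, and ties are broken by the natural order of the values, both of which are assumptions made to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MostFrequentSketch {

    /** Returns up to showTopHowMany (value, count) pairs, most frequent first. */
    public static <T extends Comparable<T>> List<Map.Entry<T, Long>> topValues(
            List<T> values, int showTopHowMany) {
        // Count the occurrences of each distinct value
        Map<T, Long> counts = new HashMap<>();
        for (T value : values) {
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        // Order by descending count, then by value to break ties
        List<Map.Entry<T, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Keep only the requested number of values
        int limit = Math.min(showTopHowMany, entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Double> sepalLengths = Arrays.asList(5.1, 4.9, 5.1, 5.0, 5.1, 4.9);
        for (Map.Entry<Double, Long> e : topValues(sepalLengths, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```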

Code Example

In this example we use the MostFrequentValues operator to find the top 5 values in each of the numeric fields in the Iris data set.

Using the MostFrequentValues operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("mostFreqValuesIris");

// Settings for this example (assumed values; outputPath is a hypothetical location)
String[] fieldNames = {"sepal length", "sepal width", "petal length", "petal width"};
int topNum = 5;
String outputPath = "results/iris-mostFrequentValues.txt";

// Create a delimited text reader for the Iris data.
// getResourcePath is a helper assumed to be defined by the enclosing class.
ReadDelimitedText reader = graph.add(new ReadDelimitedText(getResourcePath("iris.txt")));
reader.setFieldSeparator(" ");
reader.setHeader(true);
RecordTokenType irisType = record(
        DOUBLE("sepal length"),
        DOUBLE("sepal width"),
        DOUBLE("petal length"),
        DOUBLE("petal width"),
        STRING("class"));
reader.setSchema(TextRecord.convert(irisType));

// Initialize the MostFrequentValues operator
MostFrequentValues mostFreq = graph.add(new MostFrequentValues());
if (fieldNames.length > 0) {
    mostFreq.setFieldNames(fieldNames);
}
if (topNum >= 0) {
    mostFreq.setShowTopHowMany(topNum);
}

// Connect the reader to MostFrequentValues
graph.connect(reader.getOutput(), mostFreq.getInput());

// Write the most frequent values and their counts
WriteDelimitedText writer = graph.add(new WriteDelimitedText(outputPath, WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.disableParallelism();
writer.setSaveMetadata(false);

// Connect MostFrequentValues to the writer
graph.connect(mostFreq.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();

Using the MostFrequentValues operator in RushScript

var freqFields = ["sepal length", "sepal width", "petal length", "petal width"];

var results = dr.mostFrequentValues(data, {fieldNames:freqFields, showTopHowMany:5});

Properties

The MostFrequentValues operator provides the following properties.

Name	Type	Description
fieldNames	List<String>	The names of the input fields for which to calculate value frequencies.
showTopHowMany	int	The max number of value frequencies to calculate. The default is 25.
fewDistinctValuesHint	boolean	A hint as to whether there are expected to be a small number of distinct values.

Ports

The MostFrequentValues operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data.

The MostFrequentValues operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output consisting of two fields for each selected input field: the most frequent values and their counts.

Last modified date: 03/10/2025