DF 8.2 | Data Processing Operator

Building DataFlow Applications > Building DataFlow Applications > Building DataFlow Applications in Java > DataFlow Operator Library > Data Processing Operator

Was this helpful?

Data Processing Operator

ProcessByGroup Operator

The ProcessByGroup operator segregates data into groups by the distinct values of the provided key fields. Each data group is streamed into another Dataflow application spawned by the operator. The spawned application of the operator is composed using the provided RushScript script. The RushScript is executed for each distinct key group and is passed a context object containing information about the current data group. The graph composed by RushScript is run and streamed the data for the data group.

The application composed by the provided RushScript is provided a result set containing the data of the current data group. The application can read other data sources as needed. There are no constraints placed on the composed application. The composed application can use the full breadth of Dataflow operators.

By default the ProcessByGroup operator will evaluate the given RushScript text or source file. The code contained within the source will compose the Dataflow application to execute. Optionally, a function name can be provided. This function will be invoked after all sources are evaluated. The function can provide further composition of the application. If a file name is provided to the RushScript source file, the name must resolve to a valid file name or be found in the include directories configured. The include directories can only be configured when composing a DataFlow application using RushScript or can be created and passed in as script options.

The ProcessByGroup operator may itself be composed by RushScript. In this case, the environment of the composing RushScript will be inherited by the spawned RushScript engines for each group. The inherited environment includes the following items:

• Include directories

• Whether script extensions are enabled or disabled

• Strictness mode

Context information is passed to the RushScript script in the form of a variable named "context". The name of the variable is controlled by the contextVariableName property. The context variable contains the following data members:

Variable name	JavaScript Type	Description
source	ResultSet	Contains the ResultSet that can be used to access the data of the current data group. This variable can be passed as a result set to any operators composed within RushScript. See the code examples for how the result set can be used.
keys	Array (String)	Contains the current set of distinct key values. These values will differ for each key group. These are provided as a convenience to the RushScript as the key values may be used to affect downstream composition.
partitionID	Number	The ProcessByGroup operator supports parallel execution. The partitionID variable uniquely identifies which partition the current data group is being processed within. It can be used in conjunction with the groupID variable to uniquely identify the current data grouping. For example, the partitionID and groupID can be used within a file name to create a unique output file (or file set) for each data group.
partitionCount	Number	Provides the total number of parallel data streams being created for the job containing this ProcessByGroup operator.
groupID	Number	Uniquely defines a data group within the parallel data stream processing the data group. When used in conjunction with the partitionID, it uniquely defines the data group for the application.

WARNING! It may be tempting to embed the provided key values in items such as file names. However, care should be taken. The key values may contain characters that are not valid for file names or other purposes. To generate a unique file name for each key group, the context.partitionID and context.groupID variables can be used. See the following examples.

Code Examples

Let’s say we want to train a classification model for each unique value of a field within our input data. We can use the ProcessByGroup operator to segregate the data by distinct values of this field and compose and execute an application to build the classification model for each data group. We can then later use the multiple decision tree models in an ensemble model to apply to data.

Here’s a snippet of data that we will want to process. The file is in ARFF format. We will segregate the data by the "outlook" field and build a model using the "windy" field as the target variable. Since we are segregating the data by the "outlook" field, we will expect three models to be built, one for each distinct value of "outlook" ("sunny", "overcast", and "rainy").

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The RushScript application that we want to execute follows. Note that the input data is provided in the "context.source" variable.

RushScript script to execute

// Remove the "outlook field". Do not want to include it in the model.
// Note the use of context.source as the input data set to the removeFields function.
var data = dr.removeFields(context.source, {fieldNames:'outlook'});

// Train a decision tree model using the "windy" field as the target variable.
var results = dr.decisionTreeLearner(data, {targetColumn:'windy'});

// Store the PMML file containing the model for this data group.
// Note the usage of a passed in variable (outputDir) and the creation of a unique file name
dr.writePMML(results, {targetPathName:outputDir + '/windy-model_' + context.partitionID + '_' + context.groupID + '.pmml'});

The following code demonstrates how to invoke the ProcessByGroup operator on the source data using the above script.

Using the ProcessByGroup operator in Java

// Create the operator in a graph
ProcessByGroup pbg = graph.add(new ProcessByGroup());
pbg.setKeys("outlook");

// Use the script that trains a decision tree model
pbg.setScriptFile(new File("decisiontree.js"));

// Set a variable that will be available in the RushScript environment
String targetDir = ... // where the output models should be written
pbg.addVariable("outputDir", targetDir);

Using the ProcessByGroup operator in RushScript

// Want to specify the output directory programatically
var targetDir = ...;

// Invoke the process by group operator function. It has no return value.
// The "data" variable is assumed to result from a previous operator function call.
dr.processByGroup(data, {keys:'outlook', variables:{outputDir:targetDir}, scriptFile:'decisiontree.js'});

Properties

The ProcessByGroup operator has the following properties.

Name	Type	Description
contextVariableName	String	(Optional) The name of the variable injected into the RushScript environment that contains context of the current sub-graph composition and execution. The default value is "context".
functionName	String	The name of a JavaScript function to invoke after the JavaScript source has been evaluated. This function can provide (additional) composition of the Dataflow application. This is useful when a single source file has many functions that provide composition of different applications. In this way, the same script file can be used to compose and run different applications. This is an optional property.
keys	String[] or List<String>	The list of field names used to define key groups. At least one field name must be provided.
script	String	The text of the RushScript code executed for each distinct key group. This RushScript code is used to compose the application that processes each data group. This property is required. A convenience method named setScriptFile() is provided that accepts a java.io.File object. When used, the script text will be read from the provided file.
scriptFileName	String	The pathname to the source file containing the RushScript code to execute for each distinct key group. If the file name is not an absolute path and not found in the current directory, the file will be searched for within the configured include directories. The include directories are only valid if the application containing this operator was composed using RushScript. If the file is not found in the include directories, an exception is issued. The file must exist on the machine where the application is composed.
scriptOptions	ScriptOptions	The scripting command line options applicable to operators that may spawn a script itself, such as the ProcessByGroup operator. This property is used internally by the scripting framework and usually does not have to be used by end users. It can be used to artificially set up the scripting options wanted to be passed on to scripting environments set up by the ProcessByGroup operator.
variables	Map<String, Object>	(Optional) A map of variable names to their values. The variable names must be valid JavaScript variable names. These variables will be injected into the RushScript environment during the composition phase of each sub-graph executed for data groups. A convenience method named addVariable() can be used to set a single variable.

Ports

The ProcessByGroup operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data

The ProcessByGroup operator has no output ports.

Last modified date: 03/10/2025