DF 8.2 | Run R Script Operator

Run R Script Operator

RunRScript Operator

The RunRScript operator is used to execute a snippet of R code within the context of a DataFlow operator. The R script is executed for all data contained within the given input flow. The output of the R script is pushed as the output of the operator.

The R environment must be installed on the execution platform. If this operator is used in a distributed environment, R must be installed on every worker node of the cluster, and the installation path must be the same on all nodes. Visit the CRAN (Comprehensive R Archive Network) website for download links and installation instructions.

By default, the RunRScript operator executes in parallel. This implies that the given R script has no data dependencies and can run independently on segmented data and produce the correct results. If this is not the case, disable parallel execution for the operator.
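
The difference between a script that is safe to run on segmented data and one that is not can be illustrated outside of DataFlow. The following is a minimal sketch in plain Java, with no DataFlow or R calls; the class and method names are hypothetical, and the two transforms merely stand in for R scripts. Doubling each value depends only on the row itself, so processing two segments separately gives the same result as processing all rows at once. Subtracting the mean depends on every row, so the segmented result differs.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

// Illustrative only: plain Java stands in for the R script so that the
// effect of segmenting the input data is easy to see.
final class PartitionDependenceDemo {

    // Row-wise transform: each output value depends only on its own input value.
    static double[] doubled(double[] x) {
        return Arrays.stream(x).map(v -> v * 2).toArray();
    }

    // Whole-data transform: every output value depends on the mean of all inputs.
    static double[] centered(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        return Arrays.stream(x).map(v -> v - mean).toArray();
    }

    // Applies a transform to two segments separately and concatenates the results,
    // mimicking two replicas that each see only part of the input.
    static double[] perSegment(UnaryOperator<double[]> f, double[] a, double[] b) {
        double[] ra = f.apply(a);
        double[] rb = f.apply(b);
        double[] out = Arrays.copyOf(ra, ra.length + rb.length);
        System.arraycopy(rb, 0, out, ra.length, rb.length);
        return out;
    }

    public static void main(String[] args) {
        double[] all = {1, 2, 3, 4};
        double[] left = {1, 2};
        double[] right = {3, 4};

        // Doubling is partition-safe: prints "true"
        System.out.println(Arrays.equals(
                doubled(all), perSegment(PartitionDependenceDemo::doubled, left, right)));

        // Centering is not partition-safe: prints "false"
        System.out.println(Arrays.equals(
                centered(all), perSegment(PartitionDependenceDemo::centered, left, right)));
    }
}
```

A script of the second kind is a candidate for disabling parallel execution.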

Note that the operator gathers all of its input data to hand off to R. The R engine then loads the data into memory in preparation for executing the script. The input data must therefore fit into memory. In a parallel or distributed environment, the data is segmented, and each segment of data is processed in a separate R engine. In this case, each data segment must fit into the memory of its executing engine.

The R script is handed an R data frame in a variable named "R". The script can use the data frame as desired. The resultant data is gathered from a data frame of the same name. Two variables are set within the R environment to support parallel execution:

partitionID

Specifies the zero-based identifier of the partition in which the current instance is running.

partitionCount

Specifies the total number of data partitions executing for this graph. For the scripting operator, this equates to the total number of instances of the operator replicated for execution.

These variables are numeric R variables and can be accessed directly within the user-provided R script.
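
As an illustration of how a script might use these variables, the following sketch builds R snippet text in Java. The helper class and the added column name partition are hypothetical. The sketch assumes that the output type given to the operator would need a matching field for the added column.

```java
// Hypothetical helper: builds R snippet text that reads the two partition variables.
final class PartitionAwareScript {

    // Returns R code that tags each row with the partition that processed it.
    static String snippet() {
        return String.join("\n",
                "# Tag every row with the identifier of the partition that processed it",
                "R$partition <- rep(partitionID, nrow(R))",
                "# Only the first partition (identifier 0) reports the partition count",
                "if (partitionID == 0) {",
                "    message('running with ', partitionCount, ' partitions')",
                "}");
    }

    public static void main(String[] args) {
        // Print the snippet so it can be inspected before handing it to the operator
        System.out.println(snippet());
    }
}
```

The resulting string could then be passed to setScriptSnippet in the same way as the script text in the code examples later in this topic.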

The operator performs the following sequence of operations:

1. Data from the input port is cached in a local disk file.

2. The given R script is wrapped with statements to load the data from the file and store the results to another file.

3. The R script is executed using the Rscript executable.

4. The resultant data file is parsed and the data is pushed to the output of the operator.

5. The temporary input/output files are deleted.
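
The steps above can be sketched roughly as follows. This is not the operator's actual implementation: the class and method names are hypothetical, and the use of CSV files with read.csv and write.csv is an assumption made for illustration, since the file format the operator uses is not described here. The sketch prepares the files and the command line but does not launch R.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the wrap-and-run sequence; not the operator's real code.
final class RScriptWrapperSketch {

    // Step 2: surround the user snippet with statements that load the input file
    // into the data frame R and store R to the output file afterwards.
    // CSV with read.csv/write.csv is an assumed format.
    static String wrap(String snippet, Path in, Path out) {
        return "R <- read.csv(" + quote(in) + ")\n"
                + snippet + "\n"
                + "write.csv(R, " + quote(out) + ", row.names = FALSE)\n";
    }

    // Renders a path as an R string literal, using forward slashes.
    static String quote(Path p) {
        return "\"" + p.toString().replace("\\", "/").replace("\"", "\\\"") + "\"";
    }

    // Step 3: the command line that runs the wrapped script with Rscript.
    static List<String> command(String pathToRScript, Path scriptFile) {
        return List.of(pathToRScript, scriptFile.toString());
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("rin", ".csv");
        Path out = Files.createTempFile("rout", ".csv");
        Path script = Files.createTempFile("snippet", ".R");
        try {
            // Step 1: cache the input data in a local file
            Files.write(in, List.of("sepal.length", "5.1", "4.9"), StandardCharsets.UTF_8);

            // Step 2: write the wrapped script
            String wrapped = wrap("R <- data.frame(result = R$sepal.length * 2)", in, out);
            Files.write(script, wrapped.getBytes(StandardCharsets.UTF_8));

            // Step 3: a real run would start this command, for example with
            // new ProcessBuilder(command(...)).inheritIO().start().waitFor()
            System.out.println(String.join(" ", command("/usr/bin/Rscript", script)));

            // Step 4: a real run would now parse the output file and push its rows downstream
        } finally {
            // Step 5: delete the temporary files
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
            Files.deleteIfExists(script);
        }
    }
}
```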

Code Examples

The following code example demonstrates using the RunRScript operator. First, a set of R statements is defined. Next, the RunRScript operator is created and the required properties are set. Code to connect the input and output ports of the operator is not shown.

Note how the R script uses the data frame in the variable named "R" and sets the data to be output in a data frame of the same name.

Using RunRScript in Java

// Define a snippet of R code to apply to the input data
String scriptText =
            "myfunction <- function(x) { " +
            "    return (x * 2);" +
            "}\n" +
            "tmpVal <- sapply(R$sepal.length, myfunction);" +
            "R <- data.frame(tmpVal);";

// Create the script operator and set the required properties.
RunRScript script = app.add(new RunRScript());
script.setPathToRScript("/usr/bin/Rscript");
script.setOutputType(record(DOUBLE("result")));
script.setScriptSnippet(scriptText);

Using RunRScript in RushScript

var scriptText =
            'myfunction <- function(x) { ' +
            '    return (x * 2);' +
            '}\n' +
            'tmpVal <- sapply(R$sepal.length, myfunction);' +
            'R <- data.frame(tmpVal);';

var script = dr.runRScript(
    data, {
        pathToRScript:'/usr/bin/Rscript',
        scriptSnippet:scriptText,
        outputType:dr.schema().DOUBLE('result')} );

Properties

The RunRScript operator supports the following properties.

Name	Type	Description
charset	String	Sets the character set used to format the input and output data exchanged with R. Default: UTF-8.
outputType	RecordTokenType	The record type (schema) of the data written to the output port. It describes the fields of the data frame that the script leaves in the variable R.
pathToRScript	String	Fully qualified file path to the Rscript executable in the local R installation. This executable is usually found within the bin directory of the R installation.
scriptSnippet	String	The snippet of R code to execute within the R environment.
requiredDataDistribution	DataDistribution	Sets the required data distribution of the input data port of this operator. The R scripting operator has no knowledge of the processing being done by its script. As such, it cannot set its metadata for input or output. Exposing the data distribution allows setting the required distribution for the input port. For example, if the script is using vertical partitioning, the required data distribution can be set to FullDataDistribution to ensure each replica of the operator sees all of the input data.

Ports

The RunRScript operator supports the following input ports:

Name	Type	Get method	Description
input	RecordPort	getInput()	The input record data to apply the R script against.

The RunRScript operator supports the following output ports:

Name	Type	Get method	Description
output	RecordPort	getOutput()	Data resulting from the application of the given R script.

Last modified date: 03/10/2025