Run R Script Operator
RunRScript Operator
The RunRScript operator executes a snippet of R code within the context of a DataFlow operator. The R script is executed for all data contained within the given input flow. The output of the R script is pushed as the output of the operator.
The R environment must be installed on the execution platform. If this operator is used in a distributed environment, R must be installed on every worker node of the cluster, and the installation path must be the same on all nodes. Visit the CRAN (Comprehensive R Archive Network) website for download links and installation instructions.
By default, the RunRScript operator executes in parallel. This implies that the given R script has no data dependencies and can run independently on segmented data and produce the correct results. If this is not the case, disable parallel execution for the operator.
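To make the data-dependency condition concrete, the following self-contained Java sketch compares two transforms applied to the whole data set and to two segments processed independently. It does not use the DataFlow API; the class and method names are hypothetical and the arrays stand in for data segments.

```java
import java.util.Arrays;

public class PartitionSafetyDemo {
    // Row-wise transform: each output value depends only on its own input value.
    static double[] doubleValues(double[] segment) {
        return Arrays.stream(segment).map(x -> x * 2).toArray();
    }

    // Transform with a data dependency: every output depends on the mean
    // of all values in the segment it is given.
    static double[] centerOnMean(double[] segment) {
        double mean = Arrays.stream(segment).average().orElse(0.0);
        return Arrays.stream(segment).map(x -> x - mean).toArray();
    }

    // Concatenates two per-segment results, as if two engines ran independently.
    static double[] concat(double[] a, double[] b) {
        double[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        double[] all = {1.0, 2.0, 3.0, 4.0};
        double[] first = {1.0, 2.0};
        double[] second = {3.0, 4.0};

        // Doubling gives the same result whether or not the data is segmented.
        System.out.println(Arrays.equals(
                doubleValues(all),
                concat(doubleValues(first), doubleValues(second))));

        // Centering does not: each segment subtracts its own mean.
        System.out.println(Arrays.equals(
                centerOnMean(all),
                concat(centerOnMean(first), centerOnMean(second))));
    }
}
```

A script shaped like the first transform is a candidate for parallel execution. A script shaped like the second needs parallel execution disabled to produce the same result as a single run over all the data.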
Note that the operator gathers all of its input data to hand off to R. The R engine then loads the data into memory in preparation for executing the script. The input data must therefore fit into memory. In a parallel or distributed environment, the data is segmented, and each segment of data is processed in a separate R engine. In this case, each data segment must fit into the memory of its executing engine.
The R script receives its input as an R data frame in a variable named "R" and can use that data frame as desired. When the script finishes, the operator gathers the output from a data frame of the same name, so a script that produces new output assigns it back to "R". Two variables are set within the R environment to support parallel execution:
partitionID
Specifies the zero-based identifier of the partition the current instance is running in.
partitionCount
Specifies the total number of data partitions executing for this graph. For the scripting operator, this equates to the total number of instances of the operator replicated for execution.
These variables are numeric R variables and can be accessed directly within the user-provided R script.
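As an illustration of reading these variables, the following Java sketch builds a script snippet that tags every row with the partition it was processed in. The class name, the method name, and the added "partition" column are hypothetical. The operator's output type would also need a matching field, which is not shown here.

```java
public class PartitionTagScript {
    // Builds R code that appends a column holding the current partition ID.
    // rep(..., nrow(R)) sizes the value to the row count of the data frame.
    // The "partition" column name is an assumption made for this example.
    static String build() {
        return "R$partition <- rep(partitionID, nrow(R))\n";
    }

    public static void main(String[] args) {
        // Print the snippet that would be passed to setScriptSnippet.
        System.out.print(build());
    }
}
```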
The sequence of operations is as follows:
1. Data from the input port is cached in a local disk file.
2. The given R script is wrapped with statements to load the data from the file and store the results to another file.
3. The R script is executed using the Rscript executable.
4. The resultant data file is parsed and the data is pushed to the output of the operator.
5. The temporary input/output files are deleted.
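Steps 2 and 3 above can be pictured with a minimal Java sketch. This is not the operator's actual implementation: the CSV file format, the read.csv and write.csv statements, and all class, method, and file names are assumptions chosen for illustration. Paths are assumed to use forward slashes so that no escaping is needed inside the R string literals.

```java
import java.util.Arrays;
import java.util.List;

public class ScriptWrapperSketch {
    // Wraps the user script with a load statement before it and a store
    // statement after it. The CSV format is an assumption for this sketch.
    static String wrap(String userScript, String inputPath, String outputPath) {
        return "R <- read.csv(\"" + inputPath + "\")\n"
             + userScript + "\n"
             + "write.csv(R, \"" + outputPath + "\", row.names = FALSE)\n";
    }

    // Builds the command line that runs the wrapped script file with Rscript.
    static List<String> command(String pathToRScript, String scriptFile) {
        return Arrays.asList(pathToRScript, scriptFile);
    }

    public static void main(String[] args) {
        String wrapped = wrap("R <- data.frame(result = R$x * 2)",
                "/tmp/input.csv", "/tmp/output.csv");
        System.out.print(wrapped);
        System.out.println(command("/usr/bin/Rscript", "/tmp/wrapped.R"));
    }
}
```

The command list could be passed to java.lang.ProcessBuilder to launch the R engine. The sketch stops short of doing so, so it runs without an R installation.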
Code Examples
The following code example demonstrates using the RunRScript operator. First, a set of R statements is defined. Next, the RunRScript operator is created and the required properties are set. Code to connect the input and output ports of the operator is not shown.
Note how the R script uses the data frame in the variable named "R" and sets the data to be output in a data frame of the same name.
Using RunRScript in Java
// Define a snippet of R code to apply to the input data
String scriptText =
    "myfunction <- function(x) { " +
    " return (x * 2);" +
    "}\n" +
    "tmpVal <- sapply(R$sepal.length, myfunction);" +
    "R <- data.frame(tmpVal);";

// Create the script operator and set the required properties.
RunRScript script = app.add(new RunRScript());
script.setPathToRScript("/usr/bin/Rscript");
script.setOutputType(record(DOUBLE("result")));
script.setScriptSnippet(scriptText);
Using RunRScript in RushScript
var scriptText =
    'myfunction <- function(x) { ' +
    ' return (x * 2);' +
    '}\n' +
    'tmpVal <- sapply(R$sepal.length, myfunction);' +
    'R <- data.frame(tmpVal);';

var script = dr.runRScript(
    data, {
        pathToRScript:'/usr/bin/Rscript',
        scriptSnippet:scriptText,
        outputType:dr.schema().DOUBLE('result')} );
Properties
The RunRScript operator supports the following properties.
Ports
The RunRScript operator supports the following input ports:
The RunRScript operator supports the following output ports:
Last modified date: 03/10/2025