RushScript Variables
Using Special Variables
The RushScript environment includes several special variables that can be used during composition and execution of DataFlow applications.
dr Variable
The dr variable contains methods for creating and composing DataFlow operators. It also contains many convenience methods for different tasks.
Operators
Methods for composing DataFlow operators take the following parameters:
A variable number of parameters for input data. The number of parameters depends on the number of input ports on the operator. Optional input ports can be omitted.
The last parameter contains the properties to set on the operator, specified in JavaScript map notation: {name:value, ...}. See the reference for each operator for a description of the properties that can be set.
The methods for composing DataFlow operators are described in the following tables. The methods are listed by category in alphabetical order.
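The calling convention described above (input data first, property map last) can be sketched as follows. This is an illustration only: stubDr is a hypothetical stand-in for the dr variable so that the snippet runs outside RushScript, and the property names used here (source, header, predicate, target) are assumptions; check each operator's reference for its actual properties.

```javascript
// Hypothetical stand-in for the RushScript `dr` variable. Inside RushScript,
// `dr` is supplied by the environment and this stub is not needed.
function makeStubDr() {
  function compose(operatorName) {
    return function () {
      var args = Array.prototype.slice.call(arguments);
      // By convention, the last argument is the property map;
      // everything before it is input data.
      var properties = args.pop();
      return { operator: operatorName, inputs: args, properties: properties };
    };
  }
  return {
    readDelimitedText: compose("readDelimitedText"),
    filterRows: compose("filterRows"),
    writeDelimitedText: compose("writeDelimitedText")
  };
}

var stubDr = makeStubDr();

// A reader has no input ports, so the property map is the only argument.
// (Property names are assumptions, not taken from the operator reference.)
var ratings = stubDr.readDelimitedText({ source: "ratings.txt", header: true });

// One input port: the input data comes first, the property map last.
var highRatings = stubDr.filterRows(ratings, { predicate: "rating >= 4" });

// A writer also takes its input first and its properties last.
var sink = stubDr.writeDelimitedText(highRatings, { target: "high-ratings.txt" });
```

In a real RushScript file the same three calls would be made on dr itself, with each returned value passed as input to the next operator.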
I/O Operators
 
Function Name
Operator Link
deleteFromJDBC
forceRecordStaging
loadMatrix
parseTextFields
readARFF
readAvro
readDelimitedText
readFixedText
readFromJDBC
readLog
read--
readPMML
readSource
readStagingDataset
updateInJDBC
writeARFF
writeAvro
writeDelimitedText
writeFixedText
write--
writePMML
writeSink
writeStagingDataset
writeToJDBC
 
Partitioning Data
 
Function Name
Operator Link
gatherHint
partitionHint
 
Performing SQL-like Operations
 
Function Name
Operator Link
crossJoin
filterExistingRows
filterRows
group
join
limitRows
processByGroup
sampleRandomRows
sort
unionAll
 
Manipulating Records
 
Function Name
Operator Link
columnsToRows
deriveFields
discoverEnums
mergeFields
remapFields
removeFields
retainFields
rowsToColumns
selectFields
splitFields
 
Data Cleansing
 
Function Name
Operator Link
removeDuplicates
replaceMissingValues
 
Data Matching
 
Function Name
Operator Link
analyzeDuplicates
analyzeLinks
clusterDuplicates
clusterLinks
discoverDuplicates
discoverLinks
 
Data Analysis
Association Rule Mining
 
Function Name
Operator Link
convertARMModel
fpGrowth
frequentItems
 
Clustering
 
Function Name
Operator Link
kmeans
 
Predictive Analytics
 
Function Name
Operator Link
decisionTreeLearner
decisionTreePredictor
decisionTreePruner
knnClassifier
naiveBayesLearner
naiveBayesPredictor
svmLearner
svmPredictor
 
Statistics and Profiling
 
Function Name
Operator Link
dataQualityAnalyzer
discoverDomain
distinctValues
linearRegressionLearner
logisticRegressionLearner
logisticRegressionPredictor
normalizeValues
rank
regressionPredictor
runRScript
runScript
sumOfSquares
summaryStatistics
 
Text Processing
 
Function Name
Operator Link
calculateNGramFrequency
calculateWordFrequency
convertTextCase
countTokens
dictionaryFilter
expandTextTokens
filterText
generateBagOfWords
textFrequencyFilter
textStemmer
textTokenizer
 
Asserting Conditions
 
Function Name
Operator Link
assertEqual
assertEqualHash
assertEqualTypes
assertPredicate
assertRowCount
assertSorted
 
Capturing Data
 
Function Name
Operator Link
collectRecords
getModel
emitRecords
putModel
logRows
Generating Data
 
Function Name
Operator Link
generateArithmeticSequence
generateConstant
generateRandom
generateRepeatingCycle
 
Additional Utility Methods
Other utility methods provided by the dr variable include (in alphabetical order):
 
Function Name
Input Parameters
Returns
Description
applicationName
String: application name
 
Sets the application name, which is used for any DataFlow graphs created.
batchSize
int: batch size (optional)
Previous batch size
Sets the ports.batchSize engine configuration to the specified value. Returns the previous value of the ports.batchSize setting. See Engine Configuration Settings for more information.
cluster
String: cluster host name
int: port number
Sets the cluster specification on the current engine configuration. The next execution invocation will execute on the defined cluster if it exists. See Engine Configuration Settings for more information. The returned Cluster Specifier object can be used to set additional run-time options.
defineOperator
String: operator name
String: fully qualified class name
 
Defines a custom operator in the scripting environment. The name of the operator must be unique and valid as a JavaScript function name (no spaces or special characters). The fully qualified class name should reference a valid Java class that implements the LogicalOperator interface. After an operator is defined, it can be used within the JavaScript environment: the operator name is added as a function on the dr variable.
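Because the operator name becomes a function on dr, it may help to check the name before calling defineOperator. The sketch below is one possible check, not part of RushScript: isValidOperatorName and defineCheckedOperator are hypothetical helpers, the check covers plain ASCII identifiers only, and it does not test for reserved words or for uniqueness.

```javascript
// Hypothetical helper: true when `name` is usable as a plain JavaScript
// function name (ASCII letters, digits, _ and $, not starting with a digit).
// Reserved words and uniqueness are not checked here.
function isValidOperatorName(name) {
  return typeof name === "string" && /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(name);
}

// Hypothetical wrapper: validates the name, then delegates to the
// defineOperator method of the object passed in (normally `dr`).
function defineCheckedOperator(drObject, name, className) {
  if (!isValidOperatorName(name)) {
    throw new Error("Invalid operator name: " + name);
  }
  drObject.defineOperator(name, className);
}

// Usage inside RushScript (the class name is a made-up example):
// defineCheckedOperator(dr, "myOperator", "com.example.MyOperator");
// After this call, the operator would be available as dr.myOperator(...).
```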
dumpFilePath
String: local path name (optional)
Previous path setting
Sets the dumpFilePath engine configuration to the specified value. Returns the previous value of the dumpFilePath setting. See Engine Configuration Settings for more information.
enabledModules
String: comma separated list of modules
 
Sets the modules to enable for the current engine configuration, given as a comma-separated list. For a list of the currently available modules, see moduleConfiguration in Engine Configuration Settings.
execute
String: application name (optional)
 
Compiles and executes the currently composed DataFlow graph.
extensionPaths
Strings: extension paths
String[] (previous extension path setting)
Sets the list of extension paths to use for job execution. This option is only valid when used for job execution on a cluster. The extension paths refer to directories in shared storage. The paths are intended to contain extensions to the DataFlow environment on a cluster. Files found in the extension paths will be copied to the current directory of the containers created to run a DataFlow job on nodes within a cluster. Files that are archives (see below) are added to the class path. These file extensions indicate a file is an archive:
.tar.gz
.tar
.zip
.jar
 
 
 
JAR files are copied as is into the local directory. Other archive types are extracted into the local directory, using a base directory with the same name as the archive file. All archives are added to the class path of the container. Non-archive files are copied to the local directory of the container but are not added to the class path.
Each of the paths must be contained in a shared, distributed storage system such as HDFS.
Extension paths are only supported when executing DataFlow jobs using YARN.
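The handling rules for files found in an extension path can be summarized in a small classification function. This is a sketch of the rules as stated above, not the DataFlow implementation: describeExtensionFile is a hypothetical name, and matching the file extensions case-sensitively is an assumption.

```javascript
// File extensions that mark a file as an archive, as listed above.
var ARCHIVE_SUFFIXES = [".tar.gz", ".tar", ".zip", ".jar"];

// ES5-compatible suffix test.
function endsWith(text, suffix) {
  return text.length >= suffix.length &&
    text.lastIndexOf(suffix) === text.length - suffix.length;
}

// Hypothetical helper: reports how a file in an extension path is handled.
// Extension matching is case-sensitive here, which is an assumption.
function describeExtensionFile(fileName) {
  var isArchive = false;
  for (var i = 0; i < ARCHIVE_SUFFIXES.length; i++) {
    if (endsWith(fileName, ARCHIVE_SUFFIXES[i])) {
      isArchive = true;
    }
  }
  if (!isArchive) {
    // Non-archive files are copied but stay off the class path.
    return { action: "copy", onClassPath: false };
  }
  if (endsWith(fileName, ".jar")) {
    // JAR files are copied as is and added to the class path.
    return { action: "copy", onClassPath: true };
  }
  // Other archives are extracted and added to the class path.
  return { action: "extract", onClassPath: true };
}
```

For example, describeExtensionFile("udfs.jar") reports a copy that is on the class path, while describeExtensionFile("lookup.csv") reports a copy that is not.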
include
String: JavaScript file to include
 
Evaluates the given JavaScript file in the current environment. Including other JavaScript source allows access to commonly used variables and functions. JavaScript files are located using the following search criteria:
The directory containing the RushScript file currently being evaluated.
The list of provided include files (see command line reference) is searched in order.
The current classpath is searched for the include file.
makeJoinKeys
String[]: left keys
String[]: right keys
JoinKey[]
Creates an array of JoinKey objects from the given arrays of left side field names and right side field names. The given arrays of field names should not be empty and should be equal in size. Use this function to build a set of join keys when the left side and right side key field names differ.
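The two constraints on the field name arrays (not empty, equal in size) can be checked before calling makeJoinKeys. The helper below is a hypothetical sketch, not part of RushScript, and the field names in the usage comment are made up for illustration.

```javascript
// Hypothetical helper: verifies the constraints stated for makeJoinKeys.
// Throws an Error describing the first violated constraint.
function checkJoinKeyNames(leftKeys, rightKeys) {
  if (!Array.isArray(leftKeys) || !Array.isArray(rightKeys)) {
    throw new Error("Both key lists must be arrays of field names");
  }
  if (leftKeys.length === 0 || rightKeys.length === 0) {
    throw new Error("Key lists must not be empty");
  }
  if (leftKeys.length !== rightKeys.length) {
    throw new Error("Key lists must be equal in size");
  }
  return true;
}

// Usage inside RushScript (field names are made-up examples):
// var leftKeys = ["customerId", "region"];
// var rightKeys = ["custId", "regionCode"];
// checkJoinKeyNames(leftKeys, rightKeys);
// var keys = dr.makeJoinKeys(leftKeys, rightKeys);
```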
maxMerge
int: maxMerge value (optional)
Previous maxMerge setting
Sets the join.maxMerge engine configuration setting. Returns the previous value of the join.maxMerge setting. See Engine Configuration Settings for more information.
maxRetries
int: maxRetries value (optional)
Previous maxRetries setting
Sets the maxRetries engine configuration setting. Returns the previous value of the maxRetries setting. See Engine Configuration Settings for more information.
minParallelism
int: minimumParallelism value (optional)
Previous minimumParallelism setting
Sets the minimumParallelism engine configuration setting. Returns the previous value of the minimumParallelism setting. See Engine Configuration Settings for more information.
monitored
boolean: monitored value (optional)
Previous monitored value
Sets the monitored engine configuration setting. Returns the previous value of the monitored setting. See Engine Configuration Settings for more information.
parallelism
int: parallelism value (optional)
Previous parallelism value
Sets the parallelism engine configuration setting. Returns the previous value of the parallelism setting. See Engine Configuration Settings for more information.
schedulerQueue
String: name of the scheduler queue to use when executing a job
String: previous scheduler queue name setting
Sets the name of the scheduler queue to use when scheduling jobs. The scheduler queue name is only valid when using a cluster for job execution. Currently, scheduler queue names are supported only when using YARN for job execution.
schema
 
TextRecordBuilder
Creates a new TextRecordBuilder instance. This object can be used to define a new schema or load a previously defined schema.
sizeByReaders
boolean: sizeByReaders value (optional)
Previous sizeByReaders value
Sets the ports.sizeByReaders engine configuration setting. Returns the previous value of the ports.sizeByReaders setting. See Engine Configuration Settings for more information.
sortBuffer
String: sortBuffer value (optional)
Previous sortBuffer value
Sets the sort.sortBuffer engine configuration setting. Returns the previous value of the sort.sortBuffer setting. This value is set using a text value to represent the size. Use "k", "m" and "g" suffixes to represent kilobytes, megabytes and gigabytes, respectively. See Engine Configuration Settings for more information.
sortIOBuffer
String: sortIOBuffer value (optional)
Previous sortIOBuffer value
Sets the sort.sortIOBuffer engine configuration setting. Returns the previous value of the sort.sortIOBuffer setting. This value is set using a text value to represent the size. Use "k", "m" and "g" suffixes to represent kilobytes, megabytes and gigabytes, respectively. See Engine Configuration Settings for more information.
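To make the size notation used by sortBuffer and sortIOBuffer concrete, the sketch below converts a size string to a byte count. sizeToBytes is a hypothetical helper, not a RushScript function. It assumes binary multiples (1 kilobyte = 1024 bytes) and lowercase suffixes, neither of which is stated in this section, and it rejects values without a suffix.

```javascript
// Assumption: binary multiples (1k = 1024 bytes). This section does not
// say whether the engine uses 1024 or 1000.
var SIZE_MULTIPLIERS = {
  k: 1024,
  m: 1024 * 1024,
  g: 1024 * 1024 * 1024
};

// Hypothetical helper: converts a size string such as "64m" to bytes.
// Only digits followed by a lowercase k, m, or g are accepted.
function sizeToBytes(text) {
  var match = /^(\d+)([kmg])$/.exec(text);
  if (!match) {
    throw new Error("Expected digits followed by k, m, or g: " + text);
  }
  return parseInt(match[1], 10) * SIZE_MULTIPLIERS[match[2]];
}

// Usage inside RushScript: the setting itself takes the text form.
// var previous = dr.sortBuffer("64m");
```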
spoolThreshold
int: spoolThreshold value (optional)
Previous spoolThreshold value
Sets the ports.spoolThreshold engine configuration setting. Returns the previous value of the ports.spoolThreshold setting. See Engine Configuration Settings for more information.
storageManagementPath
String: storageManagementPath value
Previous storageManagementPath value
Sets the storageManagementPath engine configuration setting. Returns the previous value of the storageManagementPath setting. See Engine Configuration Settings for more information.
writeAhead
int: writeAhead value (optional)
Previous writeAhead value
Sets the ports.writeAhead engine configuration setting. Returns the previous value of the ports.writeAhead setting. See Engine Configuration Settings for more information.
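Many of the methods above return the previous value of the setting they change, which makes it possible to apply a setting temporarily and then restore it. The sketch below shows one way to do that. withSetting is a hypothetical helper, and makeStubSettings is a stand-in for dr so that the snippet runs outside RushScript; the stub's behavior when called without an argument (return the current value and change nothing) is an assumption.

```javascript
// Hypothetical helper: applies a setting, runs `body`, then restores the
// previous value even if `body` throws.
function withSetting(drObject, settingName, value, body) {
  // The setter returns the previous value of the setting.
  var previous = drObject[settingName](value);
  try {
    return body();
  } finally {
    drObject[settingName](previous);
  }
}

// Stand-in for `dr` with a single setting. Assumption: calling the method
// with no argument returns the current value without changing it.
function makeStubSettings() {
  var values = { parallelism: 8 };
  return {
    parallelism: function (newValue) {
      var previous = values.parallelism;
      if (newValue !== undefined) {
        values.parallelism = newValue;
      }
      return previous;
    }
  };
}

var stubSettings = makeStubSettings();

// Inside the body the setting is 2; afterwards it is back to 8.
var seenInside = withSetting(stubSettings, "parallelism", 2, function () {
  return stubSettings.parallelism();
});
var seenAfter = stubSettings.parallelism();
```

Inside RushScript, the body function would typically compose a graph and call dr.execute(), with dr passed in place of the stub.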