RushScript Variables

Using Special Variables

The RushScript environment includes several special variables that can be used during composition and execution of DataFlow applications.

dr Variable

The dr variable provides methods for creating and composing DataFlow operators. It also provides utility methods for tasks such as setting engine configuration options, including other scripts, and executing the composed application.

Operators

Methods for composing DataFlow operators take the following parameters:

• A variable number of parameters for input data. The number of parameters depends on the number of input ports of the operator. Optional input ports can be omitted.

• The last parameter contains the properties to set on the operator, specified in JavaScript map notation: {name:value, ...}. See the documentation for each operator for a description of the properties that can be set.
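The calling convention described above can be sketched as follows. This is an illustrative sketch only, not a definitive usage: the dr object here is a hypothetical stand-in so the code runs outside RushScript, and the property names (source, header, predicate, target, mode), their values, and the file names are assumptions. Consult each operator's documentation for the properties it actually supports.

```javascript
// Hypothetical stand-in for the RushScript-provided `dr` variable, so this
// sketch runs in any JavaScript engine. It only records how each method was
// called. Inside RushScript, omit this stub and use the real `dr`.
function stubOperator(name) {
  return function () {
    var args = Array.prototype.slice.call(arguments);
    var properties = args.pop();   // last parameter: the property map
    return { operator: name, inputs: args, properties: properties };
  };
}
var dr = {
  readDelimitedText: stubOperator('readDelimitedText'),
  filterRows: stubOperator('filterRows'),
  writeDelimitedText: stubOperator('writeDelimitedText')
};

// No input ports: only the property map is passed.
// (Property names and values are assumptions for illustration.)
var ratings = dr.readDelimitedText({source: 'ratings.txt', header: true});

// One input port: the input data comes first, the property map last.
var highRatings = dr.filterRows(ratings, {predicate: 'rating >= 4'});

// A writer takes the data to write, followed by its property map.
dr.writeDelimitedText(highRatings, {target: 'high-ratings.txt', mode: 'OVERWRITE'});
```

In this sketch each call returns a plain description of the call. In RushScript, the value returned by an operator method is what gets passed as input data to downstream operators.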

The methods for composing DataFlow operators are described in the following tables. The methods are listed by category in alphabetical order.

I/O Operators

Function Name	Operator Link
deleteFromJDBC	Using the DeleteFromJDBC Operator to Write Database Deletes
forceRecordStaging	Using the ForceRecordStaging Operator to Explicitly Stage Data
loadMatrix	HBase Operators
parseTextFields	Using the ParseTextFields Operator to Parse Text Records
readARFF	Using the ReadARFF Operator to Read Sparse Data
readAvro	Using the ReadAvro Operator to Read Apache Avro Data
readDelimitedText	Using the ReadDelimitedText Operator to Read Delimited Text
readFixedText	Using the ReadFixedText Operator to Read Fixed-width Text
readFromJDBC	Using the ReadFromJDBC Operator to Read Databases
readLog	Using the ReadLog Operator to Read Log Data
readOrc	Using the ReadORC Operator to Read Apache ORC Files
readPMML	Using the ReadPMML Operator to Read PMML
readSource	Using the ReadSource Operator to Read Generic Sources
readStagingDataset	Using the ReadStagingDataset Operator to Read Staging Datasets
updateInJDBC	Using the UpdateInJDBC Operator to Write Database Updates
writeARFF	Using the WriteARFF Operator to Write Sparse Data
writeAvro	Using the WriteAvro Operator to Write Apache Avro Data
writeDelimitedText	Using the WriteDelimitedText Operator to Write Delimited Text
writeFixedText	Using the WriteFixedText Operator to Write Fixed-width Text
writeOrc	Using the WriteORC Operator to Write Apache ORC Files
writePMML	Using the WritePMML Operator to Write PMML
writeSink	Using the WriteSink Operator to Write Generic Targets
writeStagingDataset	Using the WriteStagingDataset Operator to Write Staging Data Sets
writeToJDBC	Using the WriteToJDBC Operator to Write to Databases

Partitioning Data

Function Name	Operator Link
gatherHint	Using the GatherHint Operator to Force a Data Gather
partitionHint	Using the PartitionHint Operator to Explicitly Partition Data

Performing SQL-like Operations

Function Name	Operator Link
crossJoin	Using the CrossJoin Operator to Cross Products
filterExistingRows	Using the FilterExistingRows Operator to Filter by Data Set Membership
filterRows	Using the FilterRows Operator to Filter by Predicate
group	Using the Group Operator to Compute Aggregations
join	Using the Join Operator to Do Standard Relational Joins
limitRows	Using the LimitRows Operator to Limit Output Rows
processByGroup	Using the ProcessByGroup Operator to Process Data by Distinct Key Groups
sampleRandomRows	Using the SampleRandomRows Operator to Sample Data
sort	Using the Sort Operator to Sort Data Sets
unionAll	Using the UnionAll Operator to Create Union of Data Sets

Manipulating Records

Function Name	Operator Link
columnsToRows	Using the ColumnsToRows Operator to Convert Columns to Rows (Unpivot)
deriveFields	Using the DeriveFields Operator to Compute New Fields
discoverEnums	Using the DiscoverEnums Operator to Discover Enumerated Data Types
mergeFields	Using the MergeFields Operator to Merge Fields
remapFields	Using the RemapFields Operator to Rename Fields
removeFields	Using the RemoveFields Operator to Remove Fields
retainFields	Using the RetainFields Operator to Retain Fields
rowsToColumns	Using the RowsToColumns Operator to Convert Rows to Columns (Pivot)
selectFields	Using the SelectFields Operator to Select Fields
splitFields	Using the SplitField Operator to Split Fields

Data Cleansing

Function Name	Operator Link
removeDuplicates	Using the RemoveDuplicates Operator to Remove Duplicates
replaceMissingValues	Using the ReplaceMissingValues Operator to Replace Missing Values

Data Matching

Function Name	Operator Link
analyzeDuplicates	Using the AnalyzeDuplicateKeys Operator to Analyze Duplicates
analyzeLinks	Using the AnalyzeLinkKeys Operator to Analyze Links
clusterDuplicates	Using the ClusterDuplicates Operator to Cluster Duplicates
clusterLinks	Using the ClusterLinks Operator to Cluster Links
discoverDuplicates	Using the DiscoverDuplicates Operator to Discover Duplicates
discoverLinks	Using the DiscoverLinks Operator to Discover Links

Data Analysis

Association Rule Mining

Function Name	Operator Link
convertARMModel	Using the ConvertARMModel Operator to Convert Association Models from PMML
fpGrowth	Using the FPGrowth Operator to Determine Frequent Pattern Growth
frequentItems	Using the FrequentItems Operator to Compute Item Frequency

Clustering

Function Name	Operator Link
kmeans	Using the KMeans Operator to Compute K-Means

Predictive Analytics

Function Name	Operator Link
decisionTreeLearner	Using the DecisionTreeLearner Operator to Construct a Decision Tree PMML Model
decisionTreePredictor	Using the DecisionTreePredictor Operator for Decision Tree Predicting
decisionTreePruner	Using the DecisionTreePruner Operator for Decision Tree Pruning
knnClassifier	Using the KNNClassifier Operator to Classify K-Nearest Neighbors
naiveBayesLearner	Using the NaiveBayesLearner Operator
naiveBayesPredictor	Using the NaiveBayesPredictor Operator
svmLearner	Using the SVMLearner Operator to Build a PMML Support Vector Machine Model
svmPredictor	Using the SVMPredictor Operator to Apply a Support Vector Machine Model

Statistics and Profiling

Function Name	Operator Link
dataQualityAnalyzer	Using the DataQualityAnalyzer Operator to Analyze Data Quality
discoverDomain	Using the DiscoverDomain Operator to Discover Domains
distinctValues	Using the DistinctValues Operator to Find Distinct Values
linearRegressionLearner	Using the LinearRegressionLearner Operator to Learn Linear Regressions
logisticRegressionLearner	Using the LogisticRegressionLearner Operator to Perform Stochastic Gradient Descent
logisticRegressionPredictor	Using the LogisticRegressionPredictor Operator to Apply Classification Models
normalizeValues	Using the NormalizeValues Operator to Normalize Values
rank	Using the Rank Operator to Rank Data
regressionPredictor	Using the RegressionPredictor Operator to Apply Regression Models
runRScript	Using the RunRScript Operator to Invoke R Scripts
runScript	Using the RunScript Operator
sumOfSquares	Using the SumOfSquares Operator to Compute a Sum of Squares
summaryStatistics	Using the SummaryStatistics Operator to Calculate Data Statistics

Text Processing

Function Name	Operator Link
calculateNGramFrequency	Using the CalculateNGramFrequency Operator to Calculate N-gram Frequencies
calculateWordFrequency	Using the CalculateWordFrequency Operator to Calculate Word Frequencies
convertTextCase	Using the ConvertTextCase Operator to Convert Case
countTokens	Using the CountTokens Operator to Count Tokens
dictionaryFilter	Using the DictionaryFilter Operator to Filter Based on Dictionaries
expandTextFrequency	Using the ExpandTextFrequency Operator to Expand Text Frequencies
expandTextTokens	Using the ExpandTextTokens Operator to Expand Text Tokens
filterText	Using the FilterText Operator to Filter Tokenized Text
generateBagOfWords	Using the GenerateBagOfWords Operator to Expand Text Frequencies
textFrequencyFilter	Using the TextFrequencyFilter Operator to Filter Frequencies
textStemmer	Using the TextStemmer Operator to Stem Text
textTokenizer	Using the TextTokenizer Operator to Tokenize Text Strings

Asserting Conditions

Function Name	Operator Link
assertEqual	Using the AssertEqual Operator to Assert Data Equality
assertEqualHash	Using the AssertEqualHash Operator to Assert Hash Equality
assertEqualTypes	Using the AssertEqualTypes Operator to Assert Data Type Equality
assertPredicate	Using the AssertPredicate Operator to Assert a Predicate Condition
assertRowCount	Using the AssertRowCount Operator to Assert Row Count
assertSorted	Using the AssertSorted Operator to Assert Data Ordering

Capturing Data

Function Name	Operator Link
collectRecords	Using the CollectRecords Operator to Capture Records Data
getModel	Using the GetModel Operator to Capture Model Data
emitRecords	Using the EmitRecords Operator to Emit Record Data
putModel	Using the PutModel Operator to Emit Model Data
logRows	Using the LogRows Operator to Log Record Data

Generating Data

Function Name	Operator Link
generateArithmeticSequence	Using the GenerateArithmeticSequence Operator to Generate Sequences
generateConstant	Using the GenerateConstant Operator to Generate Constants
generateRandom	Using the GenerateRandom Operator to Generate Random Data
generateRepeatingCycle	Using the GenerateRepeatingCycle Operator to Generate Repeating Cycles

Additional Utility Methods

Other utility methods provided by the dr variable include (in alphabetical order):

Function Name	Input Parameters	Returns	Description
applicationName	String: application name		Sets the application name. This will be used as the application name for any DataFlow graphs created.
batchSize	int: batch size (optional)	Previous batch size	Sets the ports.batchSize engine configuration to the specified value. Returns the previous value of the ports.batchSize setting. See Engine Configuration Settings for more information.
cluster	• String: cluster host name • int: port number	Cluster Specifier	Sets the cluster specification on the current engine configuration. The next execution invocation will execute on the defined cluster if it exists. See Engine Configuration Settings for more information. The returned Cluster Specifier object can be used to set additional run-time options.
defineOperator	• String: operator name • String: fully qualified class name		Defines a custom operator in the scripting environment. The name of the operator must be unique and valid as a JavaScript function name (no spaces or special characters). The fully qualified class name must reference a valid Java class that implements the LogicalOperator interface. After an operator is defined, it can be used within the JavaScript environment: the operator name is added as a function on the dr variable.
dumpFilePath	String: local path name (optional)	Previous path setting	Sets the dumpFilePath engine configuration to the specified value. Returns the previous value of the dumpFilePath setting. See Engine Configuration Settings for more information.
enabledModules	String: comma separated list of modules		Sets the modules that will be enabled for the current engine configuration. This is a comma-separated list of the modules that should be enabled. For a list of the currently available modules see moduleConfiguration in Engine Configuration Settings.
execute	String: application name (optional)		Compiles and executes the currently composed DataFlow graph.
extensionPaths	Strings: extension paths	String[] (previous extension path setting)	Sets the list of extension paths to use for job execution. This option is valid only for job execution on a cluster, and extension paths are supported only when executing DataFlow jobs using YARN. The extension paths refer to directories that must reside in a shared, distributed storage system such as HDFS, and are intended to contain extensions to the DataFlow environment on a cluster. Files found in the extension paths are copied to the current directory of the containers created to run a DataFlow job on nodes within the cluster. Files with the extensions .tar.gz, .tar, .zip, and .jar are treated as archives. Jar files are copied as is into the local directory; the other archive types are extracted into the local directory under a base directory with the same name as the archive file. All archives are added to the class path of the container. Non-archive files are copied to the local directory of the container but are not added to the class path.
fileConfigurations		FileConfigurations	Creates a new FileConfigurations instance. This object can be used to define a new file system configuration.
include	String: JavaScript file to include		Evaluates the given JavaScript file in the current environment. Including other JavaScript source gives access to commonly used variables and functions. The file is searched for in the following locations, in order: • The directory containing the RushScript file currently being evaluated. • The list of provided include files (see command line reference), in the order given. • The current class path.
makeJoinKeys	• String[]: left keys • String[]: right keys	JoinKey[]	Creates an array of JoinKey objects from the given arrays of left side field names and right side field names. The arrays must not be empty and must be equal in size. Use this function to build a set of join keys when the left side and right side key fields have different names.
maxMerge	int: maxMerge value (optional)	Previous maxMerge setting	Sets the join.maxMerge engine configuration setting. Returns the previous value of the join.maxMerge setting. See Engine Configuration Settings for more information.
maxRetries	int: maxRetries value (optional)	Previous maxRetries setting	Sets the maxRetries engine configuration setting. Returns the previous value of the maxRetries setting. See Engine Configuration Settings for more information.
minParallelism	int: minimumParallelism value (optional)	Previous minimumParallelism setting	Sets the minimumParallelism engine configuration setting. Returns the previous value of the minimumParallelism setting. See Engine Configuration Settings for more information.
monitored	boolean: monitored value (optional)	Previous monitored value	Sets the monitored engine configuration setting. Returns the previous value of the monitored setting. See Engine Configuration Settings for more information.
parallelism	int: parallelism value (optional)	Previous parallelism value	Sets the parallelism engine configuration setting. Returns the previous value of the parallelism setting. See Engine Configuration Settings for more information.
schedulerQueue	String: name of the scheduler queue to use when executing a job	String: previous scheduler queue name setting	Sets the name of the scheduler queue to use when scheduling jobs. The scheduler queue name is valid only when using a cluster for job execution. Currently, scheduler queue names are supported only when using YARN for job execution.
schema		TextRecordBuilder	Creates a new TextRecordBuilder instance. This object can be used to define a new schema or load a previously defined schema.
setFileConfigurations	FileConfigurations:value		Sets the FileConfigurations instance that will be used.
sizeByReaders	boolean: sizeByReaders value (optional)	Previous sizeByReaders value	Sets the ports.sizeByReaders engine configuration setting. Returns the previous value of the ports.sizeByReaders setting. See Engine Configuration Settings for more information.
sortBuffer	String: sortBuffer value (optional)	Previous sortBuffer value	Sets the sort.sortBuffer engine configuration setting. Returns the previous value of the sort.sortBuffer setting. This value is set using a text value to represent the size. Use "k", "m" and "g" suffixes to represent kilobytes, megabytes and gigabytes, respectively. See Engine Configuration Settings for more information.
sortIOBuffer	String: sortIOBuffer value (optional)	Previous sortIOBuffer value	Sets the sort.sortIOBuffer engine configuration setting. Returns the previous value of the sort.sortIOBuffer setting. This value is set using a text value to represent the size. Use "k", "m" and "g" suffixes to represent kilobytes, megabytes and gigabytes, respectively. See Engine Configuration Settings for more information.
spoolThreshold	int: spoolThreshold value (optional)	Previous spoolThreshold value	Sets the ports.spoolThreshold engine configuration setting. Returns the previous value of the ports.spoolThreshold setting. See Engine Configuration Settings for more information.
storageManagementPath	String: storageManagementPath value	Previous storageManagementPath value	Sets the storageManagementPath engine configuration setting. Returns the previous value of the storageManagementPath setting. See Engine Configuration Settings for more information.
writeAhead	int: writeAhead value (optional)	Previous writeAhead value	Sets the ports.writeAhead engine configuration setting. Returns the previous value of the ports.writeAhead setting. See Engine Configuration Settings for more information.
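Many of the configuration methods in the table above return the previous value of the setting, which makes it possible to override a setting for one run and restore it afterward. The following is a minimal sketch of that pattern, not a definitive implementation. The dr object is a hypothetical stand-in that models only the documented "set and return previous value" contract; the initial values are placeholders, and the behavior when the optional argument is omitted (leaving the setting unchanged) is an assumption.

```javascript
// Hypothetical stand-in for the RushScript-provided `dr` variable, so this
// sketch runs in any JavaScript engine. Each setter stores the new value and
// returns the previous one. Inside RushScript, omit this stub.
function makeSetting(initial) {
  var current = initial;
  return function (value) {
    var previous = current;
    // Assumption: calling with no argument leaves the setting unchanged.
    if (arguments.length > 0) {
      current = value;
    }
    return previous;
  };
}
var dr = {
  parallelism: makeSetting(0),      // placeholder initial value
  sortBuffer: makeSetting('100m')   // placeholder initial value
};

// Override settings for one run, remembering what they were.
var oldParallelism = dr.parallelism(4);
var oldSortBuffer = dr.sortBuffer('512m');   // "m" suffix: megabytes

// ... compose operators and execute the application here ...

// Restore the earlier settings for any later runs in the same script.
dr.parallelism(oldParallelism);
dr.sortBuffer(oldSortBuffer);
```

Capturing the returned value at the point of the override keeps the restore step independent of whatever the settings happened to be earlier in the script.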

Last modified date: 01/03/2025