Building DataFlow Applications
Building DataFlow Applications in Java
DataFlow Application Life Cycle
DataFlow Operator Library
ByteSource and ByteSink I/O Operators
About ByteSource Implementations
BasicByteSource Operator
ConcatenatedByteSource Operator
GlobbingByteSource Operator
About ByteSink Implementations
BasicByteSink Operator
Read I/O Operators
ReadAvro Operator
ReadORC Operator
ReadMDF Operator
ReadParquet Operator
ReadFromJDBC Operator
ReadDelimitedText Operator
ReadFixedText Operator
ReadSource Operator
ReadLog Operator
ReadARFF Operator
ReadStagingDataset Operator
ParseTextFields Operator
ReadJSON Operator
Write I/O Operators
WriteAvro Operator
WriteToJDBC Operator
BinaryWriter Operator
DeleteFromJDBC Operator
UpdateInJDBC Operator
WriteDelimitedText Operator
WriteFixedText Operator
WriteSink Operator
WriteStagingDataset Operator
WriteARFF Operator
ForceRecordStaging Operator
WriteORC Operator
Vector Operators
LoadActianVector Operator
ReadActianVector Operator
HBase Operators
DeleteHBase Operator
ReadHBase Operator
WriteHBase Operator
Data Aggregation Operator
Group Operator
Data Filtering and Sampling Operators
FilterRows Operator
FilterExistingRows Operator
LimitRows Operator
SampleRandomRows Operator
Data Merging Operators
Join Operator
CrossJoin Operator
UnionAll Operator
Data Sorting Operator
Sort Operator
Data Processing Operator
ProcessByGroup Operator
Data Manipulation Operators
DeriveFields Operator
DiscoverEnums Operator
MergeFields Operator
RemoveFields Operator
RetainFields Operator
SelectFields Operator
RemapFields Operator
SplitField Operator
RowsToColumns Operator
ColumnsToRows Operator
Association Rule Mining Operators
FrequentItems Operator
FPGrowth Operator
ConvertARMModel Operator
Cluster Analysis Operators
KMeans Operator
ClusterPredictor Operator
Decision Tree Operators
DecisionTreeLearner Operator
DecisionTreePredictor Operator
DecisionTreePruner Operator
Discover Domain Operator
DiscoverDomain Operator
Run R Script Operator
RunRScript Operator
KNNClassifier Operator
KNNClassifier Operator
Naive Bayes Operators
NaiveBayesLearner Operator
NaiveBayesPredictor Operator
PMML Operators
ReadPMML Operator
WritePMML Operator
Regression Analysis
LinearRegressionLearner Operator
RegressionPredictor Operator
LogisticRegressionLearner Operator
LogisticRegressionPredictor Operator
Statistical Operators
DataQualityAnalyzer Operator
SummaryStatistics Operator
DistinctValues Operator
NormalizeValues Operator
Rank Operator
SumOfSquares Operator
CountRanges Operator
EqualRangeBinning Operator
MostFrequentValues Operator
Support Vector Machine Operators
SVMLearner Operator
SVMPredictor Operator
Text Processing Operators
TextTokenizer Operator
CountTokens Operator
FilterText Operator
DictionaryFilter Operator
ConvertTextCase Operator
TextStemmer Operator
ExpandTextTokens Operator
CalculateWordFrequency Operator
CalculateNGramFrequency Operator
TextFrequencyFilter Operator
ExpandTextFrequency Operator
GenerateBagOfWords Operator
DrawDiagnosticsChart Operator
Data Cleansing Operators
RemoveDuplicates Operator
ReplaceMissingValues Operator
Data Matching Operators
DiscoverLinks Operator
Data Clustering Operators
ClusterDuplicates Operator
ClusterLinks Operator
Match Analysis Operators
AnalyzeDuplicateKeys Operator
AnalyzeLinkKeys Operator
Assertion Operators
AssertEqual Operator
AssertEqualTypes Operator
AssertPredicate Operator
AssertRowCount Operator
Ports
AssertSorted Operator
AssertEqualHash Operator
Data Capturing Operators
CollectRecords Operator
GetModel Operator
EmitRecords Operator
PutModel Operator
LogRows Operator
Generating Additional Data Operators
GenerateConstant Operator
GenerateRandom Operator
GenerateRepeatingCycle Operator
GenerateArithmeticSequence Operator
Partitioning Data Operators
PartitionHint Operator
GatherHint Operator
Randomize Operator
Script Processing Operators
RunScript Operator
RunJavaScript Operator
Advanced Graph Execution Operators
SubJobExecutor Operator
MockableExternalRecordSource Operator
MockableExternalRecordSink Operator
Composing an Application
Examples
Creating a Logical Graph
Creating and Adding Operators
Setting Operator Properties
Overriding the Operator Parallelism
Connecting Operator Ports
Compiling an Application
Compilation Output
Using a LogicalGraphInstance
Executing an Application
Parallelization of I/O
Example
Difference Between Local and Distributed Execution
Using the DataFlow API
Getting Started with the API
Packaging Your Application
Using External Ports
Example of Using External Port
Building DataFlow Applications Using RushScript
Overview of Scripting
RushScript Life Cycle
Runtime in RushScript
RushScript Example
Using Scripting Against API
Benefits of Using RushScript
Using RushScript Against the Java API
Using the DataFlow Java API
Using DataFlow Scripting
Getting Started with DataFlow Scripting
Packaging Your Scripting Application
RushScript Variables
Using Special Variables
dr Variable
Operators
Additional Utility Methods
Cluster Specifier
Remote File Configuration
Schema Variables
Using Enumerated Types
Using Helper Classes
Composing a Graph
Basics of Graph Composition
Creating a LogicalGraph
Creating Operators
Setting Properties
Connecting Ports
Executing a Graph
Multiple Composition and Execution
Implicit Execution
Additional Scripting Features
Using the dr.include() Function
Defining Custom Operators
Registering Operators Explicitly within RushScript
Registering the Operators Implicitly
Accessing Record Data Directly
Creating Join Keys
Setting Engine Configuration Properties
Writing Messages to Standard Output
Use Case Scenario for RushScript
Joining Data
Running from the Command Line
Setting Engine Configurations
Compiling Applications
Invoking a Function
Integrating with IDE
Configuring DataFlow Execution Profiles
Configuring Hadoop Modules
Creating a RushScript Project
Inserting DataFlow Expressions
Running a RushScript Project
Execution Method
Customizing and Extending DataFlow Functionality
Overview
Custom Data Processing
Writing a Function
Overview
Implementing the Evaluator
Defining the Function
Using the Function
Advanced Function Techniques
Dynamic Return Types and Complex Type Constraints
Type-specific Evaluators
Variable Numbers of Arguments
Serialization
Writing an Aggregation
Overview
Example: Parallel Computation of Grouped Average
Raw data
Compute-partials
Redistribute/Sort (if necessary)
Combine-partials
Output-final
Aggregation Class Model
Aggregator Internals
Example: Writing a Custom Aggregation
Adding Expression Language Support
Writing an Operator
Overview
General Requirements and Best Practices
Structure of an Operator
Serialization
Customizing Operator Serialization
Writing an Executable Operator
Overview
Implementing Metadata
Execution Method
Example
Managing the Port Metadata
Physical Port States and End of Data
Control Flow Patterns for Execution
Exception Handling in ExecutableOperators
Working with Sparse Records
Writing a CompositeOperator
Overview
Defining Metadata
Composition
Example
Writing an IterativeOperator
Overview
Using Metadata
Example
Writing a DeferredCompositeOperator
Overview
Using Metadata
Composition
Example