Building DataFlow Applications
Building DataFlow Applications in Java
DataFlow Application Life Cycle
DataFlow Operator Library
Performing I/O Operations
Using ByteSource Implementations
Using ByteSink Implementations
Read Operators
Properties
Write Operators
Database Loaders
HBase Operators
Performing SQL-Like Operations
Executing Relational Operations
Using the Group Operator to Compute Aggregations
Filtering and Sampling Data Sets
Merging Data Sets
Using the Sort Operator to Sort Data Sets
Using the ProcessByGroup Operator to Process Data by Distinct Key Groups
Manipulating Records
DataFlow Operators for Manipulating Records
Using the DeriveFields Operator to Compute New Fields
Using the DiscoverEnums Operator to Discover Enumerated Data Types
Using the MergeFields Operator to Merge Fields
Using the RemoveFields Operator to Remove Fields
Using the RetainFields Operator to Retain Fields
Using the SelectFields Operator to Select Fields
Using the RemapFields Operator to Rename Fields
Using the SplitField Operator to Split Fields
Using the RowsToColumns Operator to Convert Rows to Columns (Pivot)
Using the ColumnsToRows Operator to Convert Columns to Rows (Unpivot)
Performing Data Analysis
DataFlow Analytics Operators
Association Rule Mining
Cluster Analysis
Decision Trees
Using the DiscoverDomain Operator to Discover Domains
Using the RunRScript Operator to Invoke R Scripts
Using the KNNClassifier Operator to Classify K-Nearest Neighbors
Naive Bayes within DataFlow
Predictive Model Markup Language (PMML)
Regression Analysis
Statistics
Statistics Operators
Text Processing
Performing Data Cleansing
DataFlow Cleansing Operators
Performing Data Matching
DataFlow Matching Operators
Data Matching
Data Clustering
Match Analysis
Asserting Conditions
DataFlow Assertion Operators
Capturing Data
Generating Data
Generating Additional Data Fields
Partitioning Data
Processing Rows Using User-defined Scripts
Using the RunScript Operator
Composing an Application
Examples
Creating a Logical Graph
Creating and Adding Operators
Setting Operator Properties
Overriding the Operator Parallelism
Connecting Operator Ports
Compiling an Application
Compilation Output
Using a LogicalGraphInstance
Executing an Application
Parallelization of I/O
Example
Difference Between Local and Distributed Execution
Using the DataFlow API
Getting Started with the API
Packaging Your Application
Using External Ports
Example of Using External Ports
Building DataFlow Applications Using RushScript
Overview of Scripting
RushScript Life Cycle
Runtime in RushScript
RushScript Example
Using Scripting Against API
Benefits of Using RushScript
Using RushScript Against the Java API
Using the DataFlow Java API
Using DataFlow Scripting
Getting Started with DataFlow Scripting
Packaging Your Scripting Application
RushScript Variables
Using Special Variables
dr Variable
Operators
Additional Utility Methods
Cluster Specifier
Schema Variables
Using Enumerated Types
Using Helper Classes
Composing a Graph
Basics of Graph Composition
Creating a LogicalGraph
Creating Operators
Setting Properties
Connecting Ports
Executing a Graph
Multiple Composition and Execution
Implicit Execution
Additional Scripting Features
Using the dr.include() Function
Defining Custom Operators
Registering Operators Explicitly within RushScript
Registering the Operators Implicitly
Accessing Record Data Directly
Creating Join Keys
Setting Engine Configuration Properties
Writing Messages to Standard Output
Use Case Scenario for RushScript
Joining Data
Running from the Command Line
Setting Engine Configurations
Compiling Applications
Invoking a Function
Integrating with IDE
Configuring DataFlow Execution Profiles
Configuring Hadoop Modules
Creating a RushScript Project
Inserting DataFlow Expressions
Running a RushScript Project
Execution Method
Customizing and Extending DataFlow Functionality
Overview
Custom Data Processing
Writing a Function
Overview
Implementing the Evaluator
Defining the Function
Using the Function
Advanced Function Techniques
Dynamic Return Types and Complex Type Constraints
Type-specific Evaluators
Variable Numbers of Arguments
Serialization
Writing an Aggregation
Overview
Example: Parallel Computation of Grouped Average
Raw data
Compute-partials
Redistribute/Sort (if necessary)
Combine-partials
Output-final
Aggregation Class Model
Aggregator Internals
Example: Writing a Custom Aggregation
Adding Expression Language Support
Writing an Operator
Overview
General Requirements and Best Practices
Structure of an Operator
Serialization
Customizing Operator Serialization
Writing an Executable Operator
Overview
Implementing Metadata
Execution Method
Example
Managing the Port Metadata
Physical Port States and End of Data
Control Flow Patterns for Execution
Exception Handling in ExecutableOperators
Working with Sparse Records
Writing a CompositeOperator
Overview
Defining Metadata
Composition
Example
Writing an IterativeOperator
Overview
Using Metadata
Example
Writing a DeferredCompositeOperator
Overview
Using Metadata
Composition
Example