Concepts to Know
Introduction to DataFlow
Overview
Performance and Scalability
Concurrency and Parallelism
Data Parallelism
Task Parallelism
Pipeline Parallelism
Dataflow
Data Security Model
File Systems
Hadoop
Encryption
Temporary Storage
Databases
Networks
Application Model
Applications in DataFlow
Graphs
Operators
Ports
External Ports
Execution Modes
Overview
Local Mode
Distributed Mode
Cluster Manager
Node Manager
Executor
Graph Client
Example: Data Processing in DataFlow
Example
Query
Corresponding Physical Graphs (Distributed)
Corresponding Physical Graphs (Pseudo-distributed)
Distributed Execution
Phase One Processing
Output of FilterRows
Output of DeriveFields
Output of Group(partial)
Phase Two Processing
Pipelining within Phases
Functions
Operators as Functions
ScalarValuedFunction
Composing Functions
Evaluating Functions
Validating a Function
Binding an Evaluator
Expression Language
DataFlow Core Expression Language
Basic Syntax
Field References
Literals
Numeric Literals
String Literals
Null Values
Predicate Operators
Arithmetic Operators
Conditional Operators
Parentheses
Core DataFlow Functions
Constant Value Functions
Conversion Functions
Date and Time Functions
Formatting Functions
Math Functions
Statistics Functions
String Functions
List Functions
Map Functions
Matching Functions
Aggregate Functions
Extending the Expression Language
Tokens and Types
Overview
Scalar Token Types
Object Token Types
Numeric Token Types
Date, Time, and Interval Token Types
Record Token Types
Defining Record Types
Token Types and Related Java Types
Token Containers
TokenValued Hierarchy
TokenSettable Hierarchy
Concrete Implementations
Big Picture
Token Type Compatibility
Compatibility Based on Token Type Hierarchy
Type Conversions for Token Values
Conversions to Objects
Conversion to String Form for Debugging
Comparing Token Values
Equality and Ordering of Null
Comparing Number Types
Input/Output
File Access
File Systems
Paths
File Patterns
File Client
Parallel Input/Output
Reading in Parallel
Compressed Files
Writing in Parallel
Formats and Schemas
Formats
Format Discovery
Schemas
External Types
Null Indicators
String Conversion Behavior
Schema Construction
Schema Defaults
Schema Discovery
YARN on Hadoop Cluster
YARN Concepts
DataFlow Cluster Manager
Supported Files, Formats, and Text Types
Supported Files and Formats
Supported Compression Formats
gzip
bzip2
snappy
Supported File Systems
Local Files
Process Streams
Hadoop Distributed File System (HDFS)
Amazon S3
FTP, FTPS, or SFTP
Azure Blob Storage
Google Cloud Platform (GCP)
Supported Text Types
String Types
Number Types
Date, Time, and Interval Types
Boolean Types
IP Types
Enumeration Types
Binary Types
Character Types