Glossary
 
Aggregation
A function that groups the values of multiple input rows according to a specified criterion and reduces each group to a single summary value.
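As an illustration in plain Java (a hypothetical sketch, not DataFlow API code), an aggregation groups rows by a key and reduces each group to one value:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregationSketch {
    record Sale(String region, double amount) {}

    // Group rows by a criterion (region) and reduce each group to a single value (sum).
    static Map<String, Double> totalByRegion(List<Sale> rows) {
        return rows.stream()
                   .collect(Collectors.groupingBy(Sale::region,
                            Collectors.summingDouble(Sale::amount)));
    }

    public static void main(String[] args) {
        List<Sale> rows = List.of(
            new Sale("east", 10.0), new Sale("west", 5.0), new Sale("east", 2.5));
        System.out.println(totalByRegion(rows)); // prints each region's total
    }
}
```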
Cluster
A collection of systems across which a DataFlow job is distributed.
Cluster node
A uniquely identified system within a cluster. Typically, at least one partition is created for each node in the cluster unless configured otherwise.
Composition
Composition is the creation of an application context, adding operators to that context, configuring the operators for execution, and linking the operator ports together. Composition does not execute the graph.
Concurrency
The ability to run independent units of work by using the same resources (CPU, disk, and memory) at the same time.
Container
A worker process in a YARN-enabled DataFlow cluster. In a YARN-enabled cluster, a container is created for each partition.
Credentials
Data that is used to verify the authenticity of the user. For DataFlow, it is the data used to establish the access rights for a file system.
Data Parallelism
The process of breaking the input data into smaller sets to perform the same task on each set in parallel.
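For example, in plain Java (not the DataFlow API), a parallel stream applies the same task to subsets of the data concurrently:

```java
import java.util.Arrays;

public class DataParallelSketch {
    // Apply the same task (squaring) to disjoint subsets of the data in parallel,
    // then combine the results into a single sum.
    static long sumOfSquares(int[] data) {
        return Arrays.stream(data).parallel()
                     .mapToLong(x -> (long) x * x)
                     .sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumOfSquares(data)); // prints 30
    }
}
```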
DataFlow
A software architecture that is based on a directed, acyclic graph of functions with the edges indicating the movement of data.
DataFlow API
A Java-based interface which provides the methods and specifications to compose and execute DataFlow-based applications.
DataFlow Engine
A DataFlow-based engine capable of parallel and distributed data operations. The core of DataFlow, it drives the overall run time functionality of DataFlow graphs and is responsible for deadlock and resource monitoring.
DataFlow Node
DataFlow nodes are connected together in KNIME to create a workflow that can perform data transfer, data loading, and data filtering for analytics.
Distributed Mode
The remote execution of a graph on a cluster.
Evaluator
The part of a function implementation that performs the actual computation.
Executor
The process that runs the physical graphs, processing the partitions of data assigned to a node in a cluster.
External Port
A communication port that allows another application to send data to or receive data from a DataFlow graph.
External Type
A data format external to DataFlow that is used to represent values.
Format
The structure and representation of values in a data file.
Fragments
The multiple output files created from a single original target sink; each fragment contains the portion of data from one partition.
Function
See Scalar Valued Function.
Graph Client
The part of a distributed graph that runs locally on the machine invoking the execution.
Horizontal Partitioning
See Data Parallelism.
Job
A workflow combined with concrete configuration, including cluster and engine settings.
Local Mode
The execution of a graph on the same Java Virtual Machine (JVM) as the invoking code.
Logical Graph
The result of a DataFlow composition, describing the processing to be done without specifying how it is accomplished. The graph is created by adding operators and connecting them.
Logical Operator
A node in a DataFlow application graph. The nodes of a logical graph describe the transformations and functions applied to the data flowing through the graph.
Logical Port
The point used to connect operators in a logical graph. This is done by linking the logical output of one operator to the logical input of another operator.
Metadata
Defines certain attributes of an operator instance, including the required sort order, data partitioning, data availability, data types, and parallelism.
Model Port
A type of port that is used to communicate a single object in a DataFlow graph. Typically, the object that is passed represents a shared model of a certain kind.
Node
A uniquely identified machine within a cluster.
Parallelism
The ability to break a unit of work into smaller sub-tasks that can be executed concurrently.
Partial
A per-partition intermediate result, such as an internal counter, computed within a distributed graph. Partials are frequently used to reduce the amount of data that must be redistributed or sorted during an aggregation.
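A plain-Java sketch of the idea (hypothetical, not DataFlow API code): each partition computes a small partial sum locally, and only the partials are combined, instead of moving every row:

```java
import java.util.Arrays;
import java.util.List;

public class PartialAggregationSketch {
    // Each "partition" (a long[] here) reduces its own rows to one partial sum;
    // only those small partials are then combined into the final total.
    static long totalFromPartials(List<long[]> partitions) {
        return partitions.stream()
                         .mapToLong(p -> Arrays.stream(p).sum()) // partial per partition
                         .sum();                                  // combine partials
    }

    public static void main(String[] args) {
        List<long[]> partitions = List.of(new long[]{1, 2}, new long[]{3, 4, 5});
        System.out.println(totalFromPartials(partitions)); // prints 15
    }
}
```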
Partition
A subset of a data set that is divided into distinct independent parts.
Performance
The amount of appropriate work accomplished by a computer system as compared to the time and resources used.
Physical Graph
A sequence of phases, represented as DataFlow graphs, that determines the actual data processing performed and how it is accomplished.
Physical Port
A port that corresponds to a logical port in a logical operator. This also defines the implementation used to access the incoming or outgoing data.
Pipeline Parallelism
A parallelism method that extends on simple task parallelism by breaking the task into a sequence of processing stages.
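A minimal plain-Java sketch of two pipeline stages (hypothetical, not DataFlow API code), connected by a queue so the stages overlap their work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    static final int POISON = Integer.MIN_VALUE; // sentinel marking end of stream

    // Stage 1 parses strings to ints; stage 2 doubles them.
    // Each stage runs in its own thread, so they execute concurrently.
    static List<Integer> run(List<String> input) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        List<Integer> out = new ArrayList<>();

        Thread stage1 = new Thread(() -> {
            try {
                for (String s : input) queue.put(Integer.parseInt(s));
                queue.put(POISON); // signal completion downstream
            } catch (InterruptedException ignored) {}
        });
        Thread stage2 = new Thread(() -> {
            try {
                while (true) {
                    int v = queue.take();
                    if (v == POISON) break;
                    out.add(v * 2);
                }
            } catch (InterruptedException ignored) {}
        });
        stage1.start();
        stage2.start();
        stage1.join();
        stage2.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("1", "2", "3"))); // prints [2, 4, 6]
    }
}
```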
PMML
Predictive Model Markup Language
Port
The end point connection of an operator. Operators can have input and output ports. Ports can be connected to link the operators in a graph.
Property
The setting of an operator. Properties typically affect either the creation or run time behavior of an operator.
Pseudo-distributed Mode
See Local Mode.
Record Port
A type of port used to communicate streams of records by the DataFlow graph. This is the most common form of port, as most operators accept or produce records.
RushScript
A simple scripting language, based on JavaScript, that provides a means to write and execute DataFlow applications.
Scalability
A measure of the change in performance as either available resources or the volume of input data changes.
Scalar Valued Function
A function that accepts a record as input and returns a single scalar value as output.
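Illustrated in plain Java (a hypothetical sketch; the record is modeled here as a field-name-to-value map, which is not how the DataFlow API represents records):

```java
import java.util.Map;

public class ScalarFunctionSketch {
    // A scalar valued function: one record in, one scalar out
    // (here, the sum of two numeric fields of the record).
    static double total(Map<String, Double> record) {
        return record.get("price") + record.get("tax");
    }

    public static void main(String[] args) {
        System.out.println(total(Map.of("price", 9.0, "tax", 1.0))); // prints 10.0
    }
}
```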
Schema
The fields of a record and specifically their names and data types.
Sink
An operator that can accept data from a DataFlow graph and store it for later use.
Source
An operator that produces data and supplies it to a DataFlow graph.
Split or file split
A segment of a data file spanning a range of bytes. In DataFlow, files are divided into a number of splits that are assigned to different partitions to allow parallel reading.
Subject
The identity of an agent that performs actions such as accessing a file. A subject is used to perform access control checks.
Task Parallelism
The process of breaking the task into independent sub-tasks that can be performed concurrently.
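A plain-Java sketch (not DataFlow API code) running two independent subtasks concurrently:

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class TaskParallelSketch {
    // Two independent subtasks (finding the min and the max of the same
    // array) are executed concurrently, then their results are collected.
    static int[] minAndMax(int[] data) {
        CompletableFuture<Integer> min = CompletableFuture.supplyAsync(
            () -> Arrays.stream(data).min().orElseThrow());
        CompletableFuture<Integer> max = CompletableFuture.supplyAsync(
            () -> Arrays.stream(data).max().orElseThrow());
        return new int[]{min.join(), max.join()};
    }

    public static void main(String[] args) {
        int[] r = minAndMax(new int[]{3, 1, 4, 1, 5});
        System.out.println(r[0] + " " + r[1]); // prints 1 5
    }
}
```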
Token
The fundamental unit of data in DataFlow. Operators accept and create tokens, and send and receive them over ports.
Vertical Partitioning
See Task Parallelism.