Glossary
 
Aggregation
A function that groups the values of multiple input rows according to a specified criterion and reduces each group to a single summary value.
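As an illustration in plain Java (a hypothetical sketch, not DataFlow API code), an aggregation groups rows by a key and reduces each group to one value:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregationSketch {
    record Sale(String region, double amount) {}

    // Group rows by a criterion (region) and reduce each group to a single value (sum).
    static Map<String, Double> totalByRegion(List<Sale> rows) {
        return rows.stream()
                   .collect(Collectors.groupingBy(Sale::region,
                            Collectors.summingDouble(Sale::amount)));
    }

    public static void main(String[] args) {
        List<Sale> rows = List.of(
            new Sale("east", 10.0), new Sale("west", 5.0), new Sale("east", 2.5));
        System.out.println(totalByRegion(rows)); // prints each region's total
    }
}
```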
Cluster
A collection of systems across which a DataFlow job is distributed.
Cluster node
A uniquely identified system within a cluster. Typically, at least one partition is created for each node in the cluster unless configured otherwise.
Composition
Composition is the creation of an application context, adding operators to that context, configuring the operators for execution, and linking the operator ports together. Composition does not execute the graph.
Concurrency
The ability to run independent units of work by using the same resources (CPU, disk, and memory) at the same time.
Container
A worker process in a YARN-enabled DataFlow cluster. In a YARN-enabled cluster, a container is created for each partition.
Credentials
Data that is used to verify the authenticity of the user. For DataFlow, it is the data used to establish the access rights for a file system.
Data Parallelism
The process of breaking the input data into smaller sets to perform the same task on each set in parallel.
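For example, in plain Java (not the DataFlow API), a parallel stream applies the same task to subsets of the data concurrently:

```java
import java.util.Arrays;

public class DataParallelSketch {
    // Apply the same task (squaring) to disjoint subsets of the data in parallel,
    // then combine the results into a single sum.
    static long sumOfSquares(int[] data) {
        return Arrays.stream(data).parallel()
                     .mapToLong(x -> (long) x * x)
                     .sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumOfSquares(data)); // prints 30
    }
}
```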
DataFlow
A software architecture that is based on a directed, acyclic graph of functions with the edges indicating the movement of data.
DataFlow API
A Java-based interface which provides the methods and specifications to compose and execute DataFlow-based applications.
DataFlow Engine
A DataFlow-based engine capable of parallel and distributed data operations. The core of DataFlow, it drives the overall run time functionality of DataFlow graphs and is responsible for deadlock and resource monitoring.
DataFlow Node
DataFlow nodes are connected together in KNIME to create a workflow that can perform data transfer, data loading, and data filtering for analytics.
Distributed Mode
The remote execution of a graph on a cluster.
Evaluator
The part of a function implementation that performs the actual computation.
Executor
The process that runs the physical graphs, processing the partitions of data assigned to a node in a cluster.
External Port
A communication port that allows another application to send data to or receive data from a DataFlow graph.
External Type
A data format external to DataFlow that is used to represent values.
Format
The structure and representation of values in a data file.
Fragments
The multiple output files created from a single original target sink; each fragment contains the portion of data from one partition.
Function
See Scalar Valued Function.
Graph Client
The part of a distributed graph that runs locally on the machine invoking the execution.
Horizontal Partitioning
See Data Parallelism.
Job
A workflow combined with concrete configuration, including cluster and engine settings.
Local Mode
The execution of a graph on the same Java Virtual Machine (JVM) as the invoking code.
Logical Graph
The result of a DataFlow composition, describing the processing to be done without specifying how it is accomplished. The graph is created by adding operators and connecting them.
Logical Operator
A node in a DataFlow application graph. The nodes of a logical graph describe the transformations and functions applied to the data flowing through the graph.
Logical Port
The point used to connect operators in a logical graph. This is done by linking the logical output of one operator to the logical input of another operator.
Metadata
Defines certain attributes of an operator instance, including the required sort order, data partitioning, data availability, data types, and parallelism.
Model Port
A type of port that is used to communicate a single object in a DataFlow graph. Typically, the object that is passed represents a shared model of a certain kind.
Node
A uniquely identified machine within a cluster.
Parallelism
The ability to break a unit of work into smaller sub-tasks that can be executed concurrently.
Partial
A per-partition intermediate result, such as an internal counter, computed within a distributed graph. Partials are frequently used to reduce the amount of data that must be redistributed or sorted during an aggregation.
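A plain-Java sketch of the idea (hypothetical, not DataFlow API code): each partition computes a small partial sum locally, and only the partials are combined, instead of moving every row:

```java
import java.util.Arrays;
import java.util.List;

public class PartialAggregationSketch {
    // Each "partition" (a long[] here) reduces its own rows to one partial sum;
    // only those small partials are then combined into the final total.
    static long totalFromPartials(List<long[]> partitions) {
        return partitions.stream()
                         .mapToLong(p -> Arrays.stream(p).sum()) // partial per partition
                         .sum();                                  // combine partials
    }

    public static void main(String[] args) {
        List<long[]> partitions = List.of(new long[]{1, 2}, new long[]{3, 4, 5});
        System.out.println(totalFromPartials(partitions)); // prints 15
    }
}
```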
Partition
A subset of a data set that is divided into distinct independent parts.
Performance
The amount of appropriate work accomplished by a computer system as compared to the time and resources used.
Physical Graph
A sequence of phases, represented as DataFlow graphs, that determines the actual data processing performed and how it is accomplished.
Physical Port
A port that corresponds to a logical port in a logical operator. This also defines the implementation used to access the incoming or outgoing data.
Pipeline Parallelism
A parallelism method that extends on simple task parallelism by breaking the task into a sequence of processing stages.
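A minimal plain-Java sketch of two pipeline stages (hypothetical, not DataFlow API code), connected by a queue so the stages overlap their work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    static final int POISON = Integer.MIN_VALUE; // sentinel marking end of stream

    // Stage 1 parses strings to ints; stage 2 doubles them.
    // Each stage runs in its own thread, so they execute concurrently.
    static List<Integer> run(List<String> input) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        List<Integer> out = new ArrayList<>();

        Thread stage1 = new Thread(() -> {
            try {
                for (String s : input) queue.put(Integer.parseInt(s));
                queue.put(POISON); // signal completion downstream
            } catch (InterruptedException ignored) {}
        });
        Thread stage2 = new Thread(() -> {
            try {
                while (true) {
                    int v = queue.take();
                    if (v == POISON) break;
                    out.add(v * 2);
                }
            } catch (InterruptedException ignored) {}
        });
        stage1.start();
        stage2.start();
        stage1.join();
        stage2.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("1", "2", "3"))); // prints [2, 4, 6]
    }
}
```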
PMML
Predictive Model Markup Language
Port
The end point connection of an operator. Operators can have input and output ports. Ports can be connected to link the operators in a graph.
Property
The setting of an operator. Properties typically affect either the creation or run time behavior of an operator.
Pseudo-distributed Mode
See Local Mode.
Record Port
A type of port used to communicate streams of records by the DataFlow graph. This is the most common form of port, as most operators accept or produce records.
RushScript
A simple scripting language, based on JavaScript, that provides a means to write and execute DataFlow applications.
Scalability
A measure of the change in performance as either available resources or the volume of input data changes.
Scalar Valued Function
A function that accepts a record as input and returns a single scalar value as output.
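Illustrated in plain Java (a hypothetical sketch; the record is modeled here as a field-name-to-value map, which is not how the DataFlow API represents records):

```java
import java.util.Map;

public class ScalarFunctionSketch {
    // A scalar valued function: one record in, one scalar out
    // (here, the sum of two numeric fields of the record).
    static double total(Map<String, Double> record) {
        return record.get("price") + record.get("tax");
    }

    public static void main(String[] args) {
        System.out.println(total(Map.of("price", 9.0, "tax", 1.0))); // prints 10.0
    }
}
```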
Schema
The fields of a record and specifically their names and data types.
Sink
An operator that can accept data from a DataFlow graph and store it for later use.
Source
An operator that produces data and supplies it to a DataFlow graph.
Split or file split
A segment of a data file spanning a range of bytes. In DataFlow, files are divided into a number of splits that are assigned to different partitions to allow parallel reading.
Subject
The identity of an agent that performs actions such as accessing a file. A subject is used to perform access control checks.
Task Parallelism
The process of breaking the task into independent sub-tasks that can be performed concurrently.
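A plain-Java sketch (not DataFlow API code) running two independent subtasks concurrently:

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class TaskParallelSketch {
    // Two independent subtasks (finding the min and the max of the same
    // array) are executed concurrently, then their results are collected.
    static int[] minAndMax(int[] data) {
        CompletableFuture<Integer> min = CompletableFuture.supplyAsync(
            () -> Arrays.stream(data).min().orElseThrow());
        CompletableFuture<Integer> max = CompletableFuture.supplyAsync(
            () -> Arrays.stream(data).max().orElseThrow());
        return new int[]{min.join(), max.join()};
    }

    public static void main(String[] args) {
        int[] r = minAndMax(new int[]{3, 1, 4, 1, 5});
        System.out.println(r[0] + " " + r[1]); // prints 1 5
    }
}
```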
Token
The fundamental unit of data in DataFlow. Operators accept and create tokens, and send and receive them over ports.
Vertical Partitioning
See Task Parallelism.