DF 8.2 | Data Sorting Operator

Building DataFlow Applications > Building DataFlow Applications > Building DataFlow Applications in Java > DataFlow Operator Library > Data Sorting Operator

Was this helpful?

Data Sorting Operator

Sort Operator

The Sort operator is used to sort input data. It will order the data according to one or more specified sort keys.

The Sort operator is parallelized so as to be able to scale to large amounts of data possibly residing on multiple nodes. In order to scale, the Sort operator itself performs a partial sort. A partial sort is a sort in which each partition of the data is sorted but there exists no ordered relationship among the rows of different partitions. If the source data is already partitioned, the existing partitions will be sorted. If the source data is not partitioned, the data will first be partitioned in a round-robin fashion and then sorted.

Any operations that require a gather of input data will preserve sort order. A gather refers to bringing all the data to a single node. Therefore the data will represent a total ordering following a gather. A total ordering is one where all elements are sorted with respect to each other. Because any data sinks (writers) that write to a single file perform a gather, those data sinks will produce a total ordering. In the example that follows, we write to a single file and thus produce a total ordering. Note that the use of a single sink should only be used in cases where the output data is known to be relatively small. This is because gathering the data is inherently nonscalable.

The Sort operator is special because the framework often uses it implicitly, since most operators explicitly specify data distribution and data ordering. For example, the Join operator declares that its inputs must be sorted, and the framework automatically does that sorting, so you do not need to insert a sort before a join. However, you may need a sort before operators such as Run JavaScript, where execution of JavaScript code may have a data order dependency that the DataFlow execution environment is not aware of.

Here are other cases where you may want to use the Sort operator:

• When results are written to a file. Sort can be used just before the final writer operator to achieve the needed output order. If the data is already ordered according to the declared metadata, the framework automatically skips the Sort.

• To control performance settings for a sort that would otherwise be implicit. By using the Sort operator and choosing its settings (for example, sortBufferSize property), you override the implicit sort. If you want your settings used everywhere, the EngineConfig class provides a way to globally configure implicit sorts. See Engine Configuration Settings.

Code Example

The following example demonstrates the usage of the Sort operator. The data input to the operator is sorted by the field named LAST_NAME in ascending order and the field GRADE in descending order. By default, data is sorted in ascending order.

Using the Sort operator in Java

// Sort the input data by the LAST_NAME field in ascending order
// and the GRADE field in descending order.
Sort sorter = graph.add(new Sort());
sorter.setSortKeys(new SortKey[] {SortKey.asc("LAST_NAME"), SortKey.desc("GRADE")});

Using the Sort operator in RushScript

// Sort the input data by the LAST_NAME field in ascending order
// and the GRADE field in descending order.
var sortedData = dr.sort(data, {sortKeys:['LAST_NAME asc', 'GRADE desc']});

By default, data is sorted in ascending order. When you supply only one key field and the order is ascending, shortcuts can be taken. The following example demonstrates using a single sort key in default order.

Using a single, default order key in Java

Sort sorter = graph.add(new Sort());
sorter.setSortKeys("LAST_NAME");

Properties

The Sort operator provides the following properties.

Name	Type	Description
sortKeys	String...	The list of fields to sort by. In the case of multiple fields, data is ordered first by the first key, then by the second, and so on.
sortKeys	SortKey...	The list of sort keys to sort by. A SortKey contains the pair (fieldName, ordering) where fieldName is the name of a field in the input and ordering indicates whether ascending or descending order is required. In the case of multiple fields, data is ordered first by the first key, then by the second, and so on.
maxMerge	int	Controls the maximum number of run files to merge at one time. Small values will reduce memory usage during the merge phase but can force multiple merge passes over the runs. Large values decrease the number of passes, but at a cost of memory; each run being merged will require a buffer of ioBufferSize bytes. If not set or set to zero, this first defaults to EngineConfig.sort.maxMerge. If that is also unspecified or set to zero, maxMerge is calculated by the following formula: maxMerge = sortBufferSize / ioBufferSize
sortBuffer	String	Specifies an approximate cap on the amount of memory used by the sort. If the data is unable to fit into memory, it will be written to intermediate storage as required. Values are supplied as number strings, supporting an optional size suffix. The common suffixes K, M, and G are supported, having the expected meaning; suffixes are case-insensitive, so you may use lowercase. Omitting the suffix indicates the value is in bytes. Values are limited to the range from 1M to 2G. Values outside of this range will be adjusted to the nearest limit. If not specified or the value is zero, this first defaults to EngineConfig.sort.getSortBufferSize(). If that is also unspecified or the value is zero, this defaults to 100M.
sortBufferSize	long	Similar to sortBuffer, the only difference is that the value is specified as a long byte count rather than a string.
ioBuffer	String	Specifies the size of the memory buffers used for I/O operations on run files during sorting. Note that one buffer is required for each run being merged in the merge phase; this value impacts the total memory used during this phase. Values are supplied as number strings, supporting an optional size suffix. The common suffixes K, M, and G are supported, having the expected meaning; suffixes are case-insensitive. Omitting the suffix indicates the value is in bytes. If not specified or the value is zero, this first defaults to EngineConfig.sort.getIOBufferSize(). If that is also unspecified or its value is zero, this defaults to 64K.
ioBufferSize	long	Similar to ioBuffer, the only difference is that the value is specified as a long byte count rather than a string.

Note: The sort tuning properties (maxMerge, sortBufferSize, and ioBufferSize) can be controlled globally through the EngineConfig class. See Engine Configuration Settings for more information.

Ports

The Sort operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data to be sorted.

The Sort operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The sorted data.

Last modified date: 03/10/2025