Customizing and Extending DataFlow Functionality : Writing an Aggregation
 
Share this page                  
Writing an Aggregation
Overview
In addition to the built-in aggregations (average, sum, variance, etc.), the Group operator allows custom aggregations. This section describes how to write a custom aggregation for the Group operator. Before reading this section, you should read the section Using the Group Operator to Compute Aggregations for information on using the Group operator.
Before we examine the details of writing an aggregator, it's important to examine parallelism within the Group operator. At a high level, computation proceeds by first computing per-partition internal counters that we refer to as partials. The partials are then combined across partitions and then normalized to a final result.
More formally, any parallel group operation consists of the following steps:
1. Compute-partials: In this step, per-partition internal counters are calculated. For example, in the case of average, these would be the per-partition count and per-partition sum. The purpose of this step is to reduce the amount of initial data to redistribute/sort in the next step.
2. Redistribute/Sort: In this step, partials are redistributed by group key such that groups no longer span partitions. In addition, within each partition, data is sorted by group key so as to guarantee that rows of the same group are grouped together. Note that the redistribute/sort is skipped in the case where data is already partitioned/sorted by key.
3. Combine-partials: In this step, we combine the per-partition partial values. For example, in the case of average, we compute an overall sum equal to the sum of the partial sums and an overall count equal to the sum of the partial counts.
4. Output-final: In this step, we output the final aggregation.