Customizing and Extending DataFlow Functionality : Writing an Aggregation : Aggregation Class Model
 
Share this page                  
Aggregation Class Model
In order to define a custom aggregation, you must define an AggregatorFactory and one or more Aggregators. An aggregator is the execution component that does the "work" of calculating and combining data. An Aggregator is intended to be stateful: it should keep sum, etc., within its instance variables. An AggregatorFactory is the composition piece that creates Aggregators based on input types. An AggregatorFactory is expected to be stateless: they should have no non-final instance variances. In addition, an AggregatorFactory must be JSON serializable.
Aggregator Internals
Aggregators should keep their internal state in one or more private fields. In addition, the internals must be representable as an array of ScalarTokenTypes.
Aggregators must define combineInternals and storeInternals in order to translate to and from the external array representation. The array representation is used for compact storage in the LRU cache that is used when computing partials for unsorted data. This allows the Group operator to reuse the same Aggregator object across multiple groups which avoids the overhead of storing a per-group object in the cache. In addition, the array representation is used to compactly encode partials within a DataFlow record during the redistribute/sort phase.
Aggregator Methods
The following methods must be implemented by a class implementing the Aggregator interface.
Method Name
Description
getOutputType
Called at initialization to determine the output type.
getInternalTypes
Called at initialization to determine the types of each of the internal fields. This determines the types that must be stored in the array representation.
setInputs
Called from the Group operator to provide the Aggregator with a handle to its input fields.
accumulate
Called by the Group operator during the compute-partials step for each input row. The Aggregator should add the values from the current row to its internal counters.
combineInternals
Called by the Group operator during the combine-partials step for each group of partial results. The Aggregator should add the supplied values to its internal counters.
storeInternals
Called by the Group operator during the compute-partials step. This is used to output partial rows and is used to store data into the group LRU cache when switching groups.
reset
Called by the Group operator to "zero-out" the internal counts. The Group operator will call reset followed by a combineInternals for cases where it needs to load internal values. Thus, there is not a separate loadInternals method to implement.
storeFinalResults
Called by the Group operator during the output-final step to output final results.