Customizing and Extending DataFlow Functionality : Writing an Aggregation : Example: Parallel Computation of Grouped Average
 
Share this page                  
Example: Parallel Computation of Grouped Average
Consider the case of computing average temperature, grouped by city.
Raw data
For this example, let's assume that this is the original data set and that it is initially partitioned into two subsets as follows:
Partition 1
Partition 2
City
Temperature
City
Temperature
Boston
91
Boston
82
Austin
89
Austin
96
Boston
82
Boston
79
Austin
97
Seattle
61
Boston
89
San Francisco
66
San Francisco
67
Austin
99
Seattle
74
Seattle
77
Austin
100
Seattle
79
Compute-partials
In the first step, we compute partial sum and count for each of the data partitions as follows:
Partition 1
Partition 2
City
Sum(Temperature)
Count(Temperature)
City
Sum(Temperature)
Count(Temperature)
Boston
91+82+89=262
3
Boston
82+79=161
2
Austin
89+97+100=286
3
Austin
96+99=195
2
San Francisco
67
1
Seattle
61+77+79=217
3
Seattle
74
1
San Francisco
66
1
Note that in this example, the partitions output one row for each group. Depending on how many rows there are and how much memory is available, each partition may output multiple rows per group. The Group operator maintains an LRU cache of partials, keyed by group key so as to make a "best effort" to reduce the amount of data to redistribute in the next step.
Redistribute/Sort (if necessary)
Next, we redistribute by city such that all partials of the same city live within the same partition. The repartitioning is somewhat arbitrary as long as a city does not span partitions. For this example, we'll assign Boston and Austin to partition 1 and Seattle and San Francisco to partition 2. Furthermore, within each partition, we sort by city.
Partition 1
Partition 2
City
Sum(Temperature)
Count(Temperature)
City
Sum(Temperature)
Count(Temperature)
Austin
286
3
San Francisco
67
1
Austin
195
2
San Francisco
66
1
Boston
262
3
Seattle
74
1
Boston
161
2
Seattle
217
3
Note that the framework will skip this step if the input data is already distributed/sorted by the group keys.
Combine-partials
Next, combine partials by group:
Partition 1
Partition 2
City
Sum(Temperature)
Count(Temperature)
City
Sum(Temperature)
Count(Temperature)
Austin
286+195=481
3+2=5
San Francisco
67+66=133
1+1=2
Boston
262+161=423
3+2=5
Seattle
74+217=291
1+3=4
Output-final
Finally, we compute average by dividing sums by counts:
Partition 1
Partition 2
City
Avg(Temperature)
City
Avg(Temperature)
Austin
481/5=96.2
San Francisco
133/2=66.5
Boston
423/5=84.6
Seattle
291/4=72.75