Example: Parallel Computation of Grouped Average

For this example, let's assume that this is the original data set and that it is initially partitioned into two subsets as follows:

Partition 1		Partition 2
City	Temperature	City	Temperature
Boston	91	Boston	82
Austin	89	Austin	96
Boston	82	Boston	79
Austin	97	Seattle	61
Boston	89	San Francisco	66
San Francisco	67	Austin	99
Seattle	74	Seattle	77
Austin	100	Seattle	79

In the first step, we compute partial sum and count for each of the data partitions as follows:

Partition 1			Partition 2
City	Sum(Temperature)	Count(Temperature)	City	Sum(Temperature)	Count(Temperature)
Boston	91+82+89=262	3	Boston	82+79=161	2
Austin	89+97+100=286	3	Austin	96+99=195	2
San Francisco	67	1	Seattle	61+77+79=217	3
Seattle	74	1	San Francisco	66	1

Note that in this example, the partitions output one row for each group. Depending on how many rows there are and how much memory is available, each partition may output multiple rows per group. The Group operator maintains an LRU cache of partials, keyed by group key so as to make a "best effort" to reduce the amount of data to redistribute in the next step.

Next, we redistribute by city such that all partials of the same city live within the same partition. The repartitioning is somewhat arbitrary as long as a city does not span partitions. For this example, we'll assign Boston and Austin to partition 1 and Seattle and San Francisco to partition 2. Furthermore, within each partition, we sort by city.

Partition 1			Partition 2
City	Sum(Temperature)	Count(Temperature)	City	Sum(Temperature)	Count(Temperature)
Austin	286	3	San Francisco	67	1
Austin	195	2	San Francisco	66	1
Boston	262	3	Seattle	74	1
Boston	161	2	Seattle	217	3

Note that the framework will skip this step if the input data is already distributed/sorted by the group keys.

Next, combine partials by group:

Partition 1			Partition 2
City	Sum(Temperature)	Count(Temperature)	City	Sum(Temperature)	Count(Temperature)
Austin	286+195=481	3+2=5	San Francisco	67+66=133	1+1=2
Boston	262+161=423	3+2=5	Seattle	74+217=291	1+3=4

Finally, we compute average by dividing sums by counts:

Partition 1		Partition 2
City	Avg(Temperature)	City	Avg(Temperature)
Austin	481/5=96.2	San Francisco	133/2=66.5
Boston	423/5=84.6	Seattle	291/4=72.75