Ah-hah. I made a table similar to yours (with a lot of dummy data) and
went trawling through the backend memory to see where the space was
going. I found two significant inefficiencies in HashAggregate's
space usage:
* It stores a "representative tuple" for each input group, containing
the grouping fields needed to identify the group. Or at least that's
the theory. What I saw in this example (HashAggregate over a direct
table SeqScan) is that what was getting stored was raw disk tuples
including *all* the table columns, not just the needed ones. This is
doubtless because the optimization that skips a projection step when
not needed at a table-scan node is firing inappropriately. This was
only costing one extra integer field in my cut-down example, but it
might've accounted for significant overhead in your case, and in the
general case it could be horribly bad. Even if the projection were
being done properly, I think we'd be storing copies of the input
columns used to compute the aggregates, not just the grouping columns.
So there's probably an easy fix here that could be back-patched into
existing releases, and a more invasive fix that will save more space
(the first sketch after this list illustrates the difference).
* dynahash.c is allocating new hashtable entries 32 at a time, without
any awareness of the fact that palloc() rounds small requests up to the
next power-of-2 size. In the example I was looking at, it was asking
for 1280 bytes at a time, resulting in almost 40% of the space used
being completely wasted. This is pretty trivial to fix (see the second
sketch below).
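To make the first point concrete, here's a minimal standalone C sketch
(the struct names, column layout, and sizes are all invented for
illustration; this is not the actual executor code) contrasting what a
hash entry needs to hold against what gets stored when the raw disk
tuple is kept:

    #include <stdio.h>

    /* Hypothetical raw table row: one grouping column plus wide
     * payload columns that the aggregation doesn't need, but which
     * get dragged along if the projection step is skipped. */
    typedef struct RawRow
    {
        int     group_key;      /* the only column GROUP BY needs */
        char    payload[100];   /* everything else in the table */
    } RawRow;

    /* What the hash entry really needs: the grouping column(s) plus
     * the aggregate's transition state. */
    typedef struct ProjectedEntry
    {
        int     group_key;
        long    agg_state;      /* e.g. a running count or sum */
    } ProjectedEntry;

    int
    main(void)
    {
        printf("per-group cost, raw tuple kept:  %zu bytes\n",
               sizeof(RawRow));
        printf("per-group cost, projected:       %zu bytes\n",
               sizeof(ProjectedEntry));
        return 0;
    }

With a wide table the ratio gets arbitrarily bad, which is why the
projection bug matters far more in the general case than the single
wasted integer in my cut-down example.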
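The second point is just arithmetic, but here's a standalone sketch
using the numbers from my test. The rounding helper mimics what the
allocator does for small chunks; the 40-byte entry size is implied by
the figures above (1280 bytes for a batch of 32):

    #include <stdio.h>

    /* Mimic palloc()'s behavior for small chunks: round the request
     * up to the next power of 2. */
    static size_t
    rounded_chunk_size(size_t request)
    {
        size_t  size = 8;       /* assumed minimum chunk size */

        while (size < request)
            size <<= 1;
        return size;
    }

    int
    main(void)
    {
        size_t  entrysize = 40; /* per-entry size in my example */
        int     nalloc = 32;    /* entries allocated per batch */
        size_t  request = entrysize * nalloc;           /* 1280 */
        size_t  actual = rounded_chunk_size(request);   /* 2048 */
        int     better;

        printf("requested %zu, got %zu, wasted %.1f%%\n",
               request, actual,
               100.0 * (double) (actual - request) / (double) actual);

        /* One obvious fix: size the batch to just fill the rounded-up
         * chunk instead of ignoring the rounding. */
        better = (int) (actual / entrysize);    /* 51 entries */
        printf("allocating %d per batch wastes only %zu bytes\n",
               better, actual - (size_t) better * entrysize);
        return 0;
    }

That's the 1280-byte request rounding up to 2048 bytes, i.e. 37.5% of
the chunk wasted, which is the "almost 40%" figure above.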
Aside from being just plain inefficient, neither of these effects is
being accounted for in the planner's estimate of the space needed for a
hash aggregation, and thus they could be contributing to the problem of
underestimated hashtable size leading to out-of-memory failures.
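For illustration, here's a back-of-the-envelope sketch (the numbers are
made up, and this is not the planner's actual costing code) of how
unaccounted per-entry overhead translates into blowing past the memory
budget:

    #include <stdio.h>

    int
    main(void)
    {
        double  planned_entry = 40.0;   /* bytes/group planner charges */
        double  actual_entry = 64.0;    /* with rounding + extra columns */
        double  budget_kb = 1024.0;     /* hypothetical memory budget */
        double  groups = budget_kb * 1024.0 / planned_entry;

        /* The planner decides this many groups fit in the budget... */
        printf("planner expects %.0f groups in %.0f kB\n",
               groups, budget_kb);
        /* ...but at the true per-entry cost they need ~60% more. */
        printf("they actually need %.0f kB\n",
               groups * actual_entry / 1024.0);
        return 0;
    }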
I'm taking off for the evening but will look into fixing these soon.
regards, tom lane