On Fri, Jul 24, 2020 at 1:40 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Maybe, not sure what exactly you think is pathological? The trouble is
> hashagg has to spill input tuples but the memory used in no-spill case
> represents aggregated groups, so I'm not sure how you could extrapolate
> from that ...

Yeah, but when hash agg enters spill mode it will continue to advance
the transition states for groups already in the hash table, which
could be quite a significant effect. The peak memory usage for an
equivalent no-spill hash agg is therefore loosely related to the
amount of I/O needed for spilling.
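
To make that concrete, here is a rough sketch (in Python, with made-up
names for the transition function and spill file -- not the actual
nodeAgg.c logic) of the behaviour I mean:

def hash_agg(tuples, max_groups, trans_fn, init_state):
    table = {}
    spilled = []              # stand-in for a spill file on disk

    for key, value in tuples:
        if key in table:
            # Existing groups keep aggregating in memory, even in
            # spill mode -- these tuples never cause any I/O.
            table[key] = trans_fn(table[key], value)
        elif len(table) < max_groups:
            # Still room: create a new group's transition state.
            table[key] = trans_fn(init_state, value)
        else:
            # Spill mode: tuples for unseen groups go to disk, to be
            # aggregated in a later batch.
            spilled.append((key, value))

    return table, spilled

With trans_fn=lambda s, v: s + v and init_state=0 this is just a
spilling SUM(). The share of the input that takes the first branch --
and so never touches disk -- is what ties the no-spill peak memory
usage to the spill I/O, but only loosely.
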
I suppose that you mostly tested cases where memory was in very short
supply, in which case that relationship breaks down completely.
Perhaps it wasn't helpful
for me to bring that factor into this discussion -- it's not as if
there is any doubt that hash agg is spilling a lot more here in any
case.

> Not sure, but I think we need to spill roughly as much as sort, so it
> seems a bit strange that (a) we're spilling 2x as much data and yet the
> cost is so much lower.

ISTM that the amount of I/O that hash agg performs can vary *very*
widely for the same data. This is mostly determined by work_mem, but
there are second-order effects. OTOH, the amount of I/O that a sort
must do is practically fixed. You can quibble with that
characterisation a bit because of multi-pass sorts, but not really --
multi-pass sorts are generally quite rare.
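
As a back-of-the-envelope illustration (uniform group sizes, a single
spill pass, no recursion, and all the numbers invented -- this is not
an attempt to model cost_agg()):

input_mb = 10_000        # total input size (made up)
n_groups = 1_000_000     # number of distinct groups (made up)

# External sort: write and then read essentially the whole input once
# for a single merge pass, more or less regardless of work_mem.
sort_io_mb = 2 * input_mb

for groups_in_memory in (900_000, 500_000, 100_000, 10_000):
    # Only tuples whose group is *not* held in memory get written out
    # and read back by hash agg.
    spilled_fraction = 1 - groups_in_memory / n_groups
    hashagg_io_mb = 2 * input_mb * spilled_fraction
    print(f"{groups_in_memory:>9,} groups in memory: "
          f"hash agg ~{hashagg_io_mb:>6,.0f} MB of I/O "
          f"vs. sort ~{sort_io_mb:,} MB")

The sort figure barely moves, while the hash agg figure swings by
roughly an order of magnitude depending on how many groups happen to
fit -- which is the kind of variability I mean.
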
I think that we need a more sophisticated cost model for this in
cost_agg(). Maybe the "pages_written" calculation could be pessimized.
However, it doesn't seem like this is precisely an issue with I/O
costs.
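
That said, to show what I have in mind for pessimizing it (purely a
sketch of the idea -- not the existing cost_agg() code, and every name
here is invented):

def pessimized_pages_written(input_pages, est_spill_fraction,
                             fudge=2.0):
    # Inflate the optimistic spill estimate to cover second-order
    # effects (re-spilling, partition skew) that the simple model
    # can't see.
    return input_pages * est_spill_fraction * fudge

The fudge factor is arbitrary, of course; the point is only that the
estimate could lean pessimistic rather than optimistic.
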
--
Peter Geoghegan