Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Default setting for enable_hashagg_disk
Date
Msg-id 7d250899c8f2fcbf8030db4e778f39033b94c4ee.camel@j-davis.com
Whole thread Raw
In response to Re: Default setting for enable_hashagg_disk  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the
> cost
> of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its
memory in two different phases:

  1. when writing the sorted runs, it needs the memory to hold the run
before sorting it, and only a single buffer for the output tape; and
  2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the
same time* as it needs a lot of output buffers (one for each
partition). This doesn't matter a lot at high values of work_mem,
because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into
account the memory the partitions will use, and the remaining memory is
what's used in cost_agg(). But there's a lot of room for error in
there.

If someone sees an obvious error in the costing, please let me know.
Otherwise, I think it will just take some time to make it better
reflect reality in a variety of cases. For v13, and we will either need
to live with it, or pessimize the costing for HashAgg until we get it
right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs
MergeJoin -- because a small factor often doesn't make the difference
between plans. But HashAgg and Sort are more competitive with each
other, so costing needs to be more precise.

Regards,
    Jeff Davis





pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Re: estimation problems for DISTINCT ON with FDW
Next
From: Peter Eisentraut
Date:
Subject: Re: Difference for Binary format vs Text format for client-server communication