Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Default setting for enable_hashagg_disk
Msg-id: CAH2-WzkhNFqSeTmzfYKNUi-6hXxH1uL=1yZLh2uEO18wWWn94g@mail.gmail.com
In response to: Re: Default setting for enable_hashagg_disk (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Default setting for enable_hashagg_disk (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit is the sharp discontinuities when spilling with
significantly less work_mem than the "optimal" amount. For example,
with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
that setting work_mem to 6MB looks like this:

   ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=21039.788..32278.703 rows=2000000 loops=1)
         Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
         Group Key: lineitem.l_partkey
         Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128

(And with higher work_mem settings we get a sensible-looking number of
batches that matches the number of planned partitions, too.)

However, if I set work_mem to 5MB (or less), it looks like this:

   ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=20849.490..37027.533 rows=2000000 loops=1)
         Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
         Group Key: lineitem.l_partkey
         Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456

So the number of partitions is still 128, but the number of batches
explodes to 11,456 all at once. My guess is that this happens because
the recursive hash aggregation misbehaves in a self-similar fashion
once a certain tipping point has been reached. I expect that the exact
nature of that tipping point is very complicated, and generally
dependent on the dataset, clustering, etc. But I don't think that this
kind of effect will be uncommon.

(FWIW this example requires ~620MB work_mem to complete without
spilling at all -- so it's kind of extreme, though not quite as
extreme as many of the similar test results from Tomas.)
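
For reference, the aggregate step can be exercised on its own with
something like the sketch below. The query is only my reconstruction
from the plan fragment above (lineitem, l_partkey, 0.2 *
avg(l_quantity)); Tomas' actual TPC-H query wraps this aggregate in a
larger query, so treat it as an approximation:

-- Approximate reproduction of the aggregate step shown above
-- (reconstructed from the plan fragment; not necessarily the exact query).
SET work_mem = '6MB';  -- repeat with '5MB' to cross the tipping point

EXPLAIN (ANALYZE, VERBOSE)
SELECT l_partkey, 0.2 * avg(l_quantity)
FROM lineitem
GROUP BY l_partkey;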

-- 
Peter Geoghegan


