Re: Disk-based hash aggregate's cost model - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Disk-based hash aggregate's cost model
Date
Msg-id 20200830150315.taldaoobpaiv3ot5@development
Whole thread Raw
In response to Re: Disk-based hash aggregate's cost model  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Disk-based hash aggregate's cost model
Re: Disk-based hash aggregate's cost model
List pgsql-hackers
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and we can reduce the number of plan
>>changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating what's wrong.
>I've shared some measurements and symptoms, and we've discussed what
>might be causing it, but I'm not really sure we know for sure.
>
>I really dislike (1) because it seems more like "We don't know what's
>wrong so we'll penalize hashagg," kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>by adding some constant. For (2) it really depends if recursive spilling
>is really the problem here. In the examples I shared, the number of
>partitions/batches was very different, but the query duration was
>mostly independent (almost constant).
>

I've decided to look at the costing a bit more closely today, and see
why the costing is so much lower compared to sort/groupagg. I've used
the same 32GB dataset and query as in [1].

I've repeated tests for all the work_mem values, and I see the number of
batches are much lower, probably thanks to the HLL improvement:

       2MB    Planned:  64    Batches (old):  4160    Batches:  2977
       4MB    Planned: 128    Batches (old): 16512    Batches:  1569
       8MB    Planned: 256    Batches (old): 21488    Batches:  1289
      64MB    Planned:  32    Batches (old):  2720    Batches:   165
     256MB    Planned:   8    Batches (old):     8    Batches:    41

The impact on duration of the queries seems pretty negligible, though.


The plans with work_mem=64MB look like this:

1) hashagg

                                                                QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
    ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642
rows=6400000loops=1)
 
          Group Key: lineitem.l_partkey
          Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
          ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631
rows=192000551loops=1)
 
  Planning Time: 0.064 ms
  Execution Time: 213441.986 ms
(7 rows)

2) groupagg

                                                                   QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
    ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683
rows=6400000loops=1)
 
          Group Key: lineitem.l_partkey
          ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727
rows=192000551loops=1)
 
                Sort Key: lineitem.l_partkey
                Sort Method: external merge  Disk: 3742208kB
                ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264
rows=192000551loops=1)
 
  Planning Time: 0.063 ms
  Execution Time: 180209.580 ms
(9 rows)


I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the same
estimates internally. cost_tuplesort ends up with

     npages = 937455
     nruns  = 114.435396
     input_bytes = 7679629440
     log_runs = 1.0

while cost_agg uses

     pages_read = 937455
     pages_written = 937455
     relation_size = 7679629440;

yet we end up with much lower estimate for hashagg. It however does seem
to me this is mostly due to non-I/O costs, considered by cost_tuplesort
and perhaps ignored by cost_agg? In particular, most of the sort cost
comes from this

     *startup_cost = comparison_cost * tuples * LOG2(tuples);

So I'm wondering if the hashagg is not ignoring similar non-I/O costs
for the spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to so some of the stuff
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.

The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg
the iosnoop shows 5921MB reads and 7185MB writes, while sort only does
2895MB reads and 3655MB writes. Which kinda matches the observed sizes
of temp files in the two cases, so the input_bytes for sort seems to be
a bit overestimated.


regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Rare link canary failure in dblink test
Next
From: Tom Lane
Date:
Subject: Re: Row estimates for empty tables