Re: Disk-based hash aggregate's cost model - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: Disk-based hash aggregate's cost model
Date:
Msg-id:         20200830150315.taldaoobpaiv3ot5@development
In response to: Re: Disk-based hash aggregate's cost model (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses:      Re: Disk-based hash aggregate's cost model
                Re: Disk-based hash aggregate's cost model
List:           pgsql-hackers
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and we can reduce the number of plan
>>changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating what's wrong.
>I've shared some measurements and symptoms, and we've discussed what
>might be causing it, but I'm not really sure we know for sure.
>
>I really dislike (1) because it seems more like a "we don't know what's
>wrong so we'll penalize hashagg" kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>by adding some constant. For (2) it really depends if recursive spilling
>is really the problem here. In the examples I shared, the number of
>partitions/batches was very different, but the query duration was
>mostly independent (almost constant).
>

I've decided to look at the costing a bit more closely today, to see
why the hashagg estimate is so much lower compared to sort/groupagg.
I've used the same 32GB dataset and query as in [1].
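For illustration, the two tweaks quoted above could be sketched as toy cost adjustments. This is hypothetical Python, not planner code; the penalty factor and the partitions-per-pass limit are invented placeholders, not proposed values:

```python
import math

def spilled_cost_option1(disk_cost, penalty=2.0):
    """Option 1: penalize all HashAgg disk costs by a constant factor.
    The factor 2.0 is a made-up placeholder."""
    return disk_cost * penalty

def spilled_cost_option2(disk_cost, nbatches, partitions_per_pass=256):
    """Option 2: penalize only recursive spilling - if one pass cannot
    produce enough partitions, each extra recursion level re-reads and
    re-writes the spilled data, multiplying the disk cost."""
    depth = max(1, math.ceil(math.log(nbatches, partitions_per_pass)))
    return disk_cost * depth

# A single-pass spill is unaffected by option 2, while a recursive
# spill gets progressively more expensive:
print(spilled_cost_option2(100.0, 256))    # 100.0 (one pass)
print(spilled_cost_option2(100.0, 2977))   # 200.0 (two passes)
```

The point of the sketch is only to show where the two options differ: option 1 moves every spilling HashAgg estimate up by a constant, while option 2 leaves single-pass spills alone and only punishes deep recursion.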
I've repeated the tests for all the work_mem values, and I see the
number of batches is much lower, probably thanks to the HLL improvement:

  work_mem   planned   batches (old)   batches (new)
  2MB             64            4160            2977
  4MB            128           16512            1569
  8MB            256           21488            1289
  64MB            32            2720             165
  256MB            8               8              41

The impact on duration of the queries seems pretty negligible, though.

The plans with work_mem=64MB look like this:

1) hashagg

                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
   ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
         ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 213441.986 ms
(7 rows)

2) groupagg

                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
   ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: external merge  Disk: 3742208kB
               ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
 Planning Time: 0.063 ms
 Execution Time: 180209.580 ms
(9 rows)

I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the
same estimates internally. cost_tuplesort ends up with

    npages = 937455
    nruns = 114.435396
    input_bytes = 7679629440
    log_runs = 1.0

while cost_agg uses

    pages_read = 937455
    pages_written = 937455
    relation_size = 7679629440;

yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, which
cost_tuplesort accounts for and cost_agg perhaps ignores. In
particular, most of the sort cost comes from this:

    *startup_cost = comparison_cost * tuples * LOG2(tuples);

So I'm wondering if hashagg is ignoring similar non-I/O costs for the
spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to do some of the work
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.

The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg,
iosnoop shows 5921MB of reads and 7185MB of writes, while sort only
does 2895MB of reads and 3655MB of writes. This kinda matches the
observed sizes of the temp files in the two cases, so the input_bytes
for sort seems to be a bit overestimated.


regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
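As a cross-check, the sort-side numbers quoted in the mail can be reproduced with a little arithmetic. This is a sketch of cost_tuplesort's bookkeeping, not the actual C code; BLCKSZ = 8192 and cpu_operator_cost = 0.0025 are the PostgreSQL defaults, and work_mem is 64MB as in the plans:

```python
import math

BLCKSZ = 8192                       # PostgreSQL page size (default build)
work_mem_bytes = 64 * 1024 * 1024   # 64MB, as in the plans above
input_bytes = 7679629440            # relation_size reported by cost_agg
tuples = 191990736                  # estimated input rows

# cost_tuplesort's external-sort bookkeeping
npages = math.ceil(input_bytes / BLCKSZ)   # pages to read/write
nruns = input_bytes / work_mem_bytes       # initial sorted runs

# The dominant non-I/O term in the sort estimate:
#   *startup_cost = comparison_cost * tuples * LOG2(tuples);
cpu_operator_cost = 0.0025                 # default GUC value
comparison_cost = 2.0 * cpu_operator_cost
cpu_compare_cost = comparison_cost * tuples * math.log2(tuples)

print(npages)                          # 937455, matching the mail
print(round(nruns, 6))                 # 114.435396, matching the mail
print(round(cpu_compare_cost / 1e6, 1))  # ~26.4 million cost units
```

That ~26M of comparison cost is roughly the gap between the groupagg estimate (~36.8M total) and the hashagg estimate (~11.3M), which is consistent with the suspicion above that the difference lies in non-I/O costs rather than in the disk terms.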