Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From:      Tomas Vondra
Subject:   Trouble with hashagg spill I/O pattern and costing
Msg-id:    20200519151202.u2p2gpiawoaznsv2@development
List:      pgsql-hackers
Responses: Re: Trouble with hashagg spill I/O pattern and costing
           Re: Trouble with hashagg spill I/O pattern and costing
Hi,

I've been running some TPC-H benchmarks on master, to check if there's
something unexpected, and I ran into some annoying issues with Q17 and Q20.
I'll use Q17 as it's a bit simpler. I think there are two related problems -
with costing and with excessive I/O due to using logical tapes.

Let's talk about the costing first. On 75GB scale (with parallel query
disabled), the execution plan looks like this:

                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=16997740.10..16997740.12 rows=1 width=32)
   ->  Aggregate  (cost=16997740.10..16997740.12 rows=1 width=32)
         ->  Nested Loop  (cost=14204895.82..16997574.11 rows=66397 width=8)
               Join Filter: (part.p_partkey = lineitem.l_partkey)
               ->  Hash Join  (cost=14204895.25..16251060.84 rows=6640 width=40)
                     Hash Cond: (lineitem_1.l_partkey = part.p_partkey)
                     ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
                           Group Key: lineitem_1.l_partkey
                           Planned Partitions: 128
                           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
                     ->  Hash  (cost=227058.33..227058.33 rows=6846 width=4)
                           ->  Seq Scan on part  (cost=0.00..227058.33 rows=6846 width=4)
                                 Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
               ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..112.30 rows=10 width=17)
                     Index Cond: (l_partkey = lineitem_1.l_partkey)
                     Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
(16 rows)

and if I disable hash aggregate (or spill to disk), it changes to this:

                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=44577524.39..44577524.40 rows=1 width=32)
   ->  Aggregate  (cost=44577524.39..44577524.40 rows=1 width=32)
         ->  Merge Join  (cost=41772792.17..44577358.39 rows=66397 width=8)
               Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
               Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
               ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36)
                     Group Key: lineitem_1.l_partkey
                     ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9)
                           Sort Key: lineitem_1.l_partkey
                           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
               ->  Materialize  (cost=1.00..1191105.89 rows=205371 width=21)
                     ->  Nested Loop  (cost=1.00..1190592.46 rows=205371 width=21)
                           ->  Index Scan using part_pkey on part  (cost=0.43..329262.21 rows=6846 width=4)
                                 Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                           ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..125.51 rows=31 width=17)
                                 Index Cond: (l_partkey = part.p_partkey)
(16 rows)

The problem is that the hashagg plan runs in ~1400 seconds, while the
groupagg only takes ~360.
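(For reference, the query is the standard TPC-H Q17 template with the
parameters visible in the plans above - the exact statement I ran may differ
slightly - and the second plan is simply what you get after disabling hash
aggregation. Roughly:

    EXPLAIN
    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
      FROM lineitem, part
     WHERE p_partkey = l_partkey
       AND p_brand = 'Brand#22'
       AND p_container = 'LG BOX'
       AND l_quantity < (SELECT 0.2 * avg(l_quantity)
                           FROM lineitem
                          WHERE l_partkey = p_partkey);

    -- force the sort + groupagg plan for comparison
    SET enable_hashagg = off;
    EXPLAIN ...;   -- same query

with EXPLAIN ANALYZE used for the timings below.)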
And per explain analyze, the difference really is in the aggregation - if we
subtract the seqscan, the sort+groupagg takes about 310s:

   ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
         Group Key: lineitem_1.l_partkey
         ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
               Sort Key: lineitem_1.l_partkey
               Sort Method: external merge  Disk: 3569544kB
               ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)

while the hashagg takes ~1330s:

   ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
         Group Key: lineitem_1.l_partkey
         Planned Partitions: 128
         Peak Memory Usage: 4249 kB
         Disk Usage: 26321840 kB
         HashAgg Batches: 16512
         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)

And that's while only writing 26GB, compared to 35GB in the sorted plan, and
with cost being ~16M vs. ~43M (so roughly inverse).

OK, let's make the hashagg plan more expensive - that'll fix it, right? But
how do you do that? I might lower work_mem, say from 4MB to 1MB, which gets
us from ~16M:

   ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
         Group Key: lineitem_1.l_partkey
         Planned Partitions: 128
         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

to ~20M (I'm a bit surprised that the planned partitions dropped 4x, but I
suspect there's an explanation for that):

   ->  HashAggregate  (cost=17727162.59..20632321.45 rows=6206695 width=36)
         Group Key: lineitem_1.l_partkey
         Planned Partitions: 32
         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

Anyway, this did not really solve anything. The cost is still much lower than
for groupagg, and moreover I don't want to lower work_mem - I want to
increase the cost for a given work_mem value. And lowering work_mem also
increases the sort cost estimate, of course.

As I'll show in a minute, I believe most of this is due to the I/O pattern on
the logical tapes, which is very different between sort and hashagg. So it'd
be natural to consider seq_page_cost/random_page_cost on the temp tablespace,
but that's not how it works - we just ignore that :-(

Why do I think this is due to a difference in I/O pattern on the logical tape
set? I've moved the temporary tablespace to a separate SSD device, and used
iosnoop [1] to collect all I/O requests associated with this query. Attached
are four charts showing blocks (sectors) accessed over time, both for the
groupagg and hashagg plans.
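In case anyone wants to repeat this, the block-size and delta histograms
below can be reproduced by loading the iosnoop output into a scratch table
and aggregating it - a rough sketch, with made-up table/column names that
simply mirror the iosnoop fields:

    CREATE TABLE iosnoop (
        ts      numeric,   -- STARTs timestamp (iosnoop -ts)
        comm    text,      -- process name
        pid     int,
        type    text,      -- R / RA / W / ...
        dev     text,
        block   bigint,    -- starting sector (512B units)
        bytes   bigint,    -- request size in bytes
        lat_ms  numeric
    );

    -- \copy iosnoop FROM 'iosnoop.out' ...   (after stripping the header line)

    -- most common request sizes, per request type
    SELECT type, bytes, count(*),
           round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY type), 2) AS pct
      FROM iosnoop
     GROUP BY type, bytes
     ORDER BY type, count(*) DESC;

    -- deltas (in 512B sectors) between consecutive requests of the same type
    SELECT type, block_delta, count(*),
           round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY type), 2) AS pct
      FROM (SELECT type,
                   block - lag(block) OVER (PARTITION BY type ORDER BY ts) AS block_delta
              FROM iosnoop) d
     GROUP BY type, block_delta
     ORDER BY type, count(*) DESC;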
1) sort + groupagg

For groupagg (tempio-sort.png) the chart looks a bit chaotic, but it's
reasonable - it shows the sort doing its merges, etc. Nothing particularly
surprising, IMHO.

It's also interesting to look at statistics of block sizes and block deltas
for the different request types. The most common block sizes look something
like this (the last column is the percentage of all requests with the same
request type):

  type |  bytes  | count |  pct
 ------+---------+-------+-------
  RA   |  131072 | 26034 | 59.92
  RA   |   16384 |  6160 | 14.18
  RA   |    8192 |  3636 |  8.37
  RA   |   32768 |  3406 |  7.84
  RA   |   65536 |  3270 |  7.53
  RA   |   24576 |   361 |  0.83
  ...
  W    | 1310720 |  8070 | 34.26
  W    |  262144 |  1213 |  5.15
  W    |  524288 |  1056 |  4.48
  W    | 1056768 |   689 |  2.93
  W    |  786432 |   292 |  1.24
  W    |  802816 |   199 |  0.84
  ...

The writes are buffered and so are done by kworkers, which seem to be able to
coalesce them into fairly large chunks (e.g. 34% are 1280kB). The reads come
from the postgres backend, and are generally 128kB reads.

The deltas (in 512B sectors) are mostly consistent with this:

  type | block_delta | count |  pct
 ------+-------------+-------+-------
  RA   |         256 | 13432 | 30.91
  RA   |          16 |  3291 |  7.57
  RA   |          32 |  3272 |  7.53
  RA   |          64 |  3266 |  7.52
  RA   |         128 |  2877 |  6.62
  RA   |        1808 |  1278 |  2.94
  RA   |       -2320 |   483 |  1.11
  RA   |       28928 |   386 |  0.89
  ...
  W    |        2560 |  7856 | 33.35
  W    |        2064 |  4921 | 20.89
  W    |        2080 |   586 |  2.49
  W    |       30960 |   300 |  1.27
  W    |        2160 |   253 |  1.07
  W    |        1024 |   248 |  1.05
  ...

I believe this suggests most of the I/O is pretty sequential. E.g. 31% of the
reads are 256 sectors (128kB) apart, which is proportional to the 128kB
reads.

2) hashagg

The I/O pattern is illustrated by the tempio-hash.png chart, and it's clearly
very different from the sort one. We're reading over and over in a zig-zag
way. I'm pretty sure there are ~128 cycles, matching the number of partitions
in the explain analyze output, which end up being interleaved in the
temporary file. But even at the partition level this is not very sequential -
there are two "zoom" charts showing smaller parts in more detail, and there's
a very obvious nested zig-zag pattern.

Also, let's look at the block / delta stats:

  type |  bytes  |  count  |  pct
 ------+---------+---------+--------
  RA   |    8192 | 3087724 |  95.42
  RA   |   24576 |   69511 |   2.15
  RA   |   16384 |   49297 |   1.52
  RA   |   32768 |   15589 |   0.48
  ...
  W    |    8192 |  321089 |  65.72
  W    |   16384 |   74097 |  15.17
  W    |   24576 |   27785 |   5.69
  W    | 1310720 |   16860 |   3.45
  W    |   32768 |   13823 |   2.83
  W    |   40960 |    7771 |   1.59
  W    |   49152 |    4767 |   0.98
  ...

Well, that's not great - we're not really coalescing reads or writes,
everything is pretty much an 8kB block. The writes are especially
surprising/concerning, because they show the kernel is unable to combine the
requests.

The deltas look very different too:

  type | block_delta | count |  pct
 ------+-------------+-------+-------
  RA   |        2016 | 72399 |  2.24
  RA   |        2032 | 72351 |  2.24
  RA   |        1984 | 72183 |  2.23
  RA   |        2000 | 71964 |  2.22
  RA   |        2048 | 71718 |  2.22
  RA   |        2064 | 71387 |  2.21
  RA   |        1968 | 71363 |  2.21
  RA   |        1952 | 70412 |  2.18
  RA   |        2080 | 70189 |  2.17
  RA   |        2096 | 69568 |  2.15
  RA   |        1936 | 69109 |  2.14
  RA   |        1920 | 67660 |  2.09
  RA   |        2112 | 67248 |  2.08
  RA   |        1904 | 66026 |  2.04
  ...

There's no clear winner matching the block size, or anything like that. In
fact, the deltas oscillate around 2000 sectors, i.e. 1MB. And 128 partitions
multiplied by one 8kB block per partition is ... 1MB (tadaaaa!).
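To spell that arithmetic out: reading one 8kB block from each of the 128
interleaved partitions means successive reads of the same partition land
roughly 1MB (2048 sectors) apart, which is about where the deltas cluster:

    SELECT 128 * 8192 AS stride_bytes,
           128 * 8192 / 512 AS stride_sectors;

     stride_bytes | stride_sectors
    --------------+----------------
          1048576 |           2048
    (1 row)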
This however makes any read-ahead attempts ineffective :-( And let me repeat
- this is on a machine with the temp tablespace moved to an SSD, so the
random I/O is not entirely terrible. On a different box with the temp
tablespace on 3x SATA RAID, the impact is much worse.

This kinda makes me question whether logical tapes are the right tool for
hashagg. I've read the explanation in logtape.c of why it's about the same
amount of I/O as using separate files, but IMO that only really works for
I/O patterns similar to merge sort - the more I think about this, the more
I'm convinced we should just do what hashjoin is doing. But maybe I'm wrong,
and logical tapes are the best thing we can do here. In that case, though, I
think we need to improve the costing, so that it reflects the very different
I/O pattern.

[1] https://github.com/brendangregg/perf-tools/blob/master/iosnoop

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services