Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Trouble with hashagg spill I/O pattern and costing |
Date | |
Msg-id | 20200521001255.kfaihp3afv6vy6uq@development Whole thread Raw |
In response to | Re: Trouble with hashagg spill I/O pattern and costing (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Trouble with hashagg spill I/O pattern and costing
|
List | pgsql-hackers |
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote: >On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote: >> >> And if there a way to pre-allocate larger chunks? Presumably we could >> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x >> 8kB) >> instead of just single block. I haven't seen anything like that in >> tape.c, though ... > >It turned out to be simple (at least a POC) so I threw together a >patch. I just added a 32-element array of block numbers to each tape. >When we need a new block, we retrieve a block number from that array; >or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times. > >I reproduced the problem on a smaller scale (330M groups, ~30GB of >memory on a 16GB box). Work_mem=64MB. The query is a simple distinct. > >Unpatched master: > Sort: 250s > HashAgg: 310s >Patched master: > Sort: 245s > HashAgg: 262s > >That's a nice improvement for such a simple patch. We can tweak the >number of blocks to preallocate, or do other things like double from a >small number up to a maximum. Also, a proper patch would probably >release the blocks back as free when the tape was rewound. > >As long as the number of block numbers to preallocate is not too large, >I don't think we need to change the API. It seems fine for sort to do >the same thing, even though there's not any benefit. > I gave it a try on the machine with temp tablespace on SSD, and I can confirm it improves performance. I've tried with different work_mem values and I've also increased the number of pre-allocated blocks to 64 and 128 blocks, and the numbers look like this: master sort hash ---------------------------- 4MB 335 1331 128MB 220 1208 patched (32) sort hash ---------------------------- 4MB 344 685 128MB 217 641 patched (64) sort hash ---------------------------- 4MB 329 545 128MB 214 493 patched (128) sort hash ---------------------------- 4MB 331 478 128MB 222 434 I agree that's pretty nice. I wonder how far would we need to go before reaching a plateau. I'll try this on the other machine with temporary tablespace on SATA, but that'll take longer. The I/O pattern changed significantly - it's not visible on the charts, so I'm not attaching them. But the statistics of block sizes and "gaps" are pretty clear. size of I/O requests -------------------- a) master type | bytes | count | pct ------+---------+---------+-------- RA | 8192 | 2905948 | 95.83 RA | 24576 | 63470 | 2.09 RA | 16384 | 40155 | 1.32 W | 8192 | 149295 | 52.85 W | 16384 | 51781 | 18.33 W | 24576 | 22247 | 7.88 W | 1310720 | 15493 | 5.48 W | 32768 | 11856 | 4.20 b) patched, 32 blocks type | bytes | count | pct ------+---------+--------+-------- RA | 131072 | 247686 | 41.75 RA | 8192 | 95746 | 16.14 RA | 16384 | 82314 | 13.87 RA | 32768 | 82146 | 13.85 RA | 65536 | 82126 | 13.84 W | 1310720 | 16815 | 52.19 W | 262144 | 3628 | 11.26 W | 524288 | 2791 | 8.66 c) patched, 64 blocks type | bytes | count | pct ------+---------+--------+-------- RA | 131072 | 213556 | 56.18 RA | 8192 | 47663 | 12.54 RA | 16384 | 39358 | 10.35 RA | 32768 | 39308 | 10.34 RA | 65536 | 39304 | 10.34 W | 1310720 | 18132 | 65.27 W | 524288 | 3722 | 13.40 W | 262144 | 581 | 2.09 W | 1048576 | 405 | 1.46 W | 8192 | 284 | 1.02 d) patched, 128 blocks type | bytes | count | pct ------+---------+--------+-------- RA | 131072 | 200816 | 70.93 RA | 8192 | 23640 | 8.35 RA | 16384 | 19324 | 6.83 RA | 32768 | 19279 | 6.81 RA | 65536 | 19273 | 6.81 W | 1310720 | 18000 | 65.91 W | 524288 | 2074 | 7.59 W | 1048576 | 660 | 2.42 W | 8192 | 409 | 1.50 W | 786432 | 354 | 1.30 Clearly, the I/O requests are much larger - both reads and writes shifted from 8kB to much larger ones, and the larger the number of blocks the more significant the shift is. This means the workload is getting more "sequential" and the write combining / read-ahead becomes more effective. deltas between I/O requests --------------------------- I'll only show reads to save space, it's about the same for writes. a) master type | block_delta | count | pct ------+-------------+--------+------- RA | 256 | 569237 | 18.77 RA | 240 | 475182 | 15.67 RA | 272 | 437260 | 14.42 RA | 224 | 328604 | 10.84 RA | 288 | 293628 | 9.68 RA | 208 | 199530 | 6.58 RA | 304 | 181695 | 5.99 RA | 192 | 109472 | 3.61 RA | 320 | 105211 | 3.47 RA | 336 | 57423 | 1.89 b) patched, 32 blocks type | block_delta | count | pct ------+-------------+--------+------- RA | 256 | 165071 | 27.82 RA | 32 | 82129 | 13.84 RA | 64 | 82122 | 13.84 RA | 128 | 82077 | 13.83 RA | 16 | 82042 | 13.83 RA | 7440 | 45168 | 7.61 RA | 7952 | 9838 | 1.66 c) patched, 64 blocks type | block_delta | count | pct ------+-------------+--------+------- RA | 256 | 173737 | 45.70 RA | 32 | 39301 | 10.34 RA | 64 | 39299 | 10.34 RA | 128 | 39291 | 10.34 RA | 16 | 39250 | 10.32 RA | 15120 | 21202 | 5.58 RA | 15376 | 4448 | 1.17 d) patched, 128 blocks type | block_delta | count | pct ------+-------------+--------+------- RA | 256 | 180955 | 63.91 RA | 32 | 19274 | 6.81 RA | 64 | 19273 | 6.81 RA | 128 | 19264 | 6.80 RA | 16 | 19203 | 6.78 RA | 30480 | 9835 | 3.47 The way I understand it, this needs to be interpreted together with block size stats - in a perfectly sequential workload the two stats would match. For master that's clearly not the case - the most common read request size is 8kB, but the most common delta is 128kB (256 sectors, which is the read-ahead for the SSD device). The patched results are much closer, mostly thanks to switching to 128kB reads. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: