Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Trouble with hashagg spill I/O pattern and costing
Date
Msg-id 20200521001255.kfaihp3afv6vy6uq@development
Whole thread Raw
In response to Re: Trouble with hashagg spill I/O pattern and costing  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Trouble with hashagg spill I/O pattern and costing
List pgsql-hackers
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
>I reproduced the problem on a smaller scale (330M groups, ~30GB of
>memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.
>
>Unpatched master:
>   Sort: 250s
>   HashAgg: 310s
>Patched master:
>   Sort: 245s
>   HashAgg: 262s
>
>That's a nice improvement for such a simple patch. We can tweak the
>number of blocks to preallocate, or do other things like double from a
>small number up to a maximum. Also, a proper patch would probably
>release the blocks back as free when the tape was rewound.
>
>As long as the number of block numbers to preallocate is not too large,
>I don't think we need to change the API. It seems fine for sort to do
>the same thing, even though there's not any benefit.
>

I gave it a try on the machine with temp tablespace on SSD, and I can
confirm it improves performance. I've tried with different work_mem
values and I've also increased the number of pre-allocated blocks to 64
and 128 blocks, and the numbers look like this:

master

                sort      hash
    ----------------------------
       4MB       335      1331
     128MB       220      1208


patched (32)

                sort      hash
    ----------------------------
       4MB       344       685
     128MB       217       641


patched (64)

                 sort      hash
    ----------------------------
       4MB       329       545
     128MB       214       493

patched (128)

                sort      hash
    ----------------------------
       4MB       331       478
     128MB       222       434


I agree that's pretty nice. I wonder how far would we need to go before
reaching a plateau. I'll try this on the other machine with temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.


size of I/O requests
--------------------

a) master

      type |  bytes  |  count  |  pct   
     ------+---------+---------+--------
      RA   |    8192 | 2905948 |  95.83
      RA   |   24576 |   63470 |   2.09
      RA   |   16384 |   40155 |   1.32
      W    |    8192 |  149295 |  52.85
      W    |   16384 |   51781 |  18.33
      W    |   24576 |   22247 |   7.88
      W    | 1310720 |   15493 |   5.48
      W    |   32768 |   11856 |   4.20

b) patched, 32 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 247686 |  41.75
      RA   |    8192 |  95746 |  16.14
      RA   |   16384 |  82314 |  13.87
      RA   |   32768 |  82146 |  13.85
      RA   |   65536 |  82126 |  13.84
      W    | 1310720 |  16815 |  52.19
      W    |  262144 |   3628 |  11.26
      W    |  524288 |   2791 |   8.66

c) patched, 64 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 213556 |  56.18
      RA   |    8192 |  47663 |  12.54
      RA   |   16384 |  39358 |  10.35
      RA   |   32768 |  39308 |  10.34
      RA   |   65536 |  39304 |  10.34
      W    | 1310720 |  18132 |  65.27
      W    |  524288 |   3722 |  13.40
      W    |  262144 |    581 |   2.09
      W    | 1048576 |    405 |   1.46
      W    |    8192 |    284 |   1.02

d) patched, 128 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 200816 |  70.93
      RA   |    8192 |  23640 |   8.35
      RA   |   16384 |  19324 |   6.83
      RA   |   32768 |  19279 |   6.81
      RA   |   65536 |  19273 |   6.81
      W    | 1310720 |  18000 |  65.91
      W    |  524288 |   2074 |   7.59
      W    | 1048576 |    660 |   2.42
      W    |    8192 |    409 |   1.50
      W    |  786432 |    354 |   1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.


deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 569237 | 18.77
      RA   |         240 | 475182 | 15.67
      RA   |         272 | 437260 | 14.42
      RA   |         224 | 328604 | 10.84
      RA   |         288 | 293628 |  9.68
      RA   |         208 | 199530 |  6.58
      RA   |         304 | 181695 |  5.99
      RA   |         192 | 109472 |  3.61
      RA   |         320 | 105211 |  3.47
      RA   |         336 |  57423 |  1.89

b) patched, 32 blocks

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 165071 | 27.82
      RA   |          32 |  82129 | 13.84
      RA   |          64 |  82122 | 13.84
      RA   |         128 |  82077 | 13.83
      RA   |          16 |  82042 | 13.83
      RA   |        7440 |  45168 |  7.61
      RA   |        7952 |   9838 |  1.66

c) patched, 64 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 173737 | 45.70
      RA   |          32 |  39301 | 10.34
      RA   |          64 |  39299 | 10.34
      RA   |         128 |  39291 | 10.34
      RA   |          16 |  39250 | 10.32
      RA   |       15120 |  21202 |  5.58
      RA   |       15376 |   4448 |  1.17

d) patched, 128 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 180955 | 63.91
      RA   |          32 |  19274 |  6.81
      RA   |          64 |  19273 |  6.81
      RA   |         128 |  19264 |  6.80
      RA   |          16 |  19203 |  6.78
      RA   |       30480 |   9835 |  3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Parallel Seq Scan vs kernel read ahead
Next
From: Alexander Korotkov
Date:
Subject: Re: Operator class parameters and sgml docs