Home > mailing lists

Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Trouble with hashagg spill I/O pattern and costing
Date	May 21, 2020 00:12:55
Msg-id	20200521001255.kfaihp3afv6vy6uq@development Whole thread Raw
In response to	Re: Trouble with hashagg spill I/O pattern and costing (Jeff Davis <pgsql@j-davis.com>)
Responses	Re: Trouble with hashagg spill I/O pattern and costing
List	pgsql-hackers

Tree view

On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
>I reproduced the problem on a smaller scale (330M groups, ~30GB of
>memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.
>
>Unpatched master:
>   Sort: 250s
>   HashAgg: 310s
>Patched master:
>   Sort: 245s
>   HashAgg: 262s
>
>That's a nice improvement for such a simple patch. We can tweak the
>number of blocks to preallocate, or do other things like double from a
>small number up to a maximum. Also, a proper patch would probably
>release the blocks back as free when the tape was rewound.
>
>As long as the number of block numbers to preallocate is not too large,
>I don't think we need to change the API. It seems fine for sort to do
>the same thing, even though there's not any benefit.
>

I gave it a try on the machine with temp tablespace on SSD, and I can
confirm it improves performance. I've tried with different work_mem
values and I've also increased the number of pre-allocated blocks to 64
and 128 blocks, and the numbers look like this:

master

                sort      hash
    ----------------------------
       4MB       335      1331
     128MB       220      1208


patched (32)

                sort      hash
    ----------------------------
       4MB       344       685
     128MB       217       641


patched (64)

                 sort      hash
    ----------------------------
       4MB       329       545
     128MB       214       493

patched (128)

                sort      hash
    ----------------------------
       4MB       331       478
     128MB       222       434


I agree that's pretty nice. I wonder how far would we need to go before
reaching a plateau. I'll try this on the other machine with temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.


size of I/O requests
--------------------

a) master

      type |  bytes  |  count  |  pct   
     ------+---------+---------+--------
      RA   |    8192 | 2905948 |  95.83
      RA   |   24576 |   63470 |   2.09
      RA   |   16384 |   40155 |   1.32
      W    |    8192 |  149295 |  52.85
      W    |   16384 |   51781 |  18.33
      W    |   24576 |   22247 |   7.88
      W    | 1310720 |   15493 |   5.48
      W    |   32768 |   11856 |   4.20

b) patched, 32 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 247686 |  41.75
      RA   |    8192 |  95746 |  16.14
      RA   |   16384 |  82314 |  13.87
      RA   |   32768 |  82146 |  13.85
      RA   |   65536 |  82126 |  13.84
      W    | 1310720 |  16815 |  52.19
      W    |  262144 |   3628 |  11.26
      W    |  524288 |   2791 |   8.66

c) patched, 64 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 213556 |  56.18
      RA   |    8192 |  47663 |  12.54
      RA   |   16384 |  39358 |  10.35
      RA   |   32768 |  39308 |  10.34
      RA   |   65536 |  39304 |  10.34
      W    | 1310720 |  18132 |  65.27
      W    |  524288 |   3722 |  13.40
      W    |  262144 |    581 |   2.09
      W    | 1048576 |    405 |   1.46
      W    |    8192 |    284 |   1.02

d) patched, 128 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 200816 |  70.93
      RA   |    8192 |  23640 |   8.35
      RA   |   16384 |  19324 |   6.83
      RA   |   32768 |  19279 |   6.81
      RA   |   65536 |  19273 |   6.81
      W    | 1310720 |  18000 |  65.91
      W    |  524288 |   2074 |   7.59
      W    | 1048576 |    660 |   2.42
      W    |    8192 |    409 |   1.50
      W    |  786432 |    354 |   1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.


deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 569237 | 18.77
      RA   |         240 | 475182 | 15.67
      RA   |         272 | 437260 | 14.42
      RA   |         224 | 328604 | 10.84
      RA   |         288 | 293628 |  9.68
      RA   |         208 | 199530 |  6.58
      RA   |         304 | 181695 |  5.99
      RA   |         192 | 109472 |  3.61
      RA   |         320 | 105211 |  3.47
      RA   |         336 |  57423 |  1.89

b) patched, 32 blocks

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 165071 | 27.82
      RA   |          32 |  82129 | 13.84
      RA   |          64 |  82122 | 13.84
      RA   |         128 |  82077 | 13.83
      RA   |          16 |  82042 | 13.83
      RA   |        7440 |  45168 |  7.61
      RA   |        7952 |   9838 |  1.66

c) patched, 64 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 173737 | 45.70
      RA   |          32 |  39301 | 10.34
      RA   |          64 |  39299 | 10.34
      RA   |         128 |  39291 | 10.34
      RA   |          16 |  39250 | 10.32
      RA   |       15120 |  21202 |  5.58
      RA   |       15376 |   4448 |  1.17

d) patched, 128 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 180955 | 63.91
      RA   |          32 |  19274 |  6.81
      RA   |          64 |  19273 |  6.81
      RA   |         128 |  19264 |  6.80
      RA   |          16 |  19203 |  6.78
      RA   |       30480 |   9835 |  3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Thomas Munro
Date: 21 May 2020, 00:02:36
Subject: Re: Parallel Seq Scan vs kernel read ahead

From: Alexander Korotkov
Date: 21 May 2020, 00:17:31
Subject: Re: Operator class parameters and sgml docs

Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

Previous

Next