From: Andres Freund
Subject: BAS_BULKREAD vs read stream
Msg-id: lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
List: pgsql-hackers
Hi,

There are two issues with the interaction between BAS_BULKREAD and read
streams.  One for 17+ and one for 18+.


The 17+ issue:

While trying to analyze issues with BAS_BULKREAD's size vs AIO I was getting
really confusing results. After an embarrassingly long time I figured out that
a good chunk of those were due to BAS_BULKREAD not actually limiting buffer
usage - we'd end up filling shared buffers despite using the strategy.


There are two reasons for that:

- read_stream.c doesn't account for the buffer it is returning from
  read_stream_next_buffer().

  If we subtract 1 from the limit returned by GetAccessStrategyPinLimit(),
  this avenue for escaping the strategy is closed and the amount of "leakage"
  is reduced very substantially.


- read_stream.c doesn't know about buffer pins held externally. For sequential
  scans, the pin held by the slot passed to heap_getnextslot() will
  occasionally still be held when the strategy wants to reuse that buffer,
  preventing reuse.

  This causes a slower "leak", but it's still rather substantial.

  Obviously there can be plans that pin buffers for longer, but that's
  relatively rare and nothing new. Whereas leaking buffers even with a plain
  seqscan is new in 17.


This issue exists in 17, but is harder to hit, because we don't actually read
ahead with sequential scans there. If we just do a single read at a time, we
don't pin more than io_combine_limit buffers at once. The default
io_combine_limit is 16, whereas BAS_BULKREAD is 32 buffers with the default
BLCKSZ=8192. As soon as one uses io_combine_limit=32 the issue reproduces on
17.
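
To spell out the arithmetic with the defaults (BLCKSZ=8192, BAS_BULKREAD ring
of 32 buffers = 256kB):

  io_combine_limit = 16  =>  a single combined read pins at most
                             16 * 8kB = 128kB, i.e. half of the ring
  io_combine_limit = 32  =>  a single combined read pins 32 * 8kB = 256kB,
                             i.e. the whole ring, so the returned buffer and
                             the slot's pin push us past the ring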

With AIO this obviously becomes easier to hit.


I think medium term we need to overhaul how strategies work rather
substantially. The whole concept isn't quite right [1].

But for now we need a solution that's doable for both 17 and 18.  While not
pretty, I think the best short term solution is to just subtract 2 from the
strategy limit. That seems to work for all common uses of BAS_BULKREAD
currently.

I suspect that we should subtract one in read_stream.c (to account for the
buffer returned by read_stream_next_buffer()) and one in
GetAccessStrategyPinLimit() (to account for pins held externally).
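
Roughly along these lines (untested sketch, just to show where the two
subtractions would go, modulo whatever the surrounding code actually looks
like):

    /* freelist.c: reserve one buffer for pins held outside the stream,
     * e.g. the slot pin from heap_getnextslot() */
    int
    GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
    {
        if (strategy == NULL)
            return NBuffers;

        switch (strategy->btype)
        {
            case BAS_BULKREAD:
                /* was: return strategy->nbuffers; */
                return Max(strategy->nbuffers - 1, 1);

            default:
                /* other strategy types unchanged */
                return strategy->nbuffers / 2;
        }
    }

    /* read_stream.c, read_stream_begin_impl(): additionally account for the
     * pinned buffer we hand back from read_stream_next_buffer() */
    strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    strategy_pin_limit = Max(strategy_pin_limit - 1, 1);
    max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);

The Max(..., 1) is there so we never hand back a zero pin limit for absurdly
small rings.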


The 18+ issue:

Right now there's a noticeable performance drop when going from a seqscan that
doesn't use BAS_BULKREAD to one that uses BAS_BULKREAD. This is very
pronounced when not hitting the OS page cache, but noticeable even otherwise.

The reason for that slowdown is that BAS_BULKREAD is so small that it just
allows 1-2 IOs in flight (io_combine_limit = 16 / 128kB vs a BAS_BULKREAD ring
of 256kB). Whereas, with a smaller table that doesn't trigger BAS_BULKREAD, we
can have 16 IOs in flight with the default settings, assuming large enough
shared_buffers (with a very small s_b we'll limit how many buffers we pin).
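
To make the comparison concrete, with the default settings:

  with the 256kB ring:  256kB / (16 * 8kB = 128kB per combined IO)
                        => at most 2 IOs in flight
  without a strategy:   up to effective_io_concurrency = 16 IOs in flight
                        (given enough shared_buffers)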

The obvious solution to that would be to increase BAS_BULKREAD substantially
above 256kB.


For quite a while I was worried about increasing the size, because somewhere
(I couldn't find it while writing this email, I'll add the reference once I
find it again) we have a comment explaining that a small size was chosen
because it helps with CPU cache efficiency.

Initially I could indeed see sizable effects, but a good chunk of that turned
out to be the issue above, where small ring sizes simply wouldn't actually use
the ring and thus have completely different performance characteristics.

I *can* easily reproduce negative effects of doing very large reads when the
data is in the page cache. That does indeed appear to be a cache effect,
related to some CPU oddities around copying between kernel and user memory,
see [3]. But that's a separate issue from the ring size itself.


I can also reproduce some negative effects of using larger ring sizes with
io_method=sync. A 10GB pg_prewarm(), which I extended with the ability to use
a strategy of a configurable size, slows down ~22% when going from an 8MB to a
16MB ring. Obviously io_method=sync does not benefit from a larger ring size,
as it just executes IOs synchronously, one at a time, and thus is never
limited by the ring size in the first place.

The slowdown for io_method=worker with very small ring sizes is substantial,
whether the data is in the page cache or not. It simply limits the
asynchronicity too much. At 256kB io_method=worker is ~50% slower than
io_method=sync, at 512kB it's 20% faster, and at 1MB io_method=worker is 2.5x
faster.

With io_uring there is a 7% regression at 128kB (smaller than the current
default), at 256kB io_uring is 5% faster, reaching 1.9x at 3MB.


I think we should consider increasing BAS_BULKREAD to something like
  Min(256, io_combine_limit * (effective_io_concurrency + 1))
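
In GetAccessStrategy() that would look roughly like this (handwavy sketch, not
a patch; sizes in buffers, so with the defaults we'd end up at the 256 buffer
/ 2MB cap):

    case BAS_BULKREAD:
        {
            int         ring_buffers;

            /* scale the ring with the amount of IO we're allowed to keep in
             * flight, capped at 256 buffers (2MB at the default BLCKSZ) */
            ring_buffers = Min(256,
                               io_combine_limit * (effective_io_concurrency + 1));
            ring_size_kb = ring_buffers * (BLCKSZ / 1024);
        }
        break;

With effective_io_concurrency = 1 that stays close to today's 256kB, with the
default effective_io_concurrency = 16 it hits the 2MB cap.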


As I said earlier, I think we need to redesign strategies, but this seems to
address the regression when going from no-strategy to strategy, without
causing any meaningful regressions.


I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be
increased, and couldn't come up with any benefits. Increasing it seems to hurt
fairly quickly.


Greetings,

Andres Freund


[1]

The whole idea of reusing buffers after a set distance is close to nonsensical
for BAS_BULKREAD - we should reuse them as soon as viable (i.e. not pinned
anymore), for cache efficiency. The only reason deferring the reuse makes
*some* sense is [2].

It doesn't even really make sense for the other BAS_* strategies. What we
should control there is the amount of dirty buffers and the rate at which we
flush WAL. But with BAS_VACUUM that doesn't work at all, as it's used for both
reads and writes. If vacuum does modify buffers, we can end up flushing WAL
way too frequently, and if we only rarely modify buffers, the number of
buffers in the ring is too big.


[2]

For synchronize_seqscans to work, the BAS_BULKREAD size needs to be bigger
than SYNC_SCAN_REPORT_INTERVAL. The factor is only 2x right now - way, way too
small with the IO rates achievable on even a cheap laptop. Cheap storage can
do reads at gigabytes/second and one backend can process many hundreds of MB
each second. Having only 128kB between SYNC_SCAN_REPORT_INTERVAL and the
BAS_BULKREAD ring size makes it very likely that we have to re-read the data
from the OS.  That's bad with buffered IO, catastrophic with direct IO.
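
To put rough numbers on that:

  SYNC_SCAN_REPORT_INTERVAL = 128kB, BAS_BULKREAD ring = 256kB
  => at most 128kB of headroom between the reported position and the buffers
     the leading scan has already recycled; at many hundreds of MB/s per
     backend that headroom is gone in well under a millisecond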

On a local NVMe SSD, on a large table with the normal BAS_BULKREAD, I often
get effectively *no* buffer hits when running two concurrent seqscans, even if
I disable query parallelism.

The hit count is sometimes in the 10s, sometimes in the low thousands, but
always < 0.1%.  With a 16MB ring I get 25% hits, at 128MB it's close to 50%
(the maximum you could get with two scans).

With query parallelism I see very low hit rates even with 16MB; only with
256MB does it start to get better.


[3] https://postgr.es/m/yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq%403r54w5zqldwn


