BAS_BULKREAD vs read stream

From: Andres Freund

Hi,

There are two issues with BAS_BULKREAD's interaction with read streams. One
for 17+ and one for 18+.


The 17+ issue:

While trying to analyze issues with BAS_BULKREAD's size vs AIO I was getting
really confusing results. After an embarrassingly long time I figured out that
a good chunk of those were due to BAS_BULKREAD not actually limiting buffer
usage - we'd end up filling shared buffers despite using the strategy.


There are two reasons for that:

- read_stream.c doesn't account for the buffer it is returning from
  read_stream_next_buffer()

  If we subtract 1 from GetAccessStrategyPinLimit() this avenue for escaping
  the strategy is closed. The amount of "leakage" is reduced very substantially.


- read_stream.c doesn't know about buffer pins held externally. For sequential
  scans the buffer pin held by the slot passed to heap_getnextslot() will
  occasionally cause a buffer to be still pinned when the strategy wants to
  reuse the buffer, preventing reuse.

  This causes a slower "leak", but it's still rather substantial.

  Obviously there can be plans that pin buffers for longer, but that's
  relatively rare and nothing new. Whereas leaking buffers even with a plain
  seqscan is new in 17.


This issue exists in 17, but is harder to hit, because we don't actually read
ahead with sequential scans. If we just do a single read at a time, we don't
pin more than io_combine_limit buffers. The default io_combine_limit is 16,
whereas BAS_BULKREAD is 32 buffers with the default BLCKSZ=8192. As soon as
one uses io_combine_limit=32, the issue reproduces on 17.

With AIO this obviously becomes easier to hit.


I think medium term we need to overhaul how strategies work rather
substantially. The whole concept isn't quite right [1].

But for now we need a solution that's doable for both 17 and 18.  While not
pretty, I think the best short term solution is to just subtract 2 from the
strategy limit. That seems to work for all common uses of BAS_BULKREAD
currently.

I suspect that we should subtract one in read_stream.c (to account for the
buffer returned by read_stream_next_buffer()) and one in
GetAccessStrategyPinLimit() (to account for pins held externally).
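
Roughly, the two subtractions would land like this (a sketch from memory, not
a patch - variable and struct field names may not match the tree exactly):

    /* read_stream.c, read_stream_begin_impl(): clamp to the ring size */
    if (strategy != NULL)
    {
        int     strategy_pin_limit = GetAccessStrategyPinLimit(strategy);

        /*
         * Subtract 1 so that the buffer handed out by
         * read_stream_next_buffer(), which the stream no longer tracks,
         * still counts against the strategy's ring.
         */
        strategy_pin_limit = Max(strategy_pin_limit - 1, 1);
        max_pinned_buffers = Min(max_pinned_buffers, strategy_pin_limit);
    }

    /* freelist.c: reserve one more buffer for pins held externally */
    int
    GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
    {
        int     limit;

        if (strategy == NULL)
            return NBuffers;

        limit = strategy->nbuffers;
        if (strategy->btype == BAS_BULKREAD)
            limit--;        /* e.g. the slot pin from heap_getnextslot() */
        return Max(limit, 1);
    }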


The 18 issue:

Right now there's a noticeable performance drop when going from a seqscan that
doesn't use BAS_BULKREAD to one that uses BAS_BULKREAD. This is very
pronounced when not hitting the OS page cache, but noticeable even otherwise.

The reason for that slowdown is that BAS_BULKREAD is so small that it just
allows 1-2 IOs in flight (io_combine_limit = 16 / 128kB vs BAS_BULKREAD of
256kB). Whereas, with a smaller table that doesn't hit BAS_BULKREAD, we can
have 16 concurrent IOs with the default settings, assuming large enough
shared_buffers (on very small s_b we'll limit how many buffers we pin).
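
Spelled out: with the defaults, one combined IO is io_combine_limit * BLCKSZ =
16 * 8kB = 128kB, so a 256kB ring fits at most two such IOs, versus the 16
concurrent IOs a no-strategy scan can sustain.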

The obvious solution to that would be to increase BAS_BULKREAD substantially
above 256kB.


For quite a while I was worried about increasing the size, because somewhere (I
couldn't find it while writing this email, will add the reference once I've
found it again) we have a comment explaining that a small size was chosen
because it helps with CPU cache efficiency.

Initially I could indeed see reasonably sized effects, but a good chunk of
that turned out to be the issue above, where small sizes simply wouldn't
actually use the ring and thus have completely different performance
characteristics.

I *can* easily reproduce effects of doing very large reads when the data is in
the page cache; that indeed seems to be a cache effect, apparently related to
some CPU oddities around crossing the kernel/user memory boundary, see [3].
But that's a separate issue from the ring size itself.


I also can reproduce some negative effects of using larger ring sizes for
io_method=sync.  A 10GB pg_prewarm(), which I extended with the ability to use
a strategy of a certain size, slows down ~22% when going from 8MB to
16MB.  Obviously io_method=sync does not benefit from a larger ring size, as
it just executes IOs synchronously; there is no IO concurrency for the ring
size to limit.

The slowdown for io_method=worker for very small ring sizes is substantial,
whether the buffer is in the page cache or not. It simply limits the
asynchronicity too much. At 256kB io_method=worker is ~50% slower than
io_method=sync, at 512kB it's 20% faster, at 1MB io_method=worker is 2.5x
faster.

With io_uring there is a 7% regression at 128kB (lower than the current
default), at 256kB io_uring is 5% faster, reaching 1.9x at 3MB.


I think we should consider increasing BAS_BULKREAD to something like
  Min(256, io_combine_limit * (effective_io_concurrency + 1))
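
For scale, taking the formula with the Max() it is corrected to downthread,
assuming io_combine_limit = 16 and effective_io_concurrency = 16 at
BLCKSZ = 8192, and adding back the (BLCKSZ / 1024) factor that a later
message notes was omitted here, that works out to

  16 * 8kB * (16 + 1) = 2176kB, i.e. ~2MB

of ring: room for effective_io_concurrency full-sized combined IOs plus the
one currently being consumed.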


As I said earlier, I think we need to redesign strategies, but this seems to
address the regression when going from no-strategy to strategy, without
causing any meaningful regressions.


I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased,
and couldn't come up with any benefits. It seems to hurt fairly quickly.


Greetings,

Andres Freund


[1]

The whole idea of reusing buffers after a set distance is close to nonsensical
for BAS_BULKREAD - we should reuse them as soon as viable (i.e. not pinned
anymore), for cache efficiency. The only reason deferring the reuse makes
*some* sense is [2].

It doesn't even really make sense for the other BAS_'s. What we should control
there is the amount of dirty buffers and the rate at which WAL is flushed.
But with BAS_VACUUM that doesn't work at all, as it's used for both reads and
writes. If vacuum does modify buffers, we can end up flushing WAL way too
frequently, and if we only rarely modify buffers, the amount of buffers in the
ring is too big.


[2]

For synchronize_seqscans to work, SYNC_SCAN_REPORT_INTERVAL needs to be
smaller than the BAS_BULKREAD size. The factor is 2x right now - way, way
too small with IO rates achievable on even a cheap laptop. Cheap storage can
do reads at gigabytes/second and one backend can process many hundreds of MB
each second. Having only 128kB between SYNC_SCAN_REPORT_INTERVAL and the
BAS_BULKREAD size makes it very likely that we have to re-read the buffer from
the OS.  That's bad with buffered IO, catastrophic with direct IO.

On a local NVMe SSD, on a large table with the normal BAS_BULKREAD, I often
get effectively *no* buffer hits when running two concurrent seqscans, even if
I disable query parallelism.

The hit counts are sometimes in the 10s, sometimes in the low thousands, but
always < 0.1%.  At 16MB I get 25% hits, at 128MB it's close to 50% (the max
you could get).

With parallelism I see very low hit rates even with 16MB; only with 256MB
does it start to get better.


[3] https://postgr.es/m/yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq%403r54w5zqldwn



Re: BAS_BULKREAD vs read stream

From: Melanie Plageman

On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> I think we should consider increasing BAS_BULKREAD to something like
>   Min(256, io_combine_limit * (effective_io_concurrency + 1))

Do you mean Max? If so, this basically makes sense to me.
Overall, I think even though the ring is about reusing buffers, we
have to think about how many IOs that reasonably is -- which this
formula does.

You mentioned testing with 8MB, did you see some sort of cliff
anywhere between 256kB and 8MB?

> I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> couldn't come up with any benefits. It seems to hurt fairly quickly.

So, how will you deal with it when the BAS_BULKREAD ring is bigger?

- Melanie



Re: BAS_BULKREAD vs read stream

From: Andres Freund

Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > I think we should consider increasing BAS_BULKREAD to something like
> >   Min(256, io_combine_limit * (effective_io_concurrency + 1))
>
> Do you mean Max? If so, this basically makes sense to me.

Err, yes.

I was wondering whether we should add a Max(SYNC_SCAN_REPORT_INTERVAL, ...),
but it's a private value, and the proposed formula doesn't really change
anything for SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.


> Overall, I think even though the ring is about reusing buffers, we
> have to think about how many IOs that reasonably is -- which this
> formula does.

Right - the prior limit kinda made sense before we had IO combining, but
after that *and* having AIO it is clearly obsolete.


> You mentioned testing with 8MB, did you see some sort of cliff anywhere
> between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64kB and 128kB, then gets
worse between 128kB and 256kB (the current value), and then seems to gradually
get worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB and then is very
good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I
do see a 6% decline from 16MB->32MB.


I suspect the 16-32MB cliff is due to L3-related effects; L3 is 13.8MB per
socket (of which I have 2). It's not entirely clear what that effect is -
all the additional cycles are spent in the kernel, not in userspace.  I
strongly suspect it's related to SMAP [1], but I don't really understand the
details. All I know is that disabling SMAP removes this cliff on several Intel
and AMD systems, both client and server CPUs.


For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x
faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same
up to 64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB,
3.2x faster at 4MB and stays the same up to 64MB.


For DIO/unbuffered IO:

As io_method=sync, obviously, doesn't do DIO/unbuffered IO in a reasonable
way, it doesn't make sense to compare it. So I'm comparing to buffered IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the
throughput of buffered IO) and slightly worse than buffered at 256kB. Best
perf is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x
the throughput of buffered IO) and starts to be decent after that. Best perf
is reached at 4MB and stays very consistent after that.

The peak perf of buffered but uncached IO and DIO is rather close, as
I'm testing this on a PCIe3 drive.

The difference in CPU cycles is massive though:


worker buffered:

          9,850.27 msec cpu-clock                        #    3.001 CPUs utilized
           305,050      context-switches                 #   30.969 K/sec
            51,049      cpu-migrations                   #    5.182 K/sec
            11,530      page-faults                      #    1.171 K/sec
    16,615,532,455      instructions                     #    0.84  insn per cycle              (30.72%)
    19,876,584,840      cycles                           #    2.018 GHz                         (30.75%)
     3,256,065,951      branches                         #  330.556 M/sec                       (30.78%)
        26,046,144      branch-misses                    #    0.80% of all branches             (30.81%)
     4,452,808,846      L1-dcache-loads                  #  452.050 M/sec                       (30.83%)
       574,304,216      L1-dcache-load-misses            #   12.90% of all L1-dcache accesses   (30.82%)
       169,117,254      LLC-loads                        #   17.169 M/sec                       (30.82%)
        82,769,152      LLC-load-misses                  #   48.94% of all LL-cache accesses    (30.82%)
       377,137,247      L1-icache-load-misses                                                   (30.78%)
     4,475,873,620      dTLB-loads                       #  454.391 M/sec                       (30.76%)
         5,496,266      dTLB-load-misses                 #    0.12% of all dTLB cache accesses  (30.73%)
         9,765,507      iTLB-loads                       #  991.395 K/sec                       (30.70%)
         7,525,173      iTLB-load-misses                 #   77.06% of all iTLB cache accesses  (30.70%)

       3.282465335 seconds time elapsed

worker DIO:
          9,783.05 msec cpu-clock                        #    3.000 CPUs utilized
           356,102      context-switches                 #   36.400 K/sec
            32,575      cpu-migrations                   #    3.330 K/sec
             1,245      page-faults                      #  127.261 /sec
     8,076,414,780      instructions                     #    1.00  insn per cycle              (30.73%)
     8,109,508,194      cycles                           #    0.829 GHz                         (30.73%)
     1,585,426,781      branches                         #  162.058 M/sec                       (30.74%)
        17,869,296      branch-misses                    #    1.13% of all branches             (30.78%)
     2,199,974,033      L1-dcache-loads                  #  224.876 M/sec                       (30.79%)
       167,855,899      L1-dcache-load-misses            #    7.63% of all L1-dcache accesses   (30.79%)
        31,303,238      LLC-loads                        #    3.200 M/sec                       (30.79%)
         2,126,825      LLC-load-misses                  #    6.79% of all LL-cache accesses    (30.79%)
       322,505,615      L1-icache-load-misses                                                   (30.79%)
     2,186,161,593      dTLB-loads                       #  223.464 M/sec                       (30.79%)
         3,892,051      dTLB-load-misses                 #    0.18% of all dTLB cache accesses  (30.79%)
        10,306,643      iTLB-loads                       #    1.054 M/sec                       (30.77%)
         6,279,217      iTLB-load-misses                 #   60.92% of all iTLB cache accesses  (30.74%)

       3.260901966 seconds time elapsed


io_uring buffered:

          9,924.48 msec cpu-clock                        #    2.990 CPUs utilized
           340,821      context-switches                 #   34.341 K/sec
            57,048      cpu-migrations                   #    5.748 K/sec
             1,336      page-faults                      #  134.617 /sec
    16,630,629,989      instructions                     #    0.88  insn per cycle              (30.74%)
    18,985,579,559      cycles                           #    1.913 GHz                         (30.64%)
     3,253,081,357      branches                         #  327.784 M/sec                       (30.67%)
        24,599,858      branch-misses                    #    0.76% of all branches             (30.68%)
     4,515,979,721      L1-dcache-loads                  #  455.035 M/sec                       (30.69%)
       556,041,180      L1-dcache-load-misses            #   12.31% of all L1-dcache accesses   (30.67%)
       160,198,962      LLC-loads                        #   16.142 M/sec                       (30.65%)
        75,164,349      LLC-load-misses                  #   46.92% of all LL-cache accesses    (30.65%)
       348,585,830      L1-icache-load-misses                                                   (30.63%)
     4,473,414,356      dTLB-loads                       #  450.746 M/sec                       (30.91%)
         1,193,495      dTLB-load-misses                 #    0.03% of all dTLB cache accesses  (31.04%)
         5,507,512      iTLB-loads                       #  554.942 K/sec                       (31.02%)
         2,973,177      iTLB-load-misses                 #   53.98% of all iTLB cache accesses  (31.02%)

       3.319117422 seconds time elapsed

io_uring DIO:

          9,782.99 msec cpu-clock                        #    3.000 CPUs utilized
            96,916      context-switches                 #    9.907 K/sec
                 8      cpu-migrations                   #    0.818 /sec
             1,001      page-faults                      #  102.320 /sec
     5,902,978,172      instructions                     #    1.45  insn per cycle              (30.73%)
     4,059,940,112      cycles                           #    0.415 GHz                         (30.73%)
     1,117,690,786      branches                         #  114.248 M/sec                       (30.75%)
        10,994,087      branch-misses                    #    0.98% of all branches             (30.77%)
     1,559,149,686      L1-dcache-loads                  #  159.374 M/sec                       (30.78%)
        85,057,280      L1-dcache-load-misses            #    5.46% of all L1-dcache accesses   (30.78%)
        11,393,236      LLC-loads                        #    1.165 M/sec                       (30.78%)
         2,599,701      LLC-load-misses                  #   22.82% of all LL-cache accesses    (30.79%)
       174,124,990      L1-icache-load-misses                                                   (30.80%)
     1,545,148,685      dTLB-loads                       #  157.942 M/sec                       (30.79%)
           156,524      dTLB-load-misses                 #    0.01% of all dTLB cache accesses  (30.79%)
         3,325,307      iTLB-loads                       #  339.907 K/sec                       (30.77%)
         2,288,730      iTLB-load-misses                 #   68.83% of all iTLB cache accesses  (30.74%)

       3.260716339 seconds time elapsed

I'd say a 4.5x reduction in cycles is rather nice :)



> > I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> > couldn't come up with any benefits. It seems to hurt fairly quickly.
>
> So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant by "hurt
fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make
synchronize_seqscans work even less well.

Greetings,

Andres Freund

[1] https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention



Re: BAS_BULKREAD vs read stream

From: Andres Freund

Hi,

On 2025-04-07 16:28:20 -0400, Andres Freund wrote:
> On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> > On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > I think we should consider increasing BAS_BULKREAD to something like
> > >   Min(256, io_combine_limit * (effective_io_concurrency + 1))
> >
> > Do you mean Max? If so, this basically makes sense to me.
>
> Err, yes.

In the attached I implemented the above idea, with some small additional
refinements:

- To allow sync seqscans to work at all, we should only *add* to the 256kB
  that we currently have - otherwise all buffers in a ring will be undergoing
  IO, never allowing two synchronizing scans to actually use the buffers that
  the other scan has already read in.

  This also kind of obsoletes the + 1 in the formula above, although that is
  arguable, particularly for effective_io_concurrency=0.

- If the backend has a PinLimit() that won't allow io_combine_limit *
  effective_io_concurrency buffers to undergo IO, it doesn't make sense to
  make the ring bigger. At best it would waste space for the ring, at worst
  it'd make "ring escapes" inevitable - victim buffer search would always
  replace buffers that we have in the ring.

- The multiplication by (BLCKSZ / 1024) that I omitted above is actually
  included :)
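
Paraphrased, the sizing in the attached looks roughly like this (a sketch,
not the patch verbatim; GetPinLimit() stands in for the backend pin-limit
helper mentioned above):

    case BAS_BULKREAD:
        {
            int     ring_max_kb;

            /*
             * Start from the historical 256kB, so that some buffers in
             * the ring are never undergoing IO and sync scans can work.
             */
            ring_size_kb = 256;

            /*
             * Add room for the configured IO depth: up to
             * effective_io_concurrency IOs of up to io_combine_limit
             * blocks each can be in flight, and those buffers can't be
             * reused while the IO is running.
             */
            ring_size_kb += (BLCKSZ / 1024) *
                io_combine_limit * effective_io_concurrency;

            /*
             * But a ring bigger than what this backend may pin would
             * just waste space and make "ring escapes" inevitable -
             * cap it, without ever going below the 256kB minimum.
             */
            ring_max_kb = GetPinLimit() * (BLCKSZ / 1024);
            ring_max_kb = Max(256, ring_max_kb);
            ring_size_kb = Min(ring_size_kb, ring_max_kb);
        }
        break;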


I unfortunately think we do need *something* to address $subject for 18 - the
performance regression when increasing relation sizes is otherwise just too
big. It's trivial to find queries getting slower by more than 4x on local,
low-latency NVMe storage, and on network storage the regression will often be
bigger.

If we don't do something for 18, the only consolation would be that the
performance when using the 256kB BAS_BULKREAD is rather close to the
performance one gets in 17, with or without a strategy. But I don't think
that would make it less surprising that once your table grows sufficiently to
use a strategy, your IO throughput craters.


I have a local prototype for the 17/18 "strategy escape" issue and will work
on polishing that soon - unless you have something for that, Thomas?

Greetings,

Andres Freund


Re: BAS_BULKREAD vs read stream

From: Thomas Munro

On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> In the attached I implemented the above idea, with some small additional
> refinements:

LGTM.

How I wish EXPLAIN would show some clues about strategies...



Re: BAS_BULKREAD vs read stream

From: Andres Freund

Hi,

On 2025-04-08 18:11:04 +1200, Thomas Munro wrote:
> On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> > In the attached I implemented the above idea, with some small additional
> > refinements:
> 
> LGTM.

Thanks for checking.


> How I wish EXPLAIN would show some clues about strategies...

Indeed. There will be some interesting piercing of layers to make that work...

Greetings,

Andres Freund



Re: BAS_BULKREAD vs read stream

From: Jakub Wartak

On Sun, Apr 6, 2025 at 10:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
[..]
> The obvious solution to that would be to increase BAS_BULKREAD substantially
> above 256kB.
>
> For quite a while I was worried about increasing the size, because somewhere (I
> couldn't find it while writing this email, will add the reference once I've
> found it again) we have a comment explaining that a small size was chosen
> because it helps with CPU cache efficiency.

Hi, FWIW, I was trying to understand the scope of this change, and
GetAccessStrategy() actually points to src/backend/storage/buffer/README,
which explains the logic behind the old (now superseded) rationale and value.
It says:
```
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient.
```

-J.