
From Andres Freund
Subject Re: BAS_BULKREAD vs read stream
Date
Msg-id dr4rjc4xewy5uf2dzywuq2fh6fnaydiwxexumjx3b6hkefatcn@kibyxaztit2i
In response to Re: BAS_BULKREAD vs read stream  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > I think we should consider increasing BAS_BULKREAD TO something like
> >   Min(256, io_combine_limit * (effective_io_concurrency + 1))
>
> Do you mean Max? If so, this basically makes sense to me.

Err, yes.

I was wondering whether we should add a Max(SYNC_SCAN_REPORT_INTERVAL, ...),
but it's a private value, and the proposed formula doesn't really change
anything for SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.
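
To make the arithmetic concrete, here is a small standalone sketch of the
corrected formula (Max rather than Min). It assumes the 256 is in kB, like the
current BAS_BULKREAD ring size, expresses io_combine_limit in kB as well, and
plugs in illustrative GUC values - it is not the actual freelist.c change:

#include <stdio.h>

#define Max(a, b)  ((a) > (b) ? (a) : (b))

int
main(void)
{
    /* illustrative values, not necessarily the defaults on every release */
    int io_combine_limit_kb = 128;      /* e.g. 16 blocks of 8kB */
    int effective_io_concurrency = 16;

    int ring_size_kb =
        Max(256, io_combine_limit_kb * (effective_io_concurrency + 1));

    /* 128kB * (16 + 1) = 2176kB, i.e. ~2.1MB instead of the current 256kB */
    printf("BAS_BULKREAD ring size: %dkB\n", ring_size_kb);

    return 0;
}

With effective_io_concurrency = 1 the Max() keeps the ring at the current
256kB, so the formula only grows the ring when higher concurrency is
configured.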


> Overall, I think even though the ring is about reusing buffers, we
> have to think about how many IOs that reasonably is -- which this
> formula does.

Right - the prior limit made some sense before we had IO combining, but with
that *and* AIO it is clearly obsolete.


> You mentioned testing with 8MB, did you see some sort of cliff anywhere
> between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64kB and 128kB, then gets worse
between 128kB and 256kB (the current value), and then seems to gradually get
worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB and then is very
good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I
do see a 6% decline from 16MB->32MB.


I suspect the 16-32MB cliff is due to L3-related effects; the L3 is 13.8MB per
socket (of which I have 2). It's not entirely clear what that effect is -
all the additional cycles are spent in the kernel, not in userspace.  I
strongly suspect it's related to SMAP [1], but I don't really understand the
details. All I know is that disabling SMAP removes this cliff on several Intel
and AMD systems, both client and server CPUs.


For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x
faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same
up to 64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB,
3.2x faster at 4MB and stays the same up to 64MB.


For DIO/unbuffered IO:

Since io_method=sync obviously doesn't do DIO/unbuffered IO in a reasonable
way, it doesn't make sense to compare against it. So I'm comparing to buffered
IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the
throughput of buffered IO) and slightly worse than buffered at 256kB. Best perf
is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x the
throughput of buffered IO) and starts to be decent after that. Best perf is
reached at 4MB and stays very consistent after that.

The peak perf of buffered but uncached IO and DIO is rather close, as
I'm testing this on a PCIe3 drive.

The difference in CPU cycles is massive though:


worker buffered:

          9,850.27 msec cpu-clock                        #    3.001 CPUs utilized
           305,050      context-switches                 #   30.969 K/sec
            51,049      cpu-migrations                   #    5.182 K/sec
            11,530      page-faults                      #    1.171 K/sec
    16,615,532,455      instructions                     #    0.84  insn per cycle              (30.72%)
    19,876,584,840      cycles                           #    2.018 GHz                         (30.75%)
     3,256,065,951      branches                         #  330.556 M/sec                       (30.78%)
        26,046,144      branch-misses                    #    0.80% of all branches             (30.81%)
     4,452,808,846      L1-dcache-loads                  #  452.050 M/sec                       (30.83%)
       574,304,216      L1-dcache-load-misses            #   12.90% of all L1-dcache accesses   (30.82%)
       169,117,254      LLC-loads                        #   17.169 M/sec                       (30.82%)
        82,769,152      LLC-load-misses                  #   48.94% of all LL-cache accesses    (30.82%)
       377,137,247      L1-icache-load-misses                                                   (30.78%)
     4,475,873,620      dTLB-loads                       #  454.391 M/sec                       (30.76%)
         5,496,266      dTLB-load-misses                 #    0.12% of all dTLB cache accesses  (30.73%)
         9,765,507      iTLB-loads                       #  991.395 K/sec                       (30.70%)
         7,525,173      iTLB-load-misses                 #   77.06% of all iTLB cache accesses  (30.70%)

       3.282465335 seconds time elapsed

worker DIO:
          9,783.05 msec cpu-clock                        #    3.000 CPUs utilized
           356,102      context-switches                 #   36.400 K/sec
            32,575      cpu-migrations                   #    3.330 K/sec
             1,245      page-faults                      #  127.261 /sec
     8,076,414,780      instructions                     #    1.00  insn per cycle              (30.73%)
     8,109,508,194      cycles                           #    0.829 GHz                         (30.73%)
     1,585,426,781      branches                         #  162.058 M/sec                       (30.74%)
        17,869,296      branch-misses                    #    1.13% of all branches             (30.78%)
     2,199,974,033      L1-dcache-loads                  #  224.876 M/sec                       (30.79%)
       167,855,899      L1-dcache-load-misses            #    7.63% of all L1-dcache accesses   (30.79%)
        31,303,238      LLC-loads                        #    3.200 M/sec                       (30.79%)
         2,126,825      LLC-load-misses                  #    6.79% of all LL-cache accesses    (30.79%)
       322,505,615      L1-icache-load-misses                                                   (30.79%)
     2,186,161,593      dTLB-loads                       #  223.464 M/sec                       (30.79%)
         3,892,051      dTLB-load-misses                 #    0.18% of all dTLB cache accesses  (30.79%)
        10,306,643      iTLB-loads                       #    1.054 M/sec                       (30.77%)
         6,279,217      iTLB-load-misses                 #   60.92% of all iTLB cache accesses  (30.74%)

       3.260901966 seconds time elapsed


io_uring buffered:

          9,924.48 msec cpu-clock                        #    2.990 CPUs utilized
           340,821      context-switches                 #   34.341 K/sec
            57,048      cpu-migrations                   #    5.748 K/sec
             1,336      page-faults                      #  134.617 /sec
    16,630,629,989      instructions                     #    0.88  insn per cycle              (30.74%)
    18,985,579,559      cycles                           #    1.913 GHz                         (30.64%)
     3,253,081,357      branches                         #  327.784 M/sec                       (30.67%)
        24,599,858      branch-misses                    #    0.76% of all branches             (30.68%)
     4,515,979,721      L1-dcache-loads                  #  455.035 M/sec                       (30.69%)
       556,041,180      L1-dcache-load-misses            #   12.31% of all L1-dcache accesses   (30.67%)
       160,198,962      LLC-loads                        #   16.142 M/sec                       (30.65%)
        75,164,349      LLC-load-misses                  #   46.92% of all LL-cache accesses    (30.65%)
       348,585,830      L1-icache-load-misses                                                   (30.63%)
     4,473,414,356      dTLB-loads                       #  450.746 M/sec                       (30.91%)
         1,193,495      dTLB-load-misses                 #    0.03% of all dTLB cache accesses  (31.04%)
         5,507,512      iTLB-loads                       #  554.942 K/sec                       (31.02%)
         2,973,177      iTLB-load-misses                 #   53.98% of all iTLB cache accesses  (31.02%)

       3.319117422 seconds time elapsed

io_uring DIO:

          9,782.99 msec cpu-clock                        #    3.000 CPUs utilized
            96,916      context-switches                 #    9.907 K/sec
                 8      cpu-migrations                   #    0.818 /sec
             1,001      page-faults                      #  102.320 /sec
     5,902,978,172      instructions                     #    1.45  insn per cycle              (30.73%)
     4,059,940,112      cycles                           #    0.415 GHz                         (30.73%)
     1,117,690,786      branches                         #  114.248 M/sec                       (30.75%)
        10,994,087      branch-misses                    #    0.98% of all branches             (30.77%)
     1,559,149,686      L1-dcache-loads                  #  159.374 M/sec                       (30.78%)
        85,057,280      L1-dcache-load-misses            #    5.46% of all L1-dcache accesses   (30.78%)
        11,393,236      LLC-loads                        #    1.165 M/sec                       (30.78%)
         2,599,701      LLC-load-misses                  #   22.82% of all LL-cache accesses    (30.79%)
       174,124,990      L1-icache-load-misses                                                   (30.80%)
     1,545,148,685      dTLB-loads                       #  157.942 M/sec                       (30.79%)
           156,524      dTLB-load-misses                 #    0.01% of all dTLB cache accesses  (30.79%)
         3,325,307      iTLB-loads                       #  339.907 K/sec                       (30.77%)
         2,288,730      iTLB-load-misses                 #   68.83% of all iTLB cache accesses  (30.74%)

       3.260716339 seconds time elapsed

I'd say a 4.5x reduction in cycles (~19.0 vs ~4.1 billion for io_uring buffered
vs DIO) is rather nice :)



> > I experimented some whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> > couldn't come up with any benefits. It seems to hurt fairly quickly.
>
> So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant by "hurt
fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make
synchronize_seqscans work even less well.

Greetings,

Andres Freund

[1] https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention


