Re: Get rid of WALBufMappingLock - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Get rid of WALBufMappingLock
Date
Msg-id CAPpHfdtOBgoUfGiw9exZPpS7e4EE7SdEBRt9VDfgYJpC2Jd5DA@mail.gmail.com
In response to Re: Get rid of WALBufMappingLock  (Yura Sokolov <y.sokolov@postgrespro.ru>)
List pgsql-hackers
On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> 14.03.2025 17:30, Tomas Vondra wrote:
> > Hi,
> >
> > I've briefly looked at this patch this week, and done a bit of testing.
> > I don't have any comments about the correctness - it does seem correct
> > to me and I haven't noticed any crashes/issues, but I'm not familiar
> > with the WALBufMappingLock enough to have insightful opinions.
> >
> > I have however decided to do a bit of benchmarking, to better understand
> > the possible benefits of the change. I happen to have access to an Azure
> > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that
> > can do ~1.5GB/s.
> >
> > The benchmark script (attached) uses the workload mentioned by Andres
> > some time ago [1]
> >
> >    SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> >
> > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> > results look like this (this is throughput):
> >
> >            |  8                 |  64                |  1024
> >   clients  |  master   patched  |  master   patched  |  master  patched
> >   ---------------------------------------------------------------------
> >         1  |   11864     12035  |    7419      7345  |     968      940
> >         4  |   26311     26919  |   12414     12308  |    1304     1293
> >         8  |   38742     39651  |   14316     14539  |    1348     1348
> >        16  |   57299     59917  |   15405     15871  |    1304     1279
> >        32  |   74857     82598  |   17589     17126  |    1233     1233
> >        48  |   87596     95495  |   18616     18160  |    1199     1227
> >        64  |   89982     97715  |   19033     18910  |    1196     1221
> >        96  |   92853    103448  |   19694     19706  |    1190     1210
> >       128  |   95392    103324  |   20085     19873  |    1188     1213
> >       160  |   94933    102236  |   20227     20323  |    1180     1214
> >       196  |   95933    103341  |   20448     20513  |    1188     1199
> >
> > To put this into perspective, this is the throughput relative to master:
> >
> >   clients  |     8      64     1024
> >   ----------------------------------
> >         1  |  101%     99%      97%
> >         4  |  102%     99%      99%
> >         8  |  102%    102%     100%
> >        16  |  105%    103%      98%
> >        32  |  110%     97%     100%
> >        48  |  109%     98%     102%
> >        64  |  109%     99%     102%
> >        96  |  111%    100%     102%
> >       128  |  108%     99%     102%
> >       160  |  108%    100%     103%
> >       196  |  108%    100%     101%
> >
> > That does not seem like a huge improvement :-( Yes, there's 1-10%
> > speedup for the small (8K) size, but for larger chunks it's a wash.
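> >
> > (The benchmark essentially wraps the pg_logical_emit_message() call above
> > in a pgbench custom script, roughly along these lines; this is only a
> > sketch, and the record size, client count and duration shown are
> > illustrative:
> >
> >    $ cat emit.sql
> >    SELECT pg_logical_emit_message(true, 'test', repeat('0', :size));
> >
> >    $ pgbench -n -f emit.sql -D size=8192 -c 64 -j 64 -T 30 -P 1 postgres
> > )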
> >
> > Looking at the pgbench progress, I noticed stuff like this:
> >
> > ...
> > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> > ...
> >
> > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> > things (it's a virtual SSD, I'm not sure what model is behind the
> > curtain). So I decided to try running the benchmark on tmpfs, to get
> > the storage out of the way and get the "best case" results.
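> >
> > (A tmpfs setup roughly like the following, with illustrative mount point
> > and size:
> >
> >    $ sudo mount -t tmpfs -o size=64G tmpfs /mnt/pgtmp
> >    $ initdb -D /mnt/pgtmp/data
> >    $ pg_ctl -D /mnt/pgtmp/data -l /mnt/pgtmp/logfile start
> > )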
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> >            |  8                  |  64                | 1024
> >   clients  |  master    patched  |  master   patched  | master  patched
> >   ---------|---------------------|--------------------|----------------
> >         1  |   32449      32032  |   19289     20344  |   3108     3081
> >         4  |   68779      69256  |   24585     29912  |   2915     3449
> >         8  |   79787     100655  |   28217     39217  |   3182     4086
> >        16  |  113024     148968  |   42969     62083  |   5134     5712
> >        32  |  125884     170678  |   44256     71183  |   4910     5447
> >        48  |  125571     166695  |   44693     76411  |   4717     5215
> >        64  |  122096     160470  |   42749     83754  |   4631     5103
> >        96  |  120170     154145  |   42696     86529  |   4556     5020
> >       128  |  119204     152977  |   40880     88163  |   4529     5047
> >       160  |  116081     152708  |   42263     88066  |   4512     5000
> >       196  |  115364     152455  |   40765     88602  |   4505     4952
> >
> > and the comparison to master:
> >
> >   clients         8          64        1024
> >   -----------------------------------------
> >         1       99%        105%         99%
> >         4      101%        122%        118%
> >         8      126%        139%        128%
> >        16      132%        144%        111%
> >        32      136%        161%        111%
> >        48      133%        171%        111%
> >        64      131%        196%        110%
> >        96      128%        203%        110%
> >       128      128%        216%        111%
> >       160      132%        208%        111%
> >       196      132%        217%        110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks,
> > and we'd need to fix all of them - in which case it'd be silly to block
> > the improvements on the grounds that it alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
>
> I've run a similar benchmark on a system with two Xeon Gold 5220R CPUs and
> two Samsung SSD 970 PRO 1TB drives mirrored by md.
>
> Configuration changes:
> wal_sync_method = open_datasync
> full_page_writes = off
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_connections = 1000
> max_wal_size = 4GB
> min_wal_size = 640MB
>
> I varied the wal segment size (16MB and 64MB), wal_buffers (128kB, 16MB and
> 1GB), and the record size (1kB, 8kB and 64kB).
>
> (I didn't benchmark the 1MB record size, since I don't believe it is critical
> for performance.)
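>
> For example, a cluster with 64MB segments and 1GB of wal_buffers can be set
> up roughly like this (a sketch with illustrative paths, not my exact
> commands):
>
>    initdb --wal-segsize=64 -D /path/to/data
>    echo "wal_buffers = '1GB'" >> /path/to/data/postgresql.conf
>    # plus the configuration changes listed above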
>
> Here's results for 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> | 1       | 1       | 47991.0    | 46995.0      | 0.98     |
> | 1       | 4       | 171930.0   | 171166.0     | 1.0      |
> | 1       | 16      | 491240.0   | 485132.0     | 0.99     |
> | 1       | 64      | 514590.0   | 515534.0     | 1.0      |
> | 1       | 128     | 547222.0   | 543543.0     | 0.99     |
> | 1       | 256     | 543353.0   | 540802.0     | 1.0      |
> | 8       | 1       | 40976.0    | 41603.0      | 1.02     |
> | 8       | 4       | 89003.0    | 92008.0      | 1.03     |
> | 8       | 16      | 90457.0    | 92282.0      | 1.02     |
> | 8       | 64      | 89293.0    | 92022.0      | 1.03     |
> | 8       | 128     | 92687.0    | 92768.0      | 1.0      |
> | 8       | 256     | 91874.0    | 91665.0      | 1.0      |
> | 64      | 1       | 11829.0    | 12031.0      | 1.02     |
> | 64      | 4       | 11959.0    | 12832.0      | 1.07     |
> | 64      | 16      | 11331.0    | 13417.0      | 1.18     |
> | 64      | 64      | 11108.0    | 13588.0      | 1.22     |
> | 64      | 128     | 11089.0    | 13648.0      | 1.23     |
> | 64      | 256     | 10381.0    | 13542.0      | 1.3      |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'. They
> show that removing WALBufMappingLock almost never harms performance and
> usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I threw away the
> minimum and maximum tps of the 6 runs and averaged the remaining ones.)
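>
> (In SQL terms, that trimmed mean is roughly the following, assuming all 6
> per-run tps values per configuration are stored in the attached database:
>
>    select walseg, walbuf, recsize, clients, branch,
>           (sum(tps) - min(tps) - max(tps)) / (count(*) - 2) as trimmed_tps
>      from results
>     group by walseg, walbuf, recsize, clients, branch;
> )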
>
> An sqlite database with all the results is also attached. It also contains
> results for the patch "Several attempts to lock WALInsertLock" (named
> "attempts") and for the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" causes a measurable negative impact in some
> configurations with hundreds of clients. So, there are more bottlenecks ahead ))
>
>
> Yes, it is still not a "real-world" benchmark. But it at least shows the
> patch is harmless.

Thank you for your experiments.  Your results show up to a 30% speedup
on real hardware, not tmpfs.  While this is still a corner case, I
think that's quite a result for a pretty local optimization.  At small
client counts there are some cases above and some below 1.0; I think
that is due to statistical noise.  If we calculate the average tps
ratio across the different experiments, it is still above 1.0 even for
low client counts.

sqlite> select clients, avg(ratio)
          from (select walseg, walbuf, recsize, clients,
                       (avg(tps) filter (where branch = 'nowalbuf')) /
                       (avg(tps) filter (where branch = 'master')) as ratio
                  from results
                 where branch in ('master', 'nowalbuf')
                 group by walseg, walbuf, recsize, clients) x
         group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if no objections.  I
think the second one ("Several attempts") still needs more work, as
there are regressions.

------
Regards,
Alexander Korotkov
Supabase


