From: Yura Sokolov
Subject: Re: Get rid of WALBufMappingLock
Msg-id: 7a47881d-c857-4c6c-8daa-7e6e6ac25058@postgrespro.ru
In response to: Re: Get rid of WALBufMappingLock (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

Good day,

14.03.2025 17:30, Tomas Vondra wrote:
> Hi,
> 
> I've briefly looked at this patch this week, and done a bit of testing.
> I don't have any comments about the correctness - it does seem correct
> to me and I haven't noticed any crashes/issues, but I'm not familiar
> with the WALBufMappingLock enough to have insightful opinions.
> 
> I have however decided to do a bit of benchmarking, to better understand
> the possible benefits of the change. I happen to have access to an Azure
> machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that
> can do ~1.5GB/s.
> 
> The benchmark script (attached) uses the workload mentioned by Andres
> some time ago [1]
> 
>    SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> 
> with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> results look like this (this is throughput):
> 
>            |  8                 |  64                |  1024
>   clients  |  master   patched  |  master   patched  |  master  patched
>   ---------------------------------------------------------------------
>         1  |   11864     12035  |    7419      7345  |     968      940
>         4  |   26311     26919  |   12414     12308  |    1304     1293
>         8  |   38742     39651  |   14316     14539  |    1348     1348
>        16  |   57299     59917  |   15405     15871  |    1304     1279
>        32  |   74857     82598  |   17589     17126  |    1233     1233
>        48  |   87596     95495  |   18616     18160  |    1199     1227
>        64  |   89982     97715  |   19033     18910  |    1196     1221
>        96  |   92853    103448  |   19694     19706  |    1190     1210
>       128  |   95392    103324  |   20085     19873  |    1188     1213
>       160  |   94933    102236  |   20227     20323  |    1180     1214
>       196  |   95933    103341  |   20448     20513  |    1188     1199
> 
> To put this into a perspective, this throughput relative to master:
> 
>   clients  |     8      64     1024
>   ----------------------------------
>         1  |  101%     99%      97%
>         4  |  102%     99%      99%
>         8  |  102%    102%     100%
>        16  |  105%    103%      98%
>        32  |  110%     97%     100%
>        48  |  109%     98%     102%
>        64  |  109%     99%     102%
>        96  |  111%    100%     102%
>       128  |  108%     99%     102%
>       160  |  108%    100%     103%
>       196  |  108%    100%     101%
> 
> That does not seem like a huge improvement :-( Yes, there's 1-10%
> speedup for the small (8K) size, but for larger chunks it's a wash.
> 
> Looking at the pgbench progress, I noticed stuff like this:
> 
> ...
> progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> ...
> 
> i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> things (it's a virtual SSD, I'm not sure what model is that behind the
> curtains). So I decided to try running the benchmark on tmpfs, to get
> the storage out of the way and get the "best case" results.
> 
> This makes the pgbench progress perfectly "smooth" (no jumps like in the
> output above), and the comparison looks like this:
> 
>            |  8                  |  64                | 1024
>   clients  |  master    patched  |  master   patched  | master  patched
>   ---------|---------------------|--------------------|----------------
>         1  |   32449      32032  |   19289     20344  |   3108     3081
>         4  |   68779      69256  |   24585     29912  |   2915     3449
>         8  |   79787     100655  |   28217     39217  |   3182     4086
>        16  |  113024     148968  |   42969     62083  |   5134     5712
>        32  |  125884     170678  |   44256     71183  |   4910     5447
>        48  |  125571     166695  |   44693     76411  |   4717     5215
>        64  |  122096     160470  |   42749     83754  |   4631     5103
>        96  |  120170     154145  |   42696     86529  |   4556     5020
>       128  |  119204     152977  |   40880     88163  |   4529     5047
>       160  |  116081     152708  |   42263     88066  |   4512     5000
>       196  |  115364     152455  |   40765     88602  |   4505     4952
> 
> and the comparison to master:
> 
>   clients         8          64        1024
>   -----------------------------------------
>         1       99%        105%         99%
>         4      101%        122%        118%
>         8      126%        139%        128%
>        16      132%        144%        111%
>        32      136%        161%        111%
>        48      133%        171%        111%
>        64      131%        196%        110%
>        96      128%        203%        110%
>       128      128%        216%        111%
>       160      132%        208%        111%
>       196      132%        217%        110%
> 
> Yes, with tmpfs the impact looks much more significant. For 8K the
> speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> 
> 
> That being said, I wonder how big is the impact for practical workloads.
> ISTM this workload is pretty narrow / extreme, it'd be much easier if we
> had an example of a more realistic workload, benefiting from this. Of
> course, it may be the case that there are multiple related bottlenecks,
> and we'd need to fix all of them - in which case it'd be silly to block
> the improvements on the grounds that it alone does not help.
> 
> Another thought is that this is testing the "good case". Can anyone
> think of a workload that would be made worse by the patch?

I've run a similar benchmark on a system with two Xeon Gold 5220R CPUs and two
Samsung SSD 970 PRO 1TB drives mirrored with md.

Configuration changes:
wal_sync_method = open_datasync
full_page_writes = off
synchronous_commit = off
checkpoint_timeout = 1d
max_connections = 1000
max_wal_size = 4GB
min_wal_size = 640MB

I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB and
1GB) and the record size (1kB, 8kB and 64kB).

(I didn't benchmark a 1MB record size, since I don't believe it is critical for
performance.)
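
Each run uses the same pg_logical_emit_message workload quoted above; roughly,
a single run looks like the sketch below (file names, sizes, client counts and
durations are illustrative, not the exact script I used):

    # WAL segment size is fixed at initdb time; record size and client
    # count are the swept parameters.
    initdb --wal-segsize=64 -D "$PGDATA"
    # (the configuration changes listed above go into $PGDATA/postgresql.conf
    #  before the server is started)
    pg_ctl -D "$PGDATA" -l logfile start
    echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 65536));" > emit.sql
    pgbench -n -f emit.sql -c 64 -j 64 -T 30 -P 1 postgres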

Here are the results for a 64MB segment size and 1GB wal_buffers (record size
in kB):

+---------+---------+------------+--------------+----------+
| recsize | clients | master_tps | nowalbuf_tps | rel_perf |
+---------+---------+------------+--------------+----------+
| 1       | 1       | 47991.0    | 46995.0      | 0.98     |
| 1       | 4       | 171930.0   | 171166.0     | 1.0      |
| 1       | 16      | 491240.0   | 485132.0     | 0.99     |
| 1       | 64      | 514590.0   | 515534.0     | 1.0      |
| 1       | 128     | 547222.0   | 543543.0     | 0.99     |
| 1       | 256     | 543353.0   | 540802.0     | 1.0      |
| 8       | 1       | 40976.0    | 41603.0      | 1.02     |
| 8       | 4       | 89003.0    | 92008.0      | 1.03     |
| 8       | 16      | 90457.0    | 92282.0      | 1.02     |
| 8       | 64      | 89293.0    | 92022.0      | 1.03     |
| 8       | 128     | 92687.0    | 92768.0      | 1.0      |
| 8       | 256     | 91874.0    | 91665.0      | 1.0      |
| 64      | 1       | 11829.0    | 12031.0      | 1.02     |
| 64      | 4       | 11959.0    | 12832.0      | 1.07     |
| 64      | 16      | 11331.0    | 13417.0      | 1.18     |
| 64      | 64      | 11108.0    | 13588.0      | 1.22     |
| 64      | 128     | 11089.0    | 13648.0      | 1.23     |
| 64      | 256     | 10381.0    | 13542.0      | 1.3      |
+---------+---------+------------+--------------+----------+

Numbers for all configurations are in the attached 'improvements.out'. They
show that removing WALBufMappingLock almost never harms performance and
usually gives a measurable gain.

(Each number is the average of the 4 middle runs out of 6, i.e. I discarded the
minimum and maximum tps of the 6 runs and averaged the remaining four.)
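
In shell terms, for a hypothetical file tps.txt holding the six per-run tps
values, that trimmed average is roughly:

    # drop the largest, then the smallest, average the middle four
    sort -n tps.txt | head -n 5 | tail -n 4 | awk '{ s += $1 } END { print s/4 }'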

An SQLite database with all the results is also attached. It also contains
results for the patch "Several attempts to lock WALInsertLock" (named
"attempts") and for the cumulative patch ("nowalbuf-attempts").
Surprisingly, "Several attempts" has a measurable effect in some configurations
with hundreds of clients, so there are more bottlenecks ahead ))


Yes, it is still not a "real-world" benchmark, but it at least shows the patch
is harmless.

-- 
regards
Yura Sokolov aka funny-falcon