Re: Get rid of WALBufMappingLock - Mailing list pgsql-hackers
From:            Alexander Korotkov
Subject:         Re: Get rid of WALBufMappingLock
Date:
Msg-id:          CAPpHfdtOBgoUfGiw9exZPpS7e4EE7SdEBRt9VDfgYJpC2Jd5DA@mail.gmail.com
In response to:  Re: Get rid of WALBufMappingLock (Yura Sokolov <y.sokolov@postgrespro.ru>)
List:            pgsql-hackers

On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> 14.03.2025 17:30, Tomas Vondra wrote:
> > Hi,
> >
> > I've briefly looked at this patch this week, and done a bit of testing.
> > I don't have any comments about the correctness - it does seem correct
> > to me and I haven't noticed any crashes/issues, but I'm not familiar
> > with the WALBufMappingLock enough to have insightful opinions.
> >
> > I have however decided to do a bit of benchmarking, to better understand
> > the possible benefits of the change. I happen to have access to an Azure
> > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that
> > can do ~1.5GB/s.
> >
> > The benchmark script (attached) uses the workload mentioned by Andres
> > some time ago [1]
> >
> >   SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> >
> > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> > results look like this (this is throughput):
> >
> >          |        8        |       64        |      1024
> >  clients | master  patched | master  patched | master  patched
> > ---------------------------------------------------------------
> >        1 |  11864    12035 |   7419     7345 |    968      940
> >        4 |  26311    26919 |  12414    12308 |   1304     1293
> >        8 |  38742    39651 |  14316    14539 |   1348     1348
> >       16 |  57299    59917 |  15405    15871 |   1304     1279
> >       32 |  74857    82598 |  17589    17126 |   1233     1233
> >       48 |  87596    95495 |  18616    18160 |   1199     1227
> >       64 |  89982    97715 |  19033    18910 |   1196     1221
> >       96 |  92853   103448 |  19694    19706 |   1190     1210
> >      128 |  95392   103324 |  20085    19873 |   1188     1213
> >      160 |  94933   102236 |  20227    20323 |   1180     1214
> >      196 |  95933   103341 |  20448    20513 |   1188     1199
> >
> > To put this into a perspective, this throughput relative to master:
> >
> >  clients |      8      64    1024
> > ---------------------------------
> >        1 |   101%     99%     97%
> >        4 |   102%     99%     99%
> >        8 |   102%    102%    100%
> >       16 |   105%    103%     98%
> >       32 |   110%     97%    100%
> >       48 |   109%     98%    102%
> >       64 |   109%     99%    102%
> >       96 |   111%    100%    102%
> >      128 |   108%     99%    102%
> >      160 |   108%    100%    103%
> >      196 |   108%    100%    101%
> >
> > That does not seem like a huge improvement :-( Yes, there's 1-10%
> > speedup for the small (8K) size, but for larger chunks it's a wash.
> >
> > Looking at the pgbench progress, I noticed stuff like this:
> >
> > ...
> > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> > ...
> >
> > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> > things (it's a virtual SSD, I'm not sure what model is that behind the
> > curtains). So I decided to try running the benchmark on tmpfs, to get
> > the storage out of the way and get the "best case" results.
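
Tomas's benchmark script is attached to his message rather than quoted here; purely as a rough
sketch, the workload above amounts to a one-statement pgbench custom script along these lines
(the file name, the size values and the pgbench flags are assumptions, not taken from the
attachment):

    -- emit_message.sql (hypothetical name); :size is supplied per run, e.g. 8192,
    -- 65536 or 1048576, matching the 8K / 64K / 1024K cases above
    SELECT pg_logical_emit_message(true, 'test', repeat('0', :size));

    -- driven by something like:
    --   pgbench -n -f emit_message.sql -D size=8192 -c $CLIENTS -j $CLIENTS -T 30 -P 1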
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> >          |        8        |       64        |      1024
> >  clients | master  patched | master  patched | master  patched
> > ---------|-----------------|-----------------|----------------
> >        1 |  32449    32032 |  19289    20344 |   3108     3081
> >        4 |  68779    69256 |  24585    29912 |   2915     3449
> >        8 |  79787   100655 |  28217    39217 |   3182     4086
> >       16 | 113024   148968 |  42969    62083 |   5134     5712
> >       32 | 125884   170678 |  44256    71183 |   4910     5447
> >       48 | 125571   166695 |  44693    76411 |   4717     5215
> >       64 | 122096   160470 |  42749    83754 |   4631     5103
> >       96 | 120170   154145 |  42696    86529 |   4556     5020
> >      128 | 119204   152977 |  40880    88163 |   4529     5047
> >      160 | 116081   152708 |  42263    88066 |   4512     5000
> >      196 | 115364   152455 |  40765    88602 |   4505     4952
> >
> > and the comparison to master:
> >
> >  clients       8      64    1024
> > ---------------------------------
> >        1     99%    105%     99%
> >        4    101%    122%    118%
> >        8    126%    139%    128%
> >       16    132%    144%    111%
> >       32    136%    161%    111%
> >       48    133%    171%    111%
> >       64    131%    196%    110%
> >       96    128%    203%    110%
> >      128    128%    216%    111%
> >      160    132%    208%    111%
> >      196    132%    217%    110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme, it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks,
> > and we'd need to fix all of them - in which case it'd be silly to block
> > the improvements on the grounds that this alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
>
> I've run a similar benchmark on a system with two Xeon Gold 5220R CPUs and
> two Samsung SSD 970 PRO 1TB drives mirrored by md.
>
> Configuration changes:
> wal_sync_method = open_datasync
> full_page_writes = off
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_connections = 1000
> max_wal_size = 4GB
> min_wal_size = 640MB
>
> I varied the wal segment size (16MB and 64MB), wal_buffers (128kB, 16MB and
> 1GB) and the record size (1kB, 8kB and 64kB).
>
> (I didn't bench the 1MB record size, since I don't believe it is critical
> for performance.)
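
As a hedged aside, not part of the original mail: wal_segment_size can only be chosen when the
cluster is created (initdb --wal-segsize=16 or 64), while the other varied knobs are ordinary
settings from postgresql.conf, so one way to double-check a given run's configuration is:

    -- segment size is fixed at initdb time; the rest follows the settings listed above
    SHOW wal_segment_size;   -- expected: 16MB or 64MB
    SHOW wal_buffers;        -- expected: 128kB, 16MB or 1GB
    SHOW wal_sync_method;    -- expected: open_datasync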
> Here are the results for 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> |       1 |       1 |    47991.0 |      46995.0 |     0.98 |
> |       1 |       4 |   171930.0 |     171166.0 |      1.0 |
> |       1 |      16 |   491240.0 |     485132.0 |     0.99 |
> |       1 |      64 |   514590.0 |     515534.0 |      1.0 |
> |       1 |     128 |   547222.0 |     543543.0 |     0.99 |
> |       1 |     256 |   543353.0 |     540802.0 |      1.0 |
> |       8 |       1 |    40976.0 |      41603.0 |     1.02 |
> |       8 |       4 |    89003.0 |      92008.0 |     1.03 |
> |       8 |      16 |    90457.0 |      92282.0 |     1.02 |
> |       8 |      64 |    89293.0 |      92022.0 |     1.03 |
> |       8 |     128 |    92687.0 |      92768.0 |      1.0 |
> |       8 |     256 |    91874.0 |      91665.0 |      1.0 |
> |      64 |       1 |    11829.0 |      12031.0 |     1.02 |
> |      64 |       4 |    11959.0 |      12832.0 |     1.07 |
> |      64 |      16 |    11331.0 |      13417.0 |     1.18 |
> |      64 |      64 |    11108.0 |      13588.0 |     1.22 |
> |      64 |     128 |    11089.0 |      13648.0 |     1.23 |
> |      64 |     256 |    10381.0 |      13542.0 |      1.3 |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'. It
> shows that removing WALBufMappingLock almost never harms performance and
> usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I dropped the
> minimum and maximum tps of the 6 runs and averaged the remaining ones.)
>
> An sqlite database with all the results is also attached. It also contains
> results for the patch "Several attempts to lock WALInsertLock" (named
> "attempts") and for the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" causes a measurable impact in some
> configurations with hundreds of clients. So there are more bottlenecks
> ahead ))
>
> Yes, it is still not a "real-world" benchmark. But it at least shows the
> patch is harmless.

Thank you for your experiments. Your results show up to 30% speedup on real
hardware, not tmpfs. While this is still a corner case, I think this is
quite a result for a pretty local optimization.

At small connection counts there are some cases above and below 1.0. I think
this is due to statistical error. If we calculate the average tps ratio
across the different experiments, it is still above 1.0 even for low numbers
of clients.

sqlite> select clients, avg(ratio)
   ...> from (select walseg, walbuf, recsize, clients,
   ...>              (avg(tps) filter (where branch = 'nowalbuf')) /
   ...>              (avg(tps) filter (where branch = 'master')) as ratio
   ...>       from results
   ...>       where branch in ('master', 'nowalbuf')
   ...>       group by walseg, walbuf, recsize, clients) x
   ...> group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if there are no objections.
I think the second one ("Several attempts") still needs more work, as there
are regressions.

------
Regards,
Alexander Korotkov
Supabase
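
As a hedged sketch of the "drop the best and worst of six runs" aggregation described above,
assuming the attached sqlite database stores one row per individual run and uses the same table
and column names as the query above (everything else here is an assumption), the trimmed
average could be computed as:

    -- trimmed mean per configuration and branch: discard one minimum and one
    -- maximum tps, then average the remaining runs
    select walseg, walbuf, recsize, clients, branch,
           (sum(tps) - max(tps) - min(tps)) / (count(*) - 2) as trimmed_avg_tps
    from results
    group by walseg, walbuf, recsize, clients, branch;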