Re: [RFC] Lock-free XLog Reservation from WAL - Mailing list pgsql-hackers

From:           Yura Sokolov
Subject:        Re: [RFC] Lock-free XLog Reservation from WAL
Msg-id:         8db045e9-cfd5-4b3f-b259-e7584a081294@postgrespro.ru
In response to: Re: [RFC] Lock-free XLog Reservation from WAL ("Zhou, Zhiguo" <zhiguo.zhou@intel.com>)
List:           pgsql-hackers
05.03.2025 08:39, Zhou, Zhiguo wrote:
> On 2/23/2025 8:03 PM, Yura Sokolov wrote:
>> 14.02.2025 11:41, Zhou, Zhiguo wrote:
>>> On 2/11/2025 9:25 AM, Japin Li wrote:
>>>> On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
>>>>> On 2/5/2025 4:32 PM, Japin Li wrote:
>>>>>> On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
>>>>>>> On 1/26/2025 10:59 PM, Yura Sokolov wrote:
>>>>>>>> 24.01.2025 12:07, Japin Li wrote:
>>>>>>>>> On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>>>>>> On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>>>>>> 23.01.2025 11:46, Japin Li wrote:
>>>>>>>>>>>> On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>>>>>>>>> On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>>>>>>>>> I believe I know why it happens: I was in a hurry making v2 by
>>>>>>>>>>>>>> cherry-picking the internal version. I reverted some changes in
>>>>>>>>>>>>>> CalcCuckooPositions manually and forgot to add the modulo
>>>>>>>>>>>>>> PREV_LINKS_HASH_CAPA.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's the fix:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>> -   pos->pos[1] = pos->pos[0] + 1;
>>>>>>>>>>>>>> +   pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>>     pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>> -   pos->pos[3] = pos->pos[2] + 2;
>>>>>>>>>>>>>> +   pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, here's v3:
>>>>>>>>>>>>>> - excess file "v0-0001-Increase..." removed. I believe it was the source
>>>>>>>>>>>>>>   of the whitespace apply warnings.
>>>>>>>>>>>>>> - this mistake fixed
>>>>>>>>>>>>>> - clearer slot strategies + an "8 positions in two cache-lines" strategy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
>>>>>>>>>>>>>> if it affects anything measurably.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your quick fix. I will retest it tomorrow.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, Yura Sokolov
>>>>>>>>>>>>
>>>>>>>>>>>> Here is my test result of the v3 patch:
>>>>>>>>>>>>
>>>>>>>>>>>> | case                          | min        | avg        | max        |
>>>>>>>>>>>> |-------------------------------+------------+------------+------------|
>>>>>>>>>>>> | master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
>>>>>>>>>>>> | v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
>>>>>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
>>>>>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
>>>>>>>>>>>>
>>>>>>>>>>>> It seems there are some performance decreases :( Or is there something I missed?
>>>>>>>>>>>
>>>>>>>>>>> Hi, Japin.
>>>>>>>>>>> (Excuse me for duplicating the message; I found I sent it only to you
>>>>>>>>>>> the first time.)
>>>>>>>>>>
>>>>>>>>>> Never mind!
>>>>>>>>>>
>>>>>>>>>>> v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
>>>>>>>>>>> With only 8 in-progress inserters a spinlock is certainly better than any
>>>>>>>>>>> more complex solution.
>>>>>>>>>>>
>>>>>>>>>>> You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
>>>>>>>>>>> "master + NUM_XLOGINSERT_LOCKS=64 + v3".
>>>>>>>>>>>
>>>>>>>>>>> And even this way I don't claim "Lock-free reservation" gives any
>>>>>>>>>>> profit.
>>>>>>>>>>>
>>>>>>>>>>> That is why your benchmarking is very valuable! It could answer whether
>>>>>>>>>>> we need such a not-small patch, or whether there is no real problem at all.
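A minimal, compilable sketch of the corrected position calculation quoted above; the
struct, function signature, and capacity value here are placeholders for illustration,
not the patch's actual definitions:

    #include <stdint.h>

    /* Placeholder capacity; the real value comes from the patch. */
    #define PREV_LINKS_HASH_CAPA 61

    typedef struct CuckooPositions
    {
        uint32_t pos[4];
    } CuckooPositions;

    static void
    CalcCuckooPositions(uint32_t hash, CuckooPositions *pos)
    {
        /*
         * Every candidate slot, including the +1/+2 fallbacks, must wrap
         * around the table capacity; the missing modulo on pos[1]/pos[3]
         * could let a slot index past the end of the table.
         */
        pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
        pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
        pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
        pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
    }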
>>>>>>>>>
>>>>>>>>> Hi, Yura Sokolov
>>>>>>>>>
>>>>>>>>> Here is the test result compared with NUM_XLOGINSERT_LOCKS and the
>>>>>>>>> v3 patch.
>>>>>>>>>
>>>>>>>>> | case                  | min          | avg          | max          | rate% |
>>>>>>>>> |-----------------------+--------------+--------------+--------------+-------|
>>>>>>>>> | master (4108440)      |   891,225.77 |   904,868.75 |   913,708.17 |       |
>>>>>>>>> | lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
>>>>>>>>> | lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
>>>>>>>>> | lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
>>>>>>>>> | lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
>>>>>>>>> | lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
>>>>>>>>> | lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
>>>>>>>>> | lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
>>>>>>>>> | lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
>>>>>>>>> | lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
>>>>>>>>> | lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
>>>>>>>>> | lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
>>>>>>>>> | lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |
>>>>>>>
>>>>>>> The data looks really interesting and I recognize the need for further
>>>>>>> investigation. I'm not very familiar with BenchmarkSQL, but we've done
>>>>>>> similar tests with HammerDB/TPCC by solely increasing
>>>>>>> NUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significant
>>>>>>> performance drop of ~50%, with the cycle ratio of spinlock acquisition
>>>>>>> (s_lock) rising to over 60% of the total, which is basically consistent
>>>>>>> with the previous findings in [1].
>>>>>>>
>>>>>>> Could you please share the details of your test environment, including
>>>>>>> the device, configuration, and test approach, so we can collaborate on
>>>>>>> understanding the differences?
>>>>>>
>>>>>> Sorry for the late reply. I'm on vacation.
>>>>>>
>>>>>> I use a Hygon C86 7490 64-core; it has 8 NUMA nodes with 1.5T of memory.
>>>>>> I use NUMA nodes 0-3 to run the database and 4-7 to run BenchmarkSQL.
>>>>>>
>>>>>> Here are my database settings:
>>>>>>
>>>>>> listen_addresses = '*'
>>>>>> max_connections = '1050'
>>>>>> shared_buffers = '100GB'
>>>>>> work_mem = '64MB'
>>>>>> maintenance_work_mem = '512MB'
>>>>>> max_wal_size = '50GB'
>>>>>> min_wal_size = '10GB'
>>>>>> random_page_cost = '1.1'
>>>>>> wal_buffers = '1GB'
>>>>>> wal_level = 'minimal'
>>>>>> max_wal_senders = '0'
>>>>>> wal_sync_method = 'open_datasync'
>>>>>> wal_compression = 'lz4'
>>>>>> track_activities = 'off'
>>>>>> checkpoint_timeout = '1d'
>>>>>> checkpoint_completion_target = '0.95'
>>>>>> effective_cache_size = '300GB'
>>>>>> effective_io_concurrency = '32'
>>>>>> update_process_title = 'off'
>>>>>> password_encryption = 'md5'
>>>>>> huge_pages = 'on'
>>>>>>
>>>>>>>> Sorry for the pause, it was my birthday, so I was on a short vacation.
>>>>>>>> So, in total:
>>>>>>>> - increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
>>>>>>>> - additional lock attempts seem to help a bit in this benchmark,
>>>>>>>>   but they help more in another (rather synthetic) benchmark [1]
>>>>>>>> - my version of lock-free reservation looks to help a bit when
>>>>>>>>   applied alone, but looks strange in conjunction with additional
>>>>>>>>   lock attempts.
>>>>>>>>
>>>>>>>> I don't see that the small improvement from my version of lock-free
>>>>>>>> reservation (1.1% = 1023/1012) pays for its complexity at the moment.
>>>>>>>
>>>>>>> Due to limited hardware resources, I only had the opportunity to
>>>>>>> measure the performance impact of your v1 patch of the lock-free hash
>>>>>>> table with 64 NUM_XLOGINSERT_LOCKS and the two-lock-attempt patch. I
>>>>>>> observed an improvement of *76.4%* (RSD: 4.1%) when combining them
>>>>>>> together on the SPR with 480 vCPUs. I understand that your test
>>>>>>> devices may not have as many cores, which might be why this
>>>>>>> optimization brings an unnoticeable impact. However, I don't think
>>>>>>> this is an unreal problem. In fact, this issue was raised by our
>>>>>>> customer who is trying to deploy Postgres on devices with hundreds of
>>>>>>> cores, and I believe the resolution of this performance issue would
>>>>>>> have real impact.
>>>>>>>
>>>>>>>> Probably, when other places are optimized/improved, it will pay off
>>>>>>>> more. Or probably Zhiguo Zhou's version will perform better.
>>>>>>>
>>>>>>> Our primary difference lies in the approach to handling the prev-link,
>>>>>>> either via the hash table or directly within the XLog buffer. During
>>>>>>> my analysis, I didn't identify significant hotspots in the hash table
>>>>>>> functions, leading me to believe that both implementations should
>>>>>>> achieve comparable performance improvements.
>>>>>>>
>>>>>>> Following your advice, I revised my implementation to update the
>>>>>>> prev-link atomically and resolved the known TAP test failures. However, I
>>>>>>> encountered one last failure, in the recovery/t/027_stream_regress.pl
>>>>>>> test. Addressing this issue might require a redesign of the underlying
>>>>>>> writing convention of the XLog, which I believe is not necessary,
>>>>>>> especially since your implementation already achieves the desired
>>>>>>> performance improvements without suffering from the test failures. I
>>>>>>> think we may need to focus on your implementation in the next phase.
>>>>>>>
>>>>>>>> I think we could measure the theoretical benefit by completely ignoring
>>>>>>>> the fill of xl_prev. I've attached the patch "Dumb-lock-free..." so you
>>>>>>>> could measure. It passes almost all "recovery" tests, though it fails the
>>>>>>>> two that strictly depend on xl_prev.
>>>>>>>
>>>>>>> I currently don't have access to the high-core-count device, but I
>>>>>>> plan to measure the performance impact of your latest patch and the
>>>>>>> "Dumb-lock-free..." patch once I regain access.
>>>>>>>
>>>>>>>> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>>>>>>>>
>>>>>>>> ------
>>>>>>>> regards
>>>>>>>> Yura
>>>>>>>
>>>>>>> Hi Yura and Japin,
>>>>>>>
>>>>>>> Thanks so much for your recent patch work and discussions, which
>>>>>>> inspired me a lot! I agree with you that we need to:
>>>>>>> - Align the test approach and environment
>>>>>>> - Address the motivation and necessity of this optimization
>>>>>>> - Further identify the optimization opportunities after applying
>>>>>>>   Yura's patch
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> [1] https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj
>>>>>>>
>>>>>>> Regards,
>>>>>>> Zhiguo
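As an illustration of the "update the prev-link atomically" idea discussed above,
here is a hypothetical sketch using PostgreSQL's 64-bit atomics; the names and the
slot layout are made up for illustration and are not code from either patch:

    #include "postgres.h"
    #include "port/atomics.h"

    /* Hypothetical slot holding the prev-link published for the next record. */
    typedef struct PrevLinkSlot
    {
        pg_atomic_uint64 xl_prev;
    } PrevLinkSlot;

    /*
     * Publish the prev-link with a single compare-and-swap instead of taking
     * a spinlock around the reservation; returns false if another inserter
     * already linked this slot.
     */
    static bool
    PublishPrevLink(PrevLinkSlot *slot, uint64 expected_old, uint64 new_prev)
    {
        uint64 old = expected_old;

        return pg_atomic_compare_exchange_u64(&slot->xl_prev, &old, new_prev);
    }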
>>>>>
>>>>> Hi Japin,
>>>>>
>>>>> Apologies for the delay in responding -- I've just returned from
>>>>> vacation. To move things forward, I'll be running the BenchmarkSQL
>>>>> workload on my end shortly.
>>>>>
>>>>> In the meantime, could you run the HammerDB/TPCC workload on your
>>>>> device? We've observed significant performance improvements with this
>>>>> test, and it might help clarify whether the discrepancies we're seeing
>>>>> stem from the workload itself. Thanks!
>>>>
>>>> Sorry, I currently don't have access to the test device; I will try to test
>>>> it if I regain access.
>>>
>>> Good day, Yura and Japin!
>>>
>>> I recently acquired the SUT device again and had the opportunity to
>>> conduct performance experiments using the TPC-C benchmark (pg_count_ware
>>> 757, vu 256) with HammerDB on an Intel CPU with 480 vCPUs. Below are the
>>> results and key observations:
>>>
>>> +----------------+-------------+------------+-------------+------------+
>>> | Version        | NOPM        | NOPM Gain% | TPM         | TPM(Gain%) |
>>> +----------------+-------------+------------+-------------+------------+
>>> | master(b4a07f5)| 1,681,233   | 0.0%       | 3,874,491   | 0.0%       |
>>> | 64-lock        | 643,853     | -61.7%     | 1,479,647   | -61.8%     |
>>> | 64-lock-v4     | 2,423,972   | 44.2%      | 5,577,580   | 44.0%      |
>>> | 128-lock       | 462,993     | -72.5%     | 1,064,733   | -72.5%     |
>>> | 128-lock-v4    | 2,468,034   | 46.8%      | 5,673,349   | 46.4%      |
>>> +----------------+-------------+------------+-------------+------------+
>>>
>>> - Though the baseline (b4a07f5) has improved compared to when we started
>>>   this mailing list thread, we still achieve a 44% improvement with this
>>>   optimization.
>>> - Increasing NUM_XLOGINSERT_LOCKS alone to 64/128 leads to severe
>>>   performance regression due to intensified lock contention.
>>> - Increasing NUM_XLOGINSERT_LOCKS and applying the lock-free xlog
>>>   insertion optimization jointly improve overall performance.
>>> - 64 locks seems to be the sweet spot for achieving the most performance
>>>   improvement.
>>>
>>> I also executed the same benchmark, TPC-C, with BenchmarkSQL (I'm not
>>> sure if the difference in their implementations of TPC-C would lead to
>>> some performance gap). I observed that:
>>>
>>> - The performance indicator (NOPM) differs by several orders of magnitude
>>>   from Japin's results.
>>> - NOPM/TPM seems insensitive to code changes (lock count increase,
>>>   lock-free algorithm), which is quite strange.
>>> - Possible reasons may include: 1) the scaling parameters [1] are not
>>>   aligned, 2) the test configuration did not reach the pain point of XLog
>>>   insertion.
>>>
>>> And I noticed a 64-core device (32 cores for the server) was used in
>>> Japin's test. In our previous core-scaling test (attached), 32/64 cores
>>> may not be enough to show the impact of the optimization; I think that
>>> would be one of the reasons why Japin observed minimal impact from the
>>> lock-free optimization.
>>>
>>> In summary, I think:
>>> - The TPC-C benchmark (pg_count_ware 757, vu 256) with HammerDB
>>>   effectively reflects performance in XLog insertions.
>>> - This test on a device with hundreds of cores reflects a real user
>>>   scenario, making it a significant consideration.
>>> - The lock-free algorithm with the lock count increased to 64 can bring
>>>   significant performance improvements.
>>>
>>> So I propose to continue the code review process for this optimization
>>> patch. WDYT?
>>>
>>> [1] https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md#scaling-parameters
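For context on the "64-lock"/"128-lock" builds benchmarked above: in current sources
the insertion-lock count is a plain compile-time constant in
src/backend/access/transam/xlog.c, so the change under test is essentially just
(sketch, shown diff-style like the fix earlier in the thread):

    /* src/backend/access/transam/xlog.c */
    -#define NUM_XLOGINSERT_LOCKS  8
    +#define NUM_XLOGINSERT_LOCKS  64    /* or 128 for the "128-lock" builds */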
>> Good day.
>>
>> I'll just repeat my answer from personal mail:
>>
>> I'm impressed with the results. I really didn't expect it to be so important
>> for huge servers.
>>
>> The main problem will be to prove this patch doesn't harm performance on
>> smaller servers, or to make things configurable so that a smaller server
>> still uses the simpler code.
>>
>> -------
>> regards
>> Yura Sokolov aka funny-falcon
>
> Good day, Yura!
>
> Firstly, I'd like to apologize for the delayed response due to internal
> urgencies and the time required to set up the tests on a new device.
>
> Regarding your concerns about the potential negative impact on
> performance, I have conducted further evaluations. To assess the
> performance impact of the patch on a smaller device, I located another
> device with significantly fewer processors. Using the same database and
> test configurations (TPC-C benchmark: pg_count_ware 757, vu 256) and
> code bases (b4a07f5 as "base" and the v4 patch with 64 locks as "opt"), I
> performed core-scaling tests ranging from 8 to 64 physical cores in
> steps of 8. The results (attached) indicate that the optimization does
> not lead to performance regression within this low core count range.
>
> Please kindly let me know if more data is required to move the process
> forward.
>
> I look forward to your insights.

Good day, Zhiguo.

Thank you a lot for testing!

I will validate on the servers I have access to (and on a notebook).

To be honest, I didn't benchmark v4, and it fixes a cache-line sharing issue I
mistakenly introduced in the previous version. So probably it really doesn't
affect performance the way v3 did.

-------
regards
Yura Sokolov aka funny-falcon
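For readers unfamiliar with the "cache-line sharing issue" mentioned above: it is the
usual false-sharing pattern, and the usual cure is to pad or align each
concurrently-updated slot to its own cache line. A generic C11 sketch of that cure,
with an assumed 64-byte line size, not code from the v3/v4 patches:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64          /* assumed typical line size */

    /*
     * Each slot occupies its own cache line, so writers on different slots
     * do not keep invalidating one another's lines.
     */
    typedef struct PaddedSlot
    {
        alignas(CACHE_LINE_SIZE) uint64_t value;
        char pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
    } PaddedSlot;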