Re: [RFC] Lock-free XLog Reservation from WAL - Mailing list pgsql-hackers
From | Japin Li
---|---
Subject | Re: [RFC] Lock-free XLog Reservation from WAL
Date |
Msg-id | ME0P300MB0445036B2C0FB03769E901B8B6FD2@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
In response to | Re: [RFC] Lock-free XLog Reservation from WAL ("Zhou, Zhiguo" <zhiguo.zhou@intel.com>)
List | pgsql-hackers
On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
> On 2/5/2025 4:32 PM, Japin Li wrote:
>> On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
>>> On 1/26/2025 10:59 PM, Yura Sokolov wrote:
>>>> On 24.01.2025 12:07, Japin Li wrote:
>>>>> On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>> On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>> On 23.01.2025 11:46, Japin Li wrote:
>>>>>>>> On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>>>>> On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>>>>> I believe I know why it happens: I was in a hurry making v2 by
>>>>>>>>>> cherry-picking the internal version. I reverted some changes in
>>>>>>>>>> CalcCuckooPositions manually and forgot to add modulo
>>>>>>>>>> PREV_LINKS_HASH_CAPA.
>>>>>>>>>>
>>>>>>>>>> Here's the fix:
>>>>>>>>>>
>>>>>>>>>>      pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
>>>>>>>>>> -    pos->pos[1] = pos->pos[0] + 1;
>>>>>>>>>> +    pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>      pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>> -    pos->pos[3] = pos->pos[2] + 2;
>>>>>>>>>> +    pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>
>>>>>>>>>> Anyway, here's v3:
>>>>>>>>>> - excess file "v0-0001-Increase..." removed. I believe it was the
>>>>>>>>>>   source of the white-space apply warnings.
>>>>>>>>>> - this mistake fixed
>>>>>>>>>> - clearer slot strategies + an "8 positions in two cache-lines"
>>>>>>>>>>   strategy.
>>>>>>>>>>
>>>>>>>>>> You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and
>>>>>>>>>> see if it affects anything measurably.
>>>>>>>>>
>>>>>>>>> Thanks for your quick fix. I will retest it tomorrow.
>>>>>>>>
>>>>>>>> Hi, Yura Sokolov
>>>>>>>>
>>>>>>>> Here is my test result of the v3 patch:
>>>>>>>>
>>>>>>>> | case                          | min        | avg        | max        |
>>>>>>>> |-------------------------------+------------+------------+------------|
>>>>>>>> | master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
>>>>>>>> | v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
>>>>>>>>
>>>>>>>> It seems there are some performance decreases :( Or is there something
>>>>>>>> I missed?
>>>>>>>>
>>>>>>>
>>>>>>> Hi, Japin.
>>>>>>> (Excuse me for the duplicated message; I found I had sent it only to
>>>>>>> you the first time.)
>>>>>>>
>>>>>> Never mind!
>>>>>>
>>>>>>> v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
>>>>>>> With only 8 in-progress inserters, a spin-lock is certainly better than
>>>>>>> any more complex solution.
>>>>>>>
>>>>>>> You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
>>>>>>> "master + NUM_XLOGINSERT_LOCKS=64 + v3".
>>>>>>>
>>>>>>> And even this way I don't claim "Lock-free reservation" gives any
>>>>>>> profit.
>>>>>>>
>>>>>>> That is why your benchmarking is very valuable! It could answer whether
>>>>>>> we need such a not-small patch, or whether there is no real problem at
>>>>>>> all.
>>>>>>>
>>>>>
>>>>> Hi, Yura Sokolov
>>>>>
>>>>> Here is the test result comparing NUM_XLOGINSERT_LOCKS and the v3 patch.
>>>>>
>>>>> | case                  | min          | avg          | max          | rate% |
>>>>> |-----------------------+--------------+--------------+--------------+-------|
>>>>> | master (4108440)      | 891,225.77   | 904,868.75   | 913,708.17   |       |
>>>>> | lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
>>>>> | lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
>>>>> | lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
>>>>> | lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
>>>>> | lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
>>>>> | lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
>>>>> | lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
>>>>> | lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
>>>>> | lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
>>>>> | lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
>>>>> | lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
>>>>> | lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |
>>>
>>> The data looks really interesting and I recognize the need for further
>>> investigation. I'm not very familiar with BenchmarkSQL, but we've done
>>> similar tests with HammerDB/TPCC by solely increasing NUM_XLOGINSERT_LOCKS
>>> from 8 to 128, and we observed a significant performance drop of ~50%; the
>>> cycle ratio of spinlock acquisition (s_lock) rose to over 60% of the
>>> total, which is basically consistent with the previous findings in [1].
>>>
>>> Could you please share the details of your test environment, including the
>>> device, configuration, and test approach, so we can collaborate on
>>> understanding the differences?
>>
>> Sorry for the late reply. I'm on vacation.
>>
>> I use a Hygon C86 7490 64-core machine; it has 8 NUMA nodes and 1.5 TB of
>> memory. I use nodes 0-3 to run the database and nodes 4-7 to run
>> BenchmarkSQL.
>>
>> Here are my database settings:
>>
>> listen_addresses = '*'
>> max_connections = '1050'
>> shared_buffers = '100GB'
>> work_mem = '64MB'
>> maintenance_work_mem = '512MB'
>> max_wal_size = '50GB'
>> min_wal_size = '10GB'
>> random_page_cost = '1.1'
>> wal_buffers = '1GB'
>> wal_level = 'minimal'
>> max_wal_senders = '0'
>> wal_sync_method = 'open_datasync'
>> wal_compression = 'lz4'
>> track_activities = 'off'
>> checkpoint_timeout = '1d'
>> checkpoint_completion_target = '0.95'
>> effective_cache_size = '300GB'
>> effective_io_concurrency = '32'
>> update_process_title = 'off'
>> password_encryption = 'md5'
>> huge_pages = 'on'
>>
>>>> Sorry for the pause; it was my birthday, so I was on a short vacation.
>>>>
>>>> So, in total:
>>>> - increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
>>>> - additional lock attempts seem to help a bit in this benchmark, but they
>>>>   help more in another (rather synthetic) benchmark [1]
>>>> - my version of lock-free reservation looks to help a bit when applied
>>>>   alone, but looks strange in conjunction with additional lock attempts.
>>>>
>>>> I don't see that the small improvement from my version of lock-free
>>>> reservation (1.1% = 1023/1012) pays for its complexity at the moment.
>>>
>>> Due to limited hardware resources, I only had the opportunity to measure
>>> the performance impact of your v1 patch of the lock-free hash table with
>>> 64 NUM_XLOGINSERT_LOCKS and the two-lock-attempt patch. I observed an
>>> improvement of *76.4%* (RSD: 4.1%) when combining them together on the SPR
>>> with 480 vCPUs. I understand that your test devices may not have as many
>>> cores, which might be why this optimization brings no noticeable impact
>>> there. However, I don't think this is an unreal problem. In fact, this
>>> issue was raised by our customer, who is trying to deploy Postgres on
>>> devices with hundreds of cores, and I believe the resolution of this
>>> performance issue would have a real impact.
>>>
>>>> Probably, when other places are optimized/improved, it will pay more.
>>>> Or probably Zhiguo Zhou's version will perform better.
>>>>
>>>
>>> Our primary difference lies in the approach to handling the prev-link,
>>> either via the hash table or directly within the XLog buffer. During my
>>> analysis, I didn't identify significant hotspots in the hash table
>>> functions, leading me to believe that both implementations should achieve
>>> comparable performance improvements.
>>>
>>> Following your advice, I revised my implementation to update the prev-link
>>> atomically and resolved the known TAP test failures. However, I
>>> encountered one last failure, in the recovery/t/027_stream_regress.pl
>>> test. Addressing it might require a redesign of the underlying writing
>>> convention of XLog, which I believe is not necessary, especially since
>>> your implementation already achieves the desired performance improvements
>>> without suffering from the test failures. I think we may need to focus on
>>> your implementation in the next phase.
>>>
>>>> I think we could measure the theoretical benefit by completely ignoring
>>>> the fill of xl_prev. I've attached the patch "Dumb-lock-free..." so you
>>>> could measure it. It passes almost all "recovery" tests, though it fails
>>>> two that strictly depend on xl_prev.
>>>
>>> I currently don't have access to the high-core-count device, but I plan to
>>> measure the performance impact of your latest patch and the
>>> "Dumb-lock-free..." patch once I regain access.
>>>
>>>> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>>>>
>>>> ------
>>>> regards
>>>> Yura
>>>
>>> Hi Yura and Japin,
>>>
>>> Thanks so much for your recent patch work and discussions, which inspired
>>> me a lot! I agree with you that we need to:
>>> - Align the test approach and environment
>>> - Address the motivation and necessity of this optimization
>>> - Further identify the optimization opportunities after applying Yura's
>>>   patch
>>>
>>> WDYT?
>>>
>>> [1] https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj
>>>
>>> Regards,
>>> Zhiguo
>>
>
> Hi Japin,
>
> Apologies for the delay in responding -- I've just returned from vacation.
> To move things forward, I'll be running the BenchmarkSQL workload on my end
> shortly.
>
> In the meantime, could you run the HammerDB/TPCC workload on your device?
> We've observed significant performance improvements with this test, and it
> might help clarify whether the discrepancies we're seeing stem from the
> workload itself. Thanks!
>

Sorry, I currently don't have access to the test device; I will try to test
it if I regain access.

--
Regards,
Japin Li
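[Editor's note] For readers following the CalcCuckooPositions fix quoted upthread, below is a minimal standalone sketch of the corrected wrap-around position calculation. The capacity value, the struct name, and the small test in main() are illustrative assumptions and are not taken from the actual patch; only the four modulo expressions mirror the quoted diff.

    /*
     * Illustrative sketch only: shows why every derived cuckoo slot must be
     * reduced modulo PREV_LINKS_HASH_CAPA, as in the fix quoted upthread.
     * Capacity, struct name, and test harness are assumptions.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define PREV_LINKS_HASH_CAPA 32   /* assumed capacity, for illustration */

    typedef struct CuckooPositions
    {
        uint32_t pos[4];
    } CuckooPositions;

    static void
    calc_cuckoo_positions(uint32_t hash, CuckooPositions *pos)
    {
        pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
        /* Without the modulo, pos[1] could equal PREV_LINKS_HASH_CAPA,
         * one slot past the end of the table. */
        pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
        pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
        /* Likewise, pos[3] could otherwise be up to two slots past the end. */
        pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
    }

    int
    main(void)
    {
        CuckooPositions p;

        /* A hash that lands on the last slot demonstrates the wrap-around. */
        calc_cuckoo_positions(PREV_LINKS_HASH_CAPA - 1, &p);
        printf("%u %u %u %u\n",
               (unsigned) p.pos[0], (unsigned) p.pos[1],
               (unsigned) p.pos[2], (unsigned) p.pos[3]);
        return 0;
    }

This is the same arithmetic as the two-line diff in the quoted message, just wrapped in a compilable harness.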