Re: [RFC] Lock-free XLog Reservation from WAL - Mailing list pgsql-hackers

From:           Yura Sokolov
Subject:        Re: [RFC] Lock-free XLog Reservation from WAL
Msg-id:         8db045e9-cfd5-4b3f-b259-e7584a081294@postgrespro.ru
In response to: Re: [RFC] Lock-free XLog Reservation from WAL ("Zhou, Zhiguo" <zhiguo.zhou@intel.com>)
List:           pgsql-hackers
05.03.2025 08:39, Zhou, Zhiguo wrote:
> On 2/23/2025 8:03 PM, Yura Sokolov wrote:
>> 14.02.2025 11:41, Zhou, Zhiguo wrote:
>>> On 2/11/2025 9:25 AM, Japin Li wrote:
>>>> On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
>>>>> On 2/5/2025 4:32 PM, Japin Li wrote:
>>>>>> On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:
>>>>>>> On 1/26/2025 10:59 PM, Yura Sokolov wrote:
>>>>>>>> 24.01.2025 12:07, Japin Li wrote:
>>>>>>>>> On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>>>>>> On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>>>>>> 23.01.2025 11:46, Japin Li wrote:
>>>>>>>>>>>> On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:
>>>>>>>>>>>>> On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>>>>>>>>>>>>>> I believe I know why it happens: I was in a hurry making v2 by
>>>>>>>>>>>>>> cherry-picking the internal version. I reverted some changes in
>>>>>>>>>>>>>> CalcCuckooPositions manually and forgot to add the modulo
>>>>>>>>>>>>>> PREV_LINKS_HASH_CAPA.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's the fix:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>> -   pos->pos[1] = pos->pos[0] + 1;
>>>>>>>>>>>>>> +   pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>>     pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>> -   pos->pos[3] = pos->pos[2] + 2;
>>>>>>>>>>>>>> +   pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, here's v3:
>>>>>>>>>>>>>> - excess file "v0-0001-Increase..." removed. I believe it was the source
>>>>>>>>>>>>>>   of the whitespace apply warnings.
>>>>>>>>>>>>>> - this mistake fixed
>>>>>>>>>>>>>> - clearer slot strategies + an "8 positions in two cache-lines" strategy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
>>>>>>>>>>>>>> if it affects anything measurably.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your quick fix. I will retest it tomorrow.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, Yura Sokolov
>>>>>>>>>>>>
>>>>>>>>>>>> Here is my test result of the v3 patch:
>>>>>>>>>>>>
>>>>>>>>>>>> | case                          | min        | avg        | max        |
>>>>>>>>>>>> |-------------------------------+------------+------------+------------|
>>>>>>>>>>>> | master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
>>>>>>>>>>>> | v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
>>>>>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
>>>>>>>>>>>> | v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
>>>>>>>>>>>>
>>>>>>>>>>>> It seems there are some performance decreases :( Or is there something I missed?
>>>>>>>>>>>
>>>>>>>>>>> Hi, Japin.
>>>>>>>>>>> (Excuse me for duplicating the message; I found I sent it only to you
>>>>>>>>>>> the first time.)
>>>>>>>>>>
>>>>>>>>>> Never mind!
>>>>>>>>>>
>>>>>>>>>>> v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
>>>>>>>>>>> With only 8 in-progress inserters a spinlock is certainly better than any
>>>>>>>>>>> more complex solution.
>>>>>>>>>>>
>>>>>>>>>>> You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
>>>>>>>>>>> "master + NUM_XLOGINSERT_LOCKS=64 + v3".
>>>>>>>>>>>
>>>>>>>>>>> And even this way I don't claim "Lock-free reservation" gives any
>>>>>>>>>>> profit.
>>>>>>>>>>>
>>>>>>>>>>> That is why your benchmarking is very valuable! It could answer whether
>>>>>>>>>>> we need such a not-small patch, or whether there is no real problem at all.
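A minimal, compilable sketch of the corrected position calculation quoted above; the
struct, function signature, and capacity value here are placeholders for illustration,
not the patch's actual definitions:

    #include <stdint.h>

    /* Placeholder capacity; the real value comes from the patch. */
    #define PREV_LINKS_HASH_CAPA 61

    typedef struct CuckooPositions
    {
        uint32_t pos[4];
    } CuckooPositions;

    static void
    CalcCuckooPositions(uint32_t hash, CuckooPositions *pos)
    {
        /*
         * Every candidate slot, including the +1/+2 fallbacks, must wrap
         * around the table capacity; the missing modulo on pos[1]/pos[3]
         * could let a slot index past the end of the table.
         */
        pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
        pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
        pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
        pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
    }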
>>>>>>>>>
>>>>>>>>> Hi, Yura Sokolov
>>>>>>>>>
>>>>>>>>> Here is the test result compared with NUM_XLOGINSERT_LOCKS and the
>>>>>>>>> v3 patch.
>>>>>>>>>
>>>>>>>>> | case                  | min          | avg          | max          | rate% |
>>>>>>>>> |-----------------------+--------------+--------------+--------------+-------|
>>>>>>>>> | master (4108440)      |   891,225.77 |   904,868.75 |   913,708.17 |       |
>>>>>>>>> | lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
>>>>>>>>> | lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
>>>>>>>>> | lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
>>>>>>>>> | lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
>>>>>>>>> | lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
>>>>>>>>> | lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
>>>>>>>>> | lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
>>>>>>>>> | lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
>>>>>>>>> | lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
>>>>>>>>> | lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
>>>>>>>>> | lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
>>>>>>>>> | lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |
>>>>>>>
>>>>>>> The data looks really interesting and I recognize the need for further
>>>>>>> investigation. I'm not very familiar with BenchmarkSQL, but we've done
>>>>>>> similar tests with HammerDB/TPCC by solely increasing
>>>>>>> NUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significant
>>>>>>> performance drop of ~50%, with the cycle ratio of spinlock acquisition
>>>>>>> (s_lock) rising to over 60% of the total, which is basically consistent
>>>>>>> with the previous findings in [1].
>>>>>>>
>>>>>>> Could you please share the details of your test environment, including
>>>>>>> the device, configuration, and test approach, so we can collaborate on
>>>>>>> understanding the differences?
>>>>>>
>>>>>> Sorry for the late reply. I'm on vacation.
>>>>>>
>>>>>> I use a Hygon C86 7490 64-core; it has 8 NUMA nodes with 1.5T of memory.
>>>>>> I use NUMA nodes 0-3 to run the database and 4-7 to run BenchmarkSQL.
>>>>>>
>>>>>> Here are my database settings:
>>>>>>
>>>>>> listen_addresses = '*'
>>>>>> max_connections = '1050'
>>>>>> shared_buffers = '100GB'
>>>>>> work_mem = '64MB'
>>>>>> maintenance_work_mem = '512MB'
>>>>>> max_wal_size = '50GB'
>>>>>> min_wal_size = '10GB'
>>>>>> random_page_cost = '1.1'
>>>>>> wal_buffers = '1GB'
>>>>>> wal_level = 'minimal'
>>>>>> max_wal_senders = '0'
>>>>>> wal_sync_method = 'open_datasync'
>>>>>> wal_compression = 'lz4'
>>>>>> track_activities = 'off'
>>>>>> checkpoint_timeout = '1d'
>>>>>> checkpoint_completion_target = '0.95'
>>>>>> effective_cache_size = '300GB'
>>>>>> effective_io_concurrency = '32'
>>>>>> update_process_title = 'off'
>>>>>> password_encryption = 'md5'
>>>>>> huge_pages = 'on'
>>>>>>
>>>>>>>> Sorry for the pause, it was my birthday, so I was on a short vacation.
>>>>>>>> So, in total:
>>>>>>>> - increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
>>>>>>>> - additional lock attempts seem to help a bit in this benchmark,
>>>>>>>>   but they help more in another (rather synthetic) benchmark [1]
>>>>>>>> - my version of lock-free reservation looks to help a bit when
>>>>>>>>   applied alone, but looks strange in conjunction with additional
>>>>>>>>   lock attempts.
>>>>>>>>
>>>>>>>> I don't see that the small improvement from my version of lock-free
>>>>>>>> reservation (1.1% = 1023/1012) pays for its complexity at the moment.
>>>>>>>
>>>>>>> Due to limited hardware resources, I only had the opportunity to
>>>>>>> measure the performance impact of your v1 patch of the lock-free hash
>>>>>>> table with 64 NUM_XLOGINSERT_LOCKS and the two-lock-attempt patch. I
>>>>>>> observed an improvement of *76.4%* (RSD: 4.1%) when combining them
>>>>>>> together on the SPR with 480 vCPUs. I understand that your test
>>>>>>> devices may not have as many cores, which might be why this
>>>>>>> optimization brings an unnoticeable impact. However, I don't think
>>>>>>> this is an unreal problem. In fact, this issue was raised by our
>>>>>>> customer who is trying to deploy Postgres on devices with hundreds of
>>>>>>> cores, and I believe the resolution of this performance issue would
>>>>>>> have real impact.
>>>>>>>
>>>>>>>> Probably, when other places are optimized/improved, it will pay off
>>>>>>>> more. Or probably Zhiguo Zhou's version will perform better.
>>>>>>>
>>>>>>> Our primary difference lies in the approach to handling the prev-link,
>>>>>>> either via the hash table or directly within the XLog buffer. During
>>>>>>> my analysis, I didn't identify significant hotspots in the hash table
>>>>>>> functions, leading me to believe that both implementations should
>>>>>>> achieve comparable performance improvements.
>>>>>>>
>>>>>>> Following your advice, I revised my implementation to update the
>>>>>>> prev-link atomically and resolved the known TAP test failures. However, I
>>>>>>> encountered one last failure, in the recovery/t/027_stream_regress.pl
>>>>>>> test. Addressing this issue might require a redesign of the underlying
>>>>>>> writing convention of the XLog, which I believe is not necessary,
>>>>>>> especially since your implementation already achieves the desired
>>>>>>> performance improvements without suffering from the test failures. I
>>>>>>> think we may need to focus on your implementation in the next phase.
>>>>>>>
>>>>>>>> I think we could measure the theoretical benefit by completely ignoring
>>>>>>>> the fill of xl_prev. I've attached the patch "Dumb-lock-free..." so you
>>>>>>>> could measure. It passes almost all "recovery" tests, though it fails the
>>>>>>>> two that strictly depend on xl_prev.
>>>>>>>
>>>>>>> I currently don't have access to the high-core-count device, but I
>>>>>>> plan to measure the performance impact of your latest patch and the
>>>>>>> "Dumb-lock-free..." patch once I regain access.
>>>>>>>
>>>>>>>> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>>>>>>>>
>>>>>>>> ------
>>>>>>>> regards
>>>>>>>> Yura
>>>>>>>
>>>>>>> Hi Yura and Japin,
>>>>>>>
>>>>>>> Thanks so much for your recent patch work and discussions, which
>>>>>>> inspired me a lot! I agree with you that we need to:
>>>>>>> - Align the test approach and environment
>>>>>>> - Address the motivation and necessity of this optimization
>>>>>>> - Further identify the optimization opportunities after applying
>>>>>>>   Yura's patch
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> [1] https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj
>>>>>>>
>>>>>>> Regards,
>>>>>>> Zhiguo
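As an illustration of the "update the prev-link atomically" idea discussed above,
here is a hypothetical sketch using PostgreSQL's 64-bit atomics; the names and the
slot layout are made up for illustration and are not code from either patch:

    #include "postgres.h"
    #include "port/atomics.h"

    /* Hypothetical slot holding the prev-link published for the next record. */
    typedef struct PrevLinkSlot
    {
        pg_atomic_uint64 xl_prev;
    } PrevLinkSlot;

    /*
     * Publish the prev-link with a single compare-and-swap instead of taking
     * a spinlock around the reservation; returns false if another inserter
     * already linked this slot.
     */
    static bool
    PublishPrevLink(PrevLinkSlot *slot, uint64 expected_old, uint64 new_prev)
    {
        uint64 old = expected_old;

        return pg_atomic_compare_exchange_u64(&slot->xl_prev, &old, new_prev);
    }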
>>>>>
>>>>> Hi Japin,
>>>>>
>>>>> Apologies for the delay in responding -- I've just returned from
>>>>> vacation. To move things forward, I'll be running the BenchmarkSQL
>>>>> workload on my end shortly.
>>>>>
>>>>> In the meantime, could you run the HammerDB/TPCC workload on your
>>>>> device? We've observed significant performance improvements with this
>>>>> test, and it might help clarify whether the discrepancies we're seeing
>>>>> stem from the workload itself. Thanks!
>>>>
>>>> Sorry, I currently don't have access to the test device; I will try to test
>>>> it if I regain access.
>>>
>>> Good day, Yura and Japin!
>>>
>>> I recently acquired the SUT device again and had the opportunity to
>>> conduct performance experiments using the TPC-C benchmark (pg_count_ware
>>> 757, vu 256) with HammerDB on an Intel CPU with 480 vCPUs. Below are the
>>> results and key observations:
>>>
>>> +----------------+-------------+------------+-------------+------------+
>>> | Version        | NOPM        | NOPM Gain% | TPM         | TPM(Gain%) |
>>> +----------------+-------------+------------+-------------+------------+
>>> | master(b4a07f5)| 1,681,233   | 0.0%       | 3,874,491   | 0.0%       |
>>> | 64-lock        | 643,853     | -61.7%     | 1,479,647   | -61.8%     |
>>> | 64-lock-v4     | 2,423,972   | 44.2%      | 5,577,580   | 44.0%      |
>>> | 128-lock       | 462,993     | -72.5%     | 1,064,733   | -72.5%     |
>>> | 128-lock-v4    | 2,468,034   | 46.8%      | 5,673,349   | 46.4%      |
>>> +----------------+-------------+------------+-------------+------------+
>>>
>>> - Though the baseline (b4a07f5) has improved compared to when we started
>>>   this mailing list thread, we still achieve a 44% improvement with this
>>>   optimization.
>>> - Increasing NUM_XLOGINSERT_LOCKS alone to 64/128 leads to severe
>>>   performance regression due to intensified lock contention.
>>> - Increasing NUM_XLOGINSERT_LOCKS and applying the lock-free xlog
>>>   insertion optimization jointly improve overall performance.
>>> - 64 locks seems to be the sweet spot for achieving the most performance
>>>   improvement.
>>>
>>> I also executed the same benchmark, TPC-C, with BenchmarkSQL (I'm not
>>> sure if the difference in their implementations of TPC-C would lead to
>>> some performance gap). I observed that:
>>>
>>> - The performance indicator (NOPM) differs by several orders of magnitude
>>>   from Japin's results.
>>> - NOPM/TPM seems insensitive to code changes (lock count increase,
>>>   lock-free algorithm), which is quite strange.
>>> - Possible reasons may include: 1) the scaling parameters [1] are not
>>>   aligned, 2) the test configuration did not reach the pain point of XLog
>>>   insertion.
>>>
>>> And I noticed a 64-core device (32 cores for the server) was used in
>>> Japin's test. In our previous core-scaling test (attached), 32/64 cores
>>> may not be enough to show the impact of the optimization; I think that
>>> would be one of the reasons why Japin observed minimal impact from the
>>> lock-free optimization.
>>>
>>> In summary, I think:
>>> - The TPC-C benchmark (pg_count_ware 757, vu 256) with HammerDB
>>>   effectively reflects performance in XLog insertions.
>>> - This test on a device with hundreds of cores reflects a real user
>>>   scenario, making it a significant consideration.
>>> - The lock-free algorithm with the lock count increased to 64 can bring
>>>   significant performance improvements.
>>>
>>> So I propose to continue the code review process for this optimization
>>> patch. WDYT?
>>>
>>> [1] https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md#scaling-parameters
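For context on the "64-lock"/"128-lock" builds benchmarked above: in current sources
the insertion-lock count is a plain compile-time constant in
src/backend/access/transam/xlog.c, so the change under test is essentially just
(sketch, shown diff-style like the fix earlier in the thread):

    /* src/backend/access/transam/xlog.c */
    -#define NUM_XLOGINSERT_LOCKS  8
    +#define NUM_XLOGINSERT_LOCKS  64    /* or 128 for the "128-lock" builds */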
>> Good day.
>>
>> I'll just repeat my answer from personal mail:
>>
>> I'm impressed with the results. I really didn't expect it to be so important
>> for huge servers.
>>
>> The main problem will be to prove this patch doesn't harm performance on
>> smaller servers, or to make things configurable so that a smaller server
>> still uses the simpler code.
>>
>> -------
>> regards
>> Yura Sokolov aka funny-falcon
>
> Good day, Yura!
>
> Firstly, I'd like to apologize for the delayed response due to internal
> urgencies and the time required to set up the tests on a new device.
>
> Regarding your concerns about the potential negative impact on
> performance, I have conducted further evaluations. To assess the
> performance impact of the patch on a smaller device, I located another
> device with significantly fewer processors. Using the same database and
> test configurations (TPC-C benchmark: pg_count_ware 757, vu 256) and
> code bases (b4a07f5 as "base" and the v4 patch with 64 locks as "opt"), I
> performed core-scaling tests ranging from 8 to 64 physical cores in
> steps of 8. The results (attached) indicate that the optimization does
> not lead to performance regression within this low core count range.
>
> Please kindly let me know if more data is required to move the process
> forward.
>
> I look forward to your insights.

Good day, Zhiguo.

Thank you a lot for testing!

I will validate on the servers I have access to (and on a notebook).

To be honest, I didn't benchmark v4, and it fixes a cache-line sharing issue I
mistakenly introduced in the previous version. So probably it really doesn't
affect performance the way v3 did.

-------
regards
Yura Sokolov aka funny-falcon
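For readers unfamiliar with the "cache-line sharing issue" mentioned above: it is the
usual false-sharing pattern, and the usual cure is to pad or align each
concurrently-updated slot to its own cache line. A generic C11 sketch of that cure,
with an assumed 64-byte line size, not code from the v3/v4 patches:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64          /* assumed typical line size */

    /*
     * Each slot occupies its own cache line, so writers on different slots
     * do not keep invalidating one another's lines.
     */
    typedef struct PaddedSlot
    {
        alignas(CACHE_LINE_SIZE) uint64_t value;
        char pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
    } PaddedSlot;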