Thread: Re: Get rid of WALBufMappingLock
On Fri, Feb 7, 2025 at 1:24 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 07.02.2025 14:02, Yura Sokolov writes: > > 07.02.2025 01:26, Alexander Korotkov writes: > >> Hi! > >> > >> On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>> > >>> During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund > >>> used a benchmark which creates WAL records very intensively. While I think > >>> it is not completely fair (1MB log records are really rare), it pushed > >>> me to analyze the write-side waiting of the XLog machinery. > >>> > >>> First I tried to optimize WaitXLogInsertionsToFinish, but without great > >>> success (yet). > >>> > >>> While profiling, I found that a lot of time is spent in the memory clearing > >>> under the global WALBufMappingLock: > >>> > >>> MemSet((char *) NewPage, 0, XLOG_BLCKSZ); > >>> > >>> It is an obvious scalability bottleneck. > >>> > >>> So "challenge was accepted". > >>> > >>> Certainly, a backend should initialize pages without the exclusive lock. But > >>> then how to ensure pages were initialized? In other words, how to > >>> ensure XLogCtl->InitializedUpTo is correct. > >>> > >>> I've tried to play around with WALBufMappingLock, holding it for a short > >>> time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found > >>> WALBufMappingLock is not needed at all. > >>> > >>> Instead of holding the lock, it is better to allow backends to cooperate: > >>> - I bound a ConditionVariable to each xlblocks entry, > >>> - every backend now checks that every required block pointed to by > >>> InitializedUpTo was successfully initialized, or sleeps on its condvar, > >>> - when a backend is sure the block is initialized, it tries to update > >>> InitializedUpTo using the condition variable. > >> > >> Looks reasonable to me, but having a ConditionVariable per xlog buffer > >> seems like overkill to me. Find attached a revision where I've > >> implemented advancing InitializedUpTo without a ConditionVariable. 
> >> After initialization of each buffer there is an attempt to do a CAS for > >> InitializedUpTo in a loop. So, multiple processes will try to advance > >> InitializedUpTo; they could hijack the initiative from each other, but > >> there is always a leader which will finish the work. > >> > >> There is only one ConditionVariable to wait for InitializedUpTo being advanced. > >> > >> I didn't benchmark my version, just checked that tests passed. > > > > Good day, Alexander. > > It seems I was mistaken twice. > > > I've got mixed but quite close results for both approaches (single or many > > ConditionVariables) on the notebook. Since I have no access to a larger > > machine, I can't prove "many" is way better (or discover it is worse). > > 1. "many condvars" (my variant) is strictly worse with num locks = 64 and > when pg_logical_emit_message emits just 1kB instead of 1MB. > > Therefore, "single condvar" is strictly better. > > > Given the patch after cleanup looks a bit smaller and clearer, I agree to keep > > just a single condition variable. > > > > The cleaned version is attached. > > > > I've changed the condition for broadcast a bit ("less" instead of "not equal"): > > - the buffer's border may already have gone into the future, > > - and then another backend will reach a not-yet-initialized buffer and will > > broadcast. > > 2. I've inserted an abort if the "buffer's border went into the future", and wasn't able to trigger it. > > So I returned the update-loop's body to your variant. Good, thank you. I think the 0001 patch is generally good, but needs some further polishing, e.g. more comments explaining how it works. Regarding the 0002 patch, it looks generally reasonable. But are 2 attempts always optimal? Are there cases of regression, or cases when more attempts are even better? Could we have some self-adjusting mechanism there, like what we have for spinlocks? ------ Regards, Alexander Korotkov Supabase
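The CAS-based advance loop described above can be sketched roughly as follows (a minimal, hypothetical model in plain C11 atomics — NBUFFERS, the index arithmetic, and the function name are illustrative stand-ins, not the actual xlog.c code):

```c
#include <stdatomic.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192
#define NBUFFERS    4

/* Hypothetical stand-ins for the XLogCtl fields discussed above. */
static _Atomic uint64_t InitializedUpTo;
static _Atomic uint64_t xlblocks[NBUFFERS]; /* end-LSN of each initialized page */

/*
 * After initializing a buffer, any backend may try to advance
 * InitializedUpTo past all consecutively initialized buffers.  Backends
 * can hijack the initiative from each other via CAS, but some backend
 * always completes the advance.
 */
static void
advance_initialized_upto(void)
{
	uint64_t	upto = atomic_load(&InitializedUpTo);

	for (;;)
	{
		int			idx = (int) ((upto / XLOG_BLCKSZ) % NBUFFERS);
		uint64_t	bound = atomic_load(&xlblocks[idx]);

		/* next page not initialized up to its expected border: stop */
		if (bound != upto + XLOG_BLCKSZ)
			break;

		/*
		 * Try to advance.  On failure another backend moved it; the CAS
		 * reloads the current value into upto, so simply retry.
		 */
		if (atomic_compare_exchange_strong(&InitializedUpTo, &upto, bound))
			upto = bound;
	}
}
```

In the real patch a single ConditionVariable is then broadcast so that waiters re-check the advanced value.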
On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > Good, thank you. I think the 0001 patch is generally good, but needs some > further polishing, e.g. more comments explaining how it works. Two things come to my mind worth rechecking about 0001. 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and XLogCtl->xlblocks always page-aligned? Because the algorithm seems to be sensitive to that. If so, I would propose to explicitly comment that and add corresponding asserts. 2) Check if there are concurrency issues between AdvanceXLInsertBuffer() and switching to the new WAL file. ------ Regards, Alexander Korotkov Supabase
13.02.2025 12:34, Alexander Korotkov writes: > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: >> 08.02.2025 13:07, Alexander Korotkov writes: >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: >>>> Good, thank you. I think the 0001 patch is generally good, but needs some >>>> further polishing, e.g. more comments explaining how it works. >> >> I tried to add more comments. I'm not good at it, so recommendations are welcome. >> >>> Two things come to my mind worth rechecking about 0001. >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and >>> XLogCtl->xlblocks always page-aligned? Because the algorithm seems to be >>> sensitive to that. If so, I would propose to explicitly comment that >>> and add corresponding asserts. >> >> They're certainly page aligned, since they are page borders. >> I added an assert on alignment of InitializeReserved for sanity. >> >>> 2) Check if there are concurrency issues between >>> AdvanceXLInsertBuffer() and switching to the new WAL file. >> >> There are no issues: >> 1. CopyXLogRecordToWAL for isLogSwitch follows the same protocol, i.e. uses >> GetXLogBuffer to zero out the WAL page. >> 2. WALINSERT_SPECIAL_SWITCH forces an exclusive lock on all insertion locks, >> so switching WAL is not concurrent. (Although there is no need for this >> exclusiveness, imho.) > > Good, thank you. I've also revised the commit message and comments. > > But I see another issue with this patch. In the worst case, we do > XLogWrite() by ourselves, and it could potentially error out. > Without the patch, that would cause WALBufMappingLock to be released and > XLogCtl->InitializedUpTo not advanced. With the patch, that would > cause other processes to wait infinitely until we finish the > initialization. > > A possible solution would be to save the position of the page to be > initialized, and set it back to XLogCtl->InitializeReserved on error > (everywhere we do LWLockReleaseAll()). 
We also must check that on > error we only set XLogCtl->InitializeReserved to the past, because > there could be multiple concurrent failures. Also we need to > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. The single place where AdvanceXLInsertBuffer is called outside of a critical section is in XLogBackgroundFlush. All other call stacks will cause a server restart if XLogWrite raises an error inside AdvanceXLInsertBuffer. XLogBackgroundFlush explicitly avoids writing buffers by passing the opportunistic=true parameter. Therefore, an error in XLogWrite will not cause a hang in AdvanceXLInsertBuffer, since the server will shutdown/restart. Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the call to XLogWrite here? ------- regards Yura Sokolov aka funny-falcon
Hi, Yura and Alexander! On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > >>>> Good, thank you. I think 0001 patch is generally good, but needs some > > >>>> further polishing, e.g. more comments explaining how does it work. > > >> > > >> I tried to add more comments. I'm not good at, so recommendations are welcome. > > >> > > >>> Two things comes to my mind worth rechecking about 0001. > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > >>> sensitive to that. If so, I would propose to explicitly comment that > > >>> and add corresponding asserts. > > >> > > >> They're certainly page aligned, since they are page borders. > > >> I added assert on alignment of InitializeReserved for the sanity. > > >> > > >>> 2) Check if there are concurrency issues between > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > >> > > >> There are no issues: > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > >> GetXLogBuffer to zero-out WAL page. > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > >> so switching wal is not concurrent. (Although, there is no need in this > > >> exclusiveness, imho.) > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > But I see another issue with this patch. In the worst case, we do > > > XLogWrite() by ourselves, and it could potentially could error out. 
> > > Without the patch, that would cause WALBufMappingLock to be released and > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > cause other processes to wait infinitely until we finish the > > > initialization. > > > > > > A possible solution would be to save the position of the page to be > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > error we only set XLogCtl->InitializeReserved to the past, because > > > there could be multiple concurrent failures. Also we need to > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > The single place where AdvanceXLInsertBuffer is called outside of a critical > > section is in XLogBackgroundFlush. All other call stacks will cause a server > > restart if XLogWrite raises an error inside AdvanceXLInsertBuffer. > > > > XLogBackgroundFlush explicitly avoids writing buffers by passing > > the opportunistic=true parameter. > > > > Therefore, an error in XLogWrite will not cause a hang in AdvanceXLInsertBuffer > > since the server will shutdown/restart. > > > > Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the call > > to XLogWrite here? > > You're correct. I just reflected this in the next revision of the patch. I've looked at the patchset v6. The overall goal of avoiding locking looks good. Both patches look right to me. The only thing I'm slightly concerned about is whether the patchset could demonstrate performance differences in any real workload. I appreciate it as a beautiful optimization anyway. I have the following proposals to change the code and comments: For patch 0001: - Struct XLBlocks contains just one pg_atomic_uint64 member. Is it still needed as a struct? These changes make up a significant volume of changes to the patch while being a noop. Maybe it was inherited from v1 and is not needed anymore. - Furthermore, when xlblocks became a struct, comments like: > and xlblocks values certainly do. 
mentions of xlblocks values need to be changed to xlblocks.bound. This could be avoided by changing xlblocks back from type XLBlocks * to pg_atomic_uint64 *. - It's worth commenting InitializedUpTo/InitializedUpToCondVar in more detail than: + * It is updated to successfully initialized buffer's identities, perhaps + * waiting on conditional variable bound to buffer. "perhaps waiting" could also be in the style "maybe/even while AAA waits BBB". "lock-free with cooperation with" -> "lock-free accompanied by changes to..." - The comment inside typedef struct XLogCtlData: /* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */ needs to be reverted to /* 1st byte ptr-s + XLOG_BLCKSZ */ - Commented-out code to clean up in the final patch: //ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar); - In AdvanceXLInsertBuffer(), npages is initialised to 0 but is not increased anymore. The block under > if (XLOG_DEBUG && npages > 0) became unreachable. (InvalidXLogRecPtr + 1) is essentially 0+1, and IMO this semantically calls for adding #define FirstValidXLogRecPtr 1 Typo in the commit message: %s/usign/using/g For patch 0002: I think Yura's explanation from above in this thread needs to get a place in the commit message and in a comment to this: > int attempts = 2; Comments around "try another lock next time" could be modified to reflect that we repeat twice. Kind regards, Pavel Borisov Supabase
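For context on 0002's two-attempt scheme that the `int attempts = 2;` comment would document, a hedged sketch (hypothetical names and toy atomic flags — the real patch conditionally acquires LWLocks from the NUM_XLOGINSERT_LOCKS array and then falls back to a blocking acquire):

```c
#include <stdatomic.h>

#define NUM_XLOGINSERT_LOCKS 8	/* must be a power of 2 for the & mask */

/* Toy stand-ins for the insertion locks: 0 = free, 1 = held. */
static _Atomic int locks[NUM_XLOGINSERT_LOCKS];

/*
 * Try up to two different insertion locks conditionally before giving
 * up.  Returns the index of the acquired lock, or -1 (at which point
 * the real code would block on a lock instead).
 */
static int
try_insert_lock(unsigned start)
{
	int			attempts = 2;
	unsigned	idx = start & (NUM_XLOGINSERT_LOCKS - 1);

	while (attempts-- > 0)
	{
		/* conditional acquire: succeed only if the lock was free */
		if (atomic_exchange(&locks[idx], 1) == 0)
			return (int) idx;
		idx = (idx + 1) & (NUM_XLOGINSERT_LOCKS - 1);
	}
	return -1;
}
```

The power-of-2 requirement exists so that the wrap-around can be a cheap bitmask rather than a modulo.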
Hi, Alexander! On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi, Pavel! > > On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > > On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > >>>> Good, thank you. I think 0001 patch is generally good, but needs some > > > > >>>> further polishing, e.g. more comments explaining how does it work. > > > > >> > > > > >> I tried to add more comments. I'm not good at, so recommendations are welcome. > > > > >> > > > > >>> Two things comes to my mind worth rechecking about 0001. > > > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > > > >>> sensitive to that. If so, I would propose to explicitly comment that > > > > >>> and add corresponding asserts. > > > > >> > > > > >> They're certainly page aligned, since they are page borders. > > > > >> I added assert on alignment of InitializeReserved for the sanity. > > > > >> > > > > >>> 2) Check if there are concurrency issues between > > > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > > > >> > > > > >> There are no issues: > > > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > > > >> GetXLogBuffer to zero-out WAL page. > > > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > > > >> so switching wal is not concurrent. (Although, there is no need in this > > > > >> exclusiveness, imho.) > > > > > > > > > > Good, thank you. 
I've also revised commit message and comments. > > > > > > > > > > But I see another issue with this patch. In the worst case, we do > > > > > XLogWrite() by ourselves, and it could potentially could error out. > > > > > Without patch, that would cause WALBufMappingLock be released and > > > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > > > cause other processes infinitely waiting till we finish the > > > > > initialization. > > > > > > > > > > Possible solution would be to save position of the page to be > > > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > > > error we only set XLogCtl->InitializeReserved to the past, because > > > > > there could be multiple concurrent failures. Also we need to > > > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > > > section is in XLogBackgroundFlush. All other call stacks will issue server > > > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > > > opportunistic=true parameter. > > > > > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > > > since server will shutdown/restart. > > > > > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > > > > to XLogWrite here? > > > > > > You're correct. I just reflected this in the next revision of the patch. > > > > I've looked at the patchset v6. > > Oh, sorry, I really did wrong. I've done git format-patch for wrong > local branch for v5 and v6. Patches I've sent for v5 and v6 are > actually the same as my v1. This is really pity. Please, find the > right version of patchset attached. I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it landed in v7. 
Other changes do not concern code behavior. The things from my previous review that still apply to v7: For 0001: Proposed comment change: "lock-free with cooperation with" -> "lock-free accompanied by changes to..." (or another variant) I propose a new define: #define FirstValidXLogRecPtr 1 While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code, that has no semantic meaning, and it's better to avoid using direct arithmetic to derive the meaning of FirstValidXLogRecPtr from InvalidXLogRecPtr. For 0002, both comment proposals from my message applied to v6 apply to v7 as well [1] https://www.postgresql.org/message-id/d6799557-e352-42c8-80cc-ed36e3b8893c%40postgrespro.ru Regards, Pavel Borisov Supabase
Hi! On Fri, Feb 14, 2025 at 4:11 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 14.02.2025 17:09, Yura Sokolov пишет: > > 14.02.2025 13:24, Alexander Korotkov пишет: > >> On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >>> On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>> On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >>>>> On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>>>> > >>>>>> On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>>>>>> 13.02.2025 12:34, Alexander Korotkov пишет: > >>>>>>>> On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>>>>>>>> 08.02.2025 13:07, Alexander Korotkov пишет: > >>>>>>>>>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>>>>>>>>> Good, thank you. I think 0001 patch is generally good, but needs some > >>>>>>>>>>> further polishing, e.g. more comments explaining how does it work. > >>>>>>>>> > >>>>>>>>> I tried to add more comments. I'm not good at, so recommendations are welcome. > >>>>>>>>> > >>>>>>>>>> Two things comes to my mind worth rechecking about 0001. > >>>>>>>>>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > >>>>>>>>>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > >>>>>>>>>> sensitive to that. If so, I would propose to explicitly comment that > >>>>>>>>>> and add corresponding asserts. > >>>>>>>>> > >>>>>>>>> They're certainly page aligned, since they are page borders. > >>>>>>>>> I added assert on alignment of InitializeReserved for the sanity. > >>>>>>>>> > >>>>>>>>>> 2) Check if there are concurrency issues between > >>>>>>>>>> AdvanceXLInsertBuffer() and switching to the new WAL file. > >>>>>>>>> > >>>>>>>>> There are no issues: > >>>>>>>>> 1. 
CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > >>>>>>>>> GetXLogBuffer to zero-out WAL page. > >>>>>>>>> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > >>>>>>>>> so switching wal is not concurrent. (Although, there is no need in this > >>>>>>>>> exclusiveness, imho.) > >>>>>>>> > >>>>>>>> Good, thank you. I've also revised commit message and comments. > >>>>>>>> > >>>>>>>> But I see another issue with this patch. In the worst case, we do > >>>>>>>> XLogWrite() by ourselves, and it could potentially could error out. > >>>>>>>> Without patch, that would cause WALBufMappingLock be released and > >>>>>>>> XLogCtl->InitializedUpTo not advanced. With the patch, that would > >>>>>>>> cause other processes infinitely waiting till we finish the > >>>>>>>> initialization. > >>>>>>>> > >>>>>>>> Possible solution would be to save position of the page to be > >>>>>>>> initialized, and set it back to XLogCtl->InitializeReserved on error > >>>>>>>> (everywhere we do LWLockReleaseAll()). We also must check that on > >>>>>>>> error we only set XLogCtl->InitializeReserved to the past, because > >>>>>>>> there could be multiple concurrent failures. Also we need to > >>>>>>>> broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > >>>>>>> > >>>>>>> The single place where AdvanceXLInsertBuffer is called outside of critical > >>>>>>> section is in XLogBackgroundFlush. All other call stacks will issue server > >>>>>>> restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > >>>>>>> > >>>>>>> XLogBackgroundFlush explicitely avoids writing buffers by passing > >>>>>>> opportunistic=true parameter. > >>>>>>> > >>>>>>> Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > >>>>>>> since server will shutdown/restart. > >>>>>>> > >>>>>>> Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > >>>>>>> to XLogWrite here? > >>>>>> > >>>>>> You're correct. 
I just reflected this in the next revision of the patch. > >>>>> > >>>>> I've looked at the patchset v6. > >>>> > >>>> Oh, sorry, I really did it wrong. I've done git format-patch for the wrong > >>>> local branch for v5 and v6. The patches I've sent for v5 and v6 are > >>>> actually the same as my v1. This is a real pity. Please find the > >>>> right version of the patchset attached. > >>> > >>> I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it > >>> landed in v7. > >>> > >>> Other changes do not concern code behavior. The things from my > >>> previous review that still apply to v7: > >>> > >>> For 0001: > >>> > >>> Proposed comment change: > >>> "lock-free with cooperation with" -> "lock-free accompanied by changes > >>> to..." (or another variant) > >> > >> Good catch. I've rephrased this comment even more. > >> > >>> I propose a new define: > >>> #define FirstValidXLogRecPtr 1 > >>> While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code, > >>> that has no semantic meaning, and it's better to avoid using direct > >>> arithmetic to derive the meaning of FirstValidXLogRecPtr from > >>> InvalidXLogRecPtr. > >> > >> Makes sense, but I'm not sure if this change is required at all. I've > >> reverted this to the state of master, and everything seems to work. > >> > >>> For 0002, both comment proposals from my message applied to v6 apply > >>> to v7 as well > >> > >> Thank you for pointing that out. For now, I'm concentrating on improvements to > >> 0001. Probably Yura could work on your notes for 0002. > > > > I wrote a good commit message for 0002 with calculated probabilities and > > a simple Ruby program which calculates them, to explain the choice of 2 > > conditional attempts. (At least I hope the message is good.) And added > > a simple comment before `int attempts = 2;` > > > > Also I simplified 0002 a bit to look prettier (i.e. without goto), and > > added a static assert on NUM_XLOGINSERT_LOCKS being a power of 2. 
> > > > (0001 patch is the same as for v8) > Oops, forgot to add the StaticAssert into v9-0002. Thank you. I'm planning to push 0001 if there are no objections. And I'm planning to do more review/revision of 0002. ------ Regards, Alexander Korotkov Supabase
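Yura's Ruby program itself isn't quoted in the thread; as a rough stand-in for the probability math behind "2 conditional attempts" (my assumption: two independent uniform probes, which ignores the sequential probing the patch actually does, so this is only an approximation):

```c
/*
 * Back-of-the-envelope model (an assumption, not the thread's actual
 * Ruby program): with n_locks insertion locks of which n_held are
 * currently busy, a single conditional acquire fails with probability
 * n_held/n_locks, and two independent attempts both fail with roughly
 * the square of that.  Adding a second attempt thus sharply reduces
 * the chance of falling back to a blocking acquire.
 */
static double
p_both_attempts_fail(int n_locks, int n_held)
{
	double		p = (double) n_held / (double) n_locks;

	return p * p;
}
```

For example, with half of 8 locks busy, one attempt fails 50% of the time but two attempts both fail only about 25% of the time.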
Hi! I spotted a typo in v10: + /* + * Page at nextidx wasn't initialized yet, so we cann't move + * InitializedUpto further. It will be moved by backend which + * will initialize nextidx. + */ cann't - > can't moved by backend -> moved by the backend -- Best regards, Kirill Reshke
Hey.
I find the “Get rid of WALBufMappingLock” commit message misleading, 'cos the lock is actually being replaced by a CV.
Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead?
Victor Yegorov
Hi, Victor! On Mon, 17 Feb 2025 at 12:47, Victor Yegorov <vyegorov@gmail.com> wrote: > > Hey. > > I find “Get rid of WALBufMappingLock" commit message misleading, 'cos Lock it's being replaced by CV, actually. > > Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead? The patch replaces WALBufMappingLock with a lockless algorithm based on atomic variables and a CV. Mentioning only the CV in the header covers only a part of the implementation. Also, the header should reflect what is done on the whole rather than the implementation details. So I'd rather see a header like "Replace WALBufMappingLock by lockless algorithm" or "Initialize WAL buffers concurrently without using WALBufMappingLock" or something like that. Kind regards, Pavel Borisov Supabase
On Mon, 17 Feb 2025 at 13:20, Pavel Borisov <pashkin.elfe@gmail.com> wrote: > > Hi, Victor! > > On Mon, 17 Feb 2025 at 12:47, Victor Yegorov <vyegorov@gmail.com> wrote: > > > > Hey. > > > > I find “Get rid of WALBufMappingLock" commit message misleading, 'cos Lock it's being replaced by CV, actually. > > > > Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead? > > The patch replaces WALBufMappingLock with a lockless algorithm based > on atomic variables and a CV. Mentioning only the CV in the header covers only a > part of the implementation. Also, the header should reflect what is > done on the whole rather than the implementation details. So I'd rather see > a header like "Replace WALBufMappingLock by lockless algorithm" or > "Initialize WAL buffers concurrently without using WALBufMappingLock" > or something like that. Update: I see the patch is already committed, so we're late with the naming proposals. I don't see a problem with the existing commit message, TBH. Kind regards, Pavel Borisov
On Mon, Feb 17, 2025 at 11:44 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > Oops, I sent the wrong patch as a fix. > The right one is attached. > > Pavel I've spotted a failure on the buildfarm. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03 I can't quickly guess the reason. I'm going to revert the patch for now; then we can investigate. ------ Regards, Alexander Korotkov Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes: > I've spotted the failure on the buildfarm. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03 > I can't quickly guess the reason. I'm going to revert patch for now, > then we investigate This timeout failure on hachi looks suspicious as well: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 Might be relevant that they are both aarch64? regards, tom lane
On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote: > > This timeout failure on hachi looks suspicious as well: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 > > > > Might be relevant that they are both aarch64? > > Just logged into the host. The logs of the timed out run are still > around, and the last information I can see is from lastcommand.log, > which seems to have frozen in time when the timeout has begun its > vacuuming work: > ok 73 + index_including_gist 353 ms > # parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse > > gokiburi is on the same host, and it is currently frozen in time when > trying to fetch a WAL buffer. One of the stack traces: > #2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0, > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at > latch.c:1571 > #3 WaitEventSetWait (set=0xd34ce0, timeout=-1, > occurred_events=occurred_events@entry=0xffffffffadd8, > nevents=nevents@entry=1, wait_event_info=<optimized out>, > wait_event_info@entry=134217781) at latch.c:1519 > #4 0x000000000084e964 in WaitLatch (latch=<optimized out>, > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, > wait_event_info=wait_event_info@entry=134217781) at latch.c:538 > #5 0x000000000085d2f8 in ConditionVariableTimedSleep > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at > condition_variable.c:163 > #6 0x000000000085d1ec in ConditionVariableSleep > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98 > #7 0x000000000055f4f4 in AdvanceXLInsertBuffer > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at > xlog.c:2224 > #8 0x0000000000568398 in GetXLogBuffer (ptr=ptr@entry=112064880, > tli=tli@entry=1) at xlog.c:1710 > #9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80, > isLogSwitch=false, rdata=0xcc49b0 
<hdr_rdt>, StartPos=<optimized out>, > EndPos=<optimized out>, tli=1) at xlog.c:1245 > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>, > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928 > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020', > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523 > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950, > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>, > parent=<optimized out>, > > So, yes, something looks really wrong with this patch. Sounds > plausible to me that some other buildfarm animals could be stuck > without their owners knowing about it. It's proving to be a good idea > to force a timeout value in the configuration file of these animals.. Tom, Michael, thank you for the information. This patch will be better tested before next attempt. ------ Regards, Alexander Korotkov Supabase
On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote: > > > This timeout failure on hachi looks suspicious as well: > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 > > > > > > Might be relevant that they are both aarch64? > > > > Just logged into the host. The logs of the timed out run are still > > around, and the last information I can see is from lastcommand.log, > > which seems to have frozen in time when the timeout has begun its > > vacuuming work: > > ok 73 + index_including_gist 353 ms > > # parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse > > > > gokiburi is on the same host, and it is currently frozen in time when > > trying to fetch a WAL buffer. One of the stack traces: > > #2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0, > > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at > > latch.c:1571 > > #3 WaitEventSetWait (set=0xd34ce0, timeout=-1, > > occurred_events=occurred_events@entry=0xffffffffadd8, > > nevents=nevents@entry=1, wait_event_info=<optimized out>, > > wait_event_info@entry=134217781) at latch.c:1519 > > #4 0x000000000084e964 in WaitLatch (latch=<optimized out>, > > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, > > wait_event_info=wait_event_info@entry=134217781) at latch.c:538 > > #5 0x000000000085d2f8 in ConditionVariableTimedSleep > > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at > > condition_variable.c:163 > > #6 0x000000000085d1ec in ConditionVariableSleep > > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98 > > #7 0x000000000055f4f4 in AdvanceXLInsertBuffer > > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at > > xlog.c:2224 > > #8 0x0000000000568398 in GetXLogBuffer 
(ptr=ptr@entry=112064880, > > tli=tli@entry=1) at xlog.c:1710 > > #9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80, > > isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>, > > EndPos=<optimized out>, tli=1) at xlog.c:1245 > > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>, > > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized > > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928 > > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020', > > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523 > > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950, > > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>, > > parent=<optimized out>, > > > > So, yes, something looks really wrong with this patch. Sounds > > plausible to me that some other buildfarm animals could be stuck > > without their owners knowing about it. It's proving to be a good idea > > to force a timeout value in the configuration file of these animals.. > > Tom, Michael, thank you for the information. > This patch will be better tested before the next attempt. It seems that I managed to reproduce the issue on my Raspberry PI 4. After running our test suite in a loop for 2 days I found one timeout. I have a hypothesis on why it might happen. We don't have protection against two backends in parallel getting ReservedPtr mapped to a single XLog buffer. I've talked to Yura off-list about that. He pointed out that XLogWrite() should issue a PANIC in that case, which we didn't observe. However, I'm not sure this analysis is complete. One way or another, we need protection against this situation anyway. The updated patch is attached. Now, after acquiring ReservedPtr, it waits till OldPageRqstPtr gets initialized. Additionally, I had to implement a more accurate calculation of OldPageRqstPtr. I'm running tests with the new patch on my Raspberry in a loop. Let's see how it goes. ------ Regards, Alexander Korotkov Supabase
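The reservation step under discussion can be sketched as follows (a hypothetical simplification: claiming pages by atomic fetch-add guarantees unique reservations; the separate fix described above — waiting until OldPageRqstPtr's buffer is written out before reusing it — is not shown):

```c
#include <stdatomic.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192

/* Hypothetical stand-in for XLogCtl->InitializeReserved. */
static _Atomic uint64_t InitializeReserved;

/*
 * Claim the start LSN of the next WAL page to initialize.  fetch_add
 * hands out each page to exactly one backend, so two backends can never
 * reserve the same page; what still must be guarded against is two
 * reservations mapping to the same buffer slot before the old page
 * there has been written out.
 */
static uint64_t
reserve_next_page(void)
{
	return atomic_fetch_add(&InitializeReserved, (uint64_t) XLOG_BLCKSZ);
}
```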
Attachment
On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: > It seems that I managed to reproduce the issue on my Raspberry PI 4. > After running our test suite in a loop for 2 days I found one timeout. Hmm. It's surprising to not see a higher occurrence. My buildfarm host has caught that on its first run after the patch, for two different animals which are both on the same machine. > One way or another, we need protection against this situation any way. > The updated patch is attached. Now, after acquiring ReservedPtr it > waits till OldPageRqstPtr gets initialized. Additionally I've to > implement more accurate calculation of OldPageRqstPtr. I run tests > with new patch on my Raspberry in a loop. Let's see how it goes. Perhaps you'd prefer that I do more tests with your patch? This is time-consuming for you. This is not a review of the internals of the patch, and I cannot give you access to the host, but if my stuff is the only place where we have good reproducibility of the issue, I'm OK to grab some time and run a couple of checks to avoid another freeze of the buildfarm. -- Michael
Attachment
> On 25 Feb 2025, at 20:19, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi! One little piece of code looks suspicious to me. But I was not raising a concern because I see similar code everywhere in the codebase. But now Kirill asked me to explain what is going on, and I cannot. This seems to be relevant… so. + while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) // Assume ConditionVariableBroadcast() happened here, but before next line + ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); + ConditionVariableCancelSleep(); Won’t this sleep wait forever? I see about 20 other occurrences of similar code, so, perhaps, everything is fine. But I would greatly appreciate a few pointers on why it works. Best regards, Andrey Borodin.
Hi, Michael! On Wed, Feb 26, 2025 at 3:04 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: > > It seems that I managed to reproduce the issue on my Raspberry PI 4. > > After running our test suite in a loop for 2 days I found one timeout. > > Hmm. It's surprising to not see a higher occurence. My buildfarm > host has caught that on its first run after the patch, for two > different animals which are both on the same machine. > > > One way or another, we need protection against this situation any way. > > The updated patch is attached. Now, after acquiring ReservedPtr it > > waits till OldPageRqstPtr gets initialized. Additionally I've to > > implement more accurate calculation of OldPageRqstPtr. I run tests > > with new patch on my Raspberry in a loop. Let's see how it goes. > > Perhaps you'd prefer that I do more tests with your patch? This is > time-consuming for you. This is not a review of the internals of the > patch, and I cannot give you access to the host, but if my stuff is > the only place where we have a good reproducibility of the issue, I'm > OK to grab some time and run a couple of checks to avoid again a > freeze of the buildfarm. Thank you for offering to help. An updated version of the patch is attached (I've added one memory barrier there just in case). I would appreciate it if you could run it on batta, hachi, or similar hardware. ------ Regards, Alexander Korotkov Supabase
Attachment
On Wed, Feb 26, 2025 at 01:48:47PM +0200, Alexander Korotkov wrote: > Thank you for offering the help. Updated version of patch is attached > (I've added one memory barrier there just in case). I would > appreciate if you could run it on batta, hachi or similar hardware. Doing a revert of the revert done in 3fb58625d18f proves that reproducing the error is not really difficult. I've done a make installcheck-world USE_MODULE_DB=1 -j N without assertions, and the point that saw a failure quickly is one of the tests of pgbench: PANIC: could not find WAL buffer for 0/19D366 This one happened for the test "concurrent OID generation" and CREATE TYPE. Of course, as it is a race condition, it is random, but it's taking me only a couple of minutes to see the original issue on my buildfarm host. With assertion failures enabled, same story, and same failure from the pgbench TAP test. Saying that, I have also done similar tests with your v12 for a couple of hours and this looks stable under installcheck-world. I can see that you've reworked the surroundings of InitializedFrom quite a bit in this one. If you apply that once again at some point, the buildfarm will be the judge in the long term, but I am rather confident in saying that the situation looks better here, at least. One thing I would consider doing if you want to gain confidence is tests like the one I saw causing problems with pgbench, with DDL patterns stressing specific paths like this CREATE TYPE case. -- Michael
Attachment
26.02.2025 11:52, Andrey Borodin wrote: >> On 25 Feb 2025, at 20:19, Alexander Korotkov <aekorotkov@gmail.com> wrote: >> >> > > > Hi! > > One little piece of code looks suspicious to me. But I was not raising a concern because I see similar code everywhere in the codebase. But now Kirill asked me to explain what is going on and I cannot. > > This seems to be relevant… so. > > + while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) > // Assume ConditionVariableBroadcast() happened here, but before next line > + ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); > + ConditionVariableCancelSleep(); > > Won’t this sleep wait forever? Because ConditionVariableSleep() doesn't sleep the first time it is called. It just performs ConditionVariablePrepareToSleep() and immediately returns. So the actual condition of the `while` loop is checked at least twice before going to sleep. > > I see about 20 other occurrences of similar code, so, perhaps, everything is fine. But I would greatly appreciate a few pointers on why it works. ------- regards Yura Sokolov aka funny-falcon
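The first-call behavior described above can be modeled outside PostgreSQL as follows. This is a hypothetical sketch in plain C, not the actual condition_variable.c code; the names CVModel, cv_sleep_model and wait_for_init are illustrative. It shows why the wait condition is re-tested before any real sleep can happen, which closes the race Andrey asks about:

```c
#include <stdbool.h>

/* Hypothetical model of the protocol behind ConditionVariableSleep():
 * the first call only registers the caller on the wait list
 * (ConditionVariablePrepareToSleep) and returns without sleeping, so
 * the caller re-tests its wait condition before any real sleep. */
typedef struct
{
	bool	prepared;	/* has PrepareToSleep already run? */
	int		sleeps;		/* how many real sleeps happened */
} CVModel;

static void
cv_sleep_model(CVModel *cv)
{
	if (!cv->prepared)
	{
		cv->prepared = true;	/* ConditionVariablePrepareToSleep() */
		return;					/* no sleep on the first call */
	}
	cv->sleeps++;				/* a real WaitLatch() would block here */
}

/* Caller pattern from the patch:
 * while (upto >= InitializedUpTo) ConditionVariableSleep(...); */
int
wait_for_init(CVModel *cv, long *initialized_up_to, long upto)
{
	int		iterations = 0;

	while (upto >= *initialized_up_to)	/* condition re-checked each pass */
	{
		iterations++;
		cv_sleep_model(cv);
		if (cv->sleeps > 0)
			*initialized_up_to = upto + 1;	/* simulate the broadcast */
	}
	return iterations;
}
```

A broadcast that fires between the condition check and the first "sleep" call is therefore not lost: the first call only registers the waiter, and the loop checks the condition again before blocking for real.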
26.02.2025 14:48, Alexander Korotkov пишет: > Hi, Michael! > > On Wed, Feb 26, 2025 at 3:04 AM Michael Paquier <michael@paquier.xyz> wrote: >> >> On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: >>> It seems that I managed to reproduce the issue on my Raspberry PI 4. >>> After running our test suite in a loop for 2 days I found one timeout. >> >> Hmm. It's surprising to not see a higher occurence. My buildfarm >> host has caught that on its first run after the patch, for two >> different animals which are both on the same machine. >> >>> One way or another, we need protection against this situation any way. >>> The updated patch is attached. Now, after acquiring ReservedPtr it >>> waits till OldPageRqstPtr gets initialized. Additionally I've to >>> implement more accurate calculation of OldPageRqstPtr. I run tests >>> with new patch on my Raspberry in a loop. Let's see how it goes. >> >> Perhaps you'd prefer that I do more tests with your patch? This is >> time-consuming for you. This is not a review of the internals of the >> patch, and I cannot give you access to the host, but if my stuff is >> the only place where we have a good reproducibility of the issue, I'm >> OK to grab some time and run a couple of checks to avoid again a >> freeze of the buildfarm. > > Thank you for offering the help. Updated version of patch is attached > (I've added one memory barrier there just in case). I would > appreciate if you could run it on batta, hachi or similar hardware. Good day, Alexander. Checked your additions to patch. They're clear and robust. ------- regards Yura Sokolov aka funny-falcon
On 2025-Feb-28, Michael Paquier wrote: > Saying that, I have also done similar tests with your v12 for a couple > of hours and this looks stable under installcheck-world. I can see > that you've reworked quite a bit the surroundings of InitializedFrom > in this one. If you apply that once again at some point, the > buildfarm will be judge in the long-term, but I am rather confident by > saying that the situation looks better here, at least. Heh, no amount of testing can prove lack of bugs; but for sure "it looks different now, so it must be correct" must be the weakest proof of correctness I've heard of! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) https://postgr.es/m/200606261359.k5QDxE2p004593@auth-smtp.hol.gr
On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote: > On 2025-Feb-28, Michael Paquier wrote: >> Saying that, I have also done similar tests with your v12 for a couple >> of hours and this looks stable under installcheck-world. I can see >> that you've reworked quite a bit the surroundings of InitializedFrom >> in this one. If you apply that once again at some point, the >> buildfarm will be judge in the long-term, but I am rather confident by >> saying that the situation looks better here, at least. > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > different now, so it must be correct" must be the weakest proof of > correctness I've heard of! Err, okay. I did use the word "stable" with tests rather than "correct", and I implied upthread that I did not check the correctness nor the internals of the patch. If my words held the meaning you are implying, well, my apologies for the confusion, I guess. I only tested the patch and it was stable while I've noticed a few diffs with the previous version, but I did *not* check its internals at all, nor do I mean that I endorse its logic. I hope that's clear now. -- Michael
Attachment
On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2025-Feb-28, Michael Paquier wrote: > > > Saying that, I have also done similar tests with your v12 for a couple > > of hours and this looks stable under installcheck-world. I can see > > that you've reworked quite a bit the surroundings of InitializedFrom > > in this one. If you apply that once again at some point, the > > buildfarm will be judge in the long-term, but I am rather confident by > > saying that the situation looks better here, at least. > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > different now, so it must be correct" must be the weakest proof of > correctness I've heard of! Michael just volunteered to help Yura and me with testing. He wasn't meant to be a reviewer. And he reported that the tests look much more stable now. I think he is absolutely right about this. ------ Regards, Alexander Korotkov Supabase
On Fri, Feb 28, 2025 at 3:44 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote: > > On 2025-Feb-28, Michael Paquier wrote: > >> Saying that, I have also done similar tests with your v12 for a couple > >> of hours and this looks stable under installcheck-world. I can see > >> that you've reworked quite a bit the surroundings of InitializedFrom > >> in this one. If you apply that once again at some point, the > >> buildfarm will be judge in the long-term, but I am rather confident by > >> saying that the situation looks better here, at least. > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > different now, so it must be correct" must be the weakest proof of > > correctness I've heard of! > > Err, okay. I did use the word "stable" with tests rather than > "correct", and I implied upthread that I did not check the correctness > nor the internals of the patch. If my words held the meaning you > are implying, well, my apologies for the confusion, I guess. I only > tested the patch and it was stable while I've noticed a few diffs with > the previous version, but I did *not* check its internals at all, nor > do I mean that I endorse its logic. I hope that's clear now. Got it. Michael, thank you very much for your help. ------ Regards, Alexander Korotkov Supabase
On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2025-Feb-28, Michael Paquier wrote: > > > > > Saying that, I have also done similar tests with your v12 for a couple > > > of hours and this looks stable under installcheck-world. I can see > > > that you've reworked quite a bit the surroundings of InitializedFrom > > > in this one. If you apply that once again at some point, the > > > buildfarm will be judge in the long-term, but I am rather confident by > > > saying that the situation looks better here, at least. > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > different now, so it must be correct" must be the weakest proof of > > correctness I've heard of! > > Michael just volunteered to help Yura and me with testing. He wan't > intended to be reviewer. And he reported that tests looks much more > stable now. I think he is absolutely correct with this. Nevertheless, I don't think the bug is gone in v12. I managed to reproduce it on my local Raspberry Pi 4. The attached version of the patch fixes the bug for me. It adds memory barriers surrounding pg_atomic_compare_exchange_u64(). That is certainly not right, given that this function should already provide full memory barrier semantics. But my investigation shows it doesn't. I'm going to start a separate thread about this. Also, the new version of the patch contains a fix for a potential integer overflow during the OldPageRqstPtr computation, sent to me off-list by Yura. ------ Regards, Alexander Korotkov Supabase
Attachment
On Fri, Mar 7, 2025 at 5:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > On 2025-Feb-28, Michael Paquier wrote: > > > > > > > Saying that, I have also done similar tests with your v12 for a couple > > > > of hours and this looks stable under installcheck-world. I can see > > > > that you've reworked quite a bit the surroundings of InitializedFrom > > > > in this one. If you apply that once again at some point, the > > > > buildfarm will be judge in the long-term, but I am rather confident by > > > > saying that the situation looks better here, at least. > > > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > > different now, so it must be correct" must be the weakest proof of > > > correctness I've heard of! > > > > Michael just volunteered to help Yura and me with testing. He wan't > > intended to be reviewer. And he reported that tests looks much more > > stable now. I think he is absolutely correct with this. > > Nevertheless, I don't think the bug has gone in v12. I managed to > reproduce it on my local Raspberry PI 4. The attached version of > patch fixes the bug for me. It adds memory barriers surrounding > pg_atomic_compare_exchange_u64(). That certainly not right given this > function should already provide full memory barrier semantics. But my > investigation shows it doesn't. I'm going to start a separate thread > about this. > > Also, new version of patch contains fix of potential integer overflow > during OldPageRqstPtr computation sent off-list my me by Yura. So, as we finally clarified, CAS doesn't guarantee a full memory barrier on failure. Also, it's not clear when barriers are guaranteed on success. On ARM without an LSE implementation, a read barrier is provided before the change of value and a write barrier after the change of value. 
So, it appears that both explicit barriers I've added are required. This revision also comes with a formal proof of the algorithm. ------ Regards, Alexander Korotkov Supabase
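The ordering subtlety above can be illustrated with C11 atomics, where the failure path of a compare-exchange may legally use a weaker memory order. The following is a hedged sketch of the barrier placement only, not the actual xlog.c code: pg_atomic_compare_exchange_u64() is modeled with stdatomic, and the function name is illustrative. Using relaxed orders for both paths and explicit fences makes the required barriers visible instead of relying on the CAS itself:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of advancing a shared "initialized up to" position.  A failed
 * compare-exchange need not act as a full barrier (and on success the
 * ordering relative to surrounding loads/stores can be weaker than a
 * full barrier on some architectures), hence the explicit fences that
 * model the barriers added in the patch. */
bool
advance_initialized_up_to(_Atomic uint64_t *pos, uint64_t oldval, uint64_t newval)
{
	bool	ok;

	atomic_thread_fence(memory_order_seq_cst);	/* barrier before the CAS */
	ok = atomic_compare_exchange_strong_explicit(pos, &oldval, newval,
												 memory_order_relaxed,	/* success */
												 memory_order_relaxed);	/* failure */
	atomic_thread_fence(memory_order_seq_cst);	/* barrier after the CAS */
	return ok;
}
```

With the fences pulled out of the CAS like this, the "barrier on failure" question disappears: both outcomes are bracketed by full fences.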
Attachment
Hi, I've briefly looked at this patch this week, and done a bit of testing. I don't have any comments about the correctness - it does seem correct to me and I haven't noticed any crashes/issues, but I'm not familiar with the WALBufMappingLock enough to have insightful opinions. I have however decided to do a bit of benchmarking, to better understand the possible benefits of the change. I happen to have access to an Azure machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that can do ~1.5GB/s. The benchmark script (attached) uses the workload mentioned by Andres some time ago [1] SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated results look like this (this is throughput): | 8 | 64 | 1024 clients | master patched | master patched | master patched --------------------------------------------------------------------- 1 | 11864 12035 | 7419 7345 | 968 940 4 | 26311 26919 | 12414 12308 | 1304 1293 8 | 38742 39651 | 14316 14539 | 1348 1348 16 | 57299 59917 | 15405 15871 | 1304 1279 32 | 74857 82598 | 17589 17126 | 1233 1233 48 | 87596 95495 | 18616 18160 | 1199 1227 64 | 89982 97715 | 19033 18910 | 1196 1221 96 | 92853 103448 | 19694 19706 | 1190 1210 128 | 95392 103324 | 20085 19873 | 1188 1213 160 | 94933 102236 | 20227 20323 | 1180 1214 196 | 95933 103341 | 20448 20513 | 1188 1199 To put this into a perspective, this throughput relative to master: clients | 8 64 1024 ---------------------------------- 1 | 101% 99% 97% 4 | 102% 99% 99% 8 | 102% 102% 100% 16 | 105% 103% 98% 32 | 110% 97% 100% 48 | 109% 98% 102% 64 | 109% 99% 102% 96 | 111% 100% 102% 128 | 108% 99% 102% 160 | 108% 100% 103% 196 | 108% 100% 101% That does not seem like a huge improvement :-( Yes, there's 1-10% speedup for the small (8K) size, but for larger chunks it's a wash. Looking at the pgbench progress, I noticed stuff like this: ... 
progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed ... i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny things (it's a virtual SSD, I'm not sure what model is that behind the curtains). So I decided to try running the benchmark on tmpfs, to get the storage out of the way and get the "best case" results. 
This makes the pgbench progress perfectly "smooth" (no jumps like in the output above), and the comparison looks like this: | 8 | 64 | 1024 clients | master patched | master patched | master patched ---------|---------------------|--------------------|---------------- 1 | 32449 32032 | 19289 20344 | 3108 3081 4 | 68779 69256 | 24585 29912 | 2915 3449 8 | 79787 100655 | 28217 39217 | 3182 4086 16 | 113024 148968 | 42969 62083 | 5134 5712 32 | 125884 170678 | 44256 71183 | 4910 5447 48 | 125571 166695 | 44693 76411 | 4717 5215 64 | 122096 160470 | 42749 83754 | 4631 5103 96 | 120170 154145 | 42696 86529 | 4556 5020 128 | 119204 152977 | 40880 88163 | 4529 5047 160 | 116081 152708 | 42263 88066 | 4512 5000 196 | 115364 152455 | 40765 88602 | 4505 4952 and the comparison to master: clients 8 64 1024 ----------------------------------------- 1 99% 105% 99% 4 101% 122% 118% 8 126% 139% 128% 16 132% 144% 111% 32 136% 161% 111% 48 133% 171% 111% 64 131% 196% 110% 96 128% 203% 110% 128 128% 216% 111% 160 132% 208% 111% 196 132% 217% 110% Yes, with tmpfs the impact looks much more significant. For 8K the speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. That being said, I wonder how big is the impact for practical workloads. ISTM this workload is pretty narrow / extreme, it'd be much easier if we had an example of a more realistic workload, benefiting from this. Of course, it may be the case that there are multiple related bottlenecks, and we'd need to fix all of them - in which case it'd be silly to block the improvements on the grounds that it alone does not help. Another thought is that this is testing the "good case". Can anyone think of a workload that would be made worse by the patch? regards -- Tomas Vondra
Attachment
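The workload above (pg_logical_emit_message() driven by pgbench) can be reproduced with a driver script along these lines. This is a hedged sketch, not the attached script: the file name, duration, database name, and the guard on pgbench availability are illustrative choices.

```shell
#!/bin/sh
set -e
# Emit WAL records of a given payload size from N concurrent clients.
SIZE=${SIZE:-8192}          # payload bytes: 8192, 65536 or 1048576
CLIENTS=${CLIENTS:-64}

cat > emit.sql <<EOF
SELECT pg_logical_emit_message(true, 'test', repeat('0', ${SIZE}));
EOF

# -n skips vacuum (no pgbench tables needed), -P 1 prints the
# per-second progress lines shown above.
if command -v pgbench >/dev/null 2>&1; then
    pgbench -n -f emit.sql -c "$CLIENTS" -j "$CLIENTS" -T 30 -P 1 postgres
fi
```

Sweeping SIZE over 8192/65536/1048576 and CLIENTS over 1..196 would reproduce the grid of results quoted above.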
Good day, Tomas 14.03.2025 17:30, Tomas Vondra wrote: > > Yes, with tmpfs the impact looks much more significant. For 8K the > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. > > > That being said, I wonder how big is the impact for practical workloads. > ISTM this workload is pretty narrow / extreme, it'd be much easier if we > had an example of a more realistic workload, benefiting from this. Of > course, it may be the case that there are multiple related bottlenecks, > and we'd need to fix all of them - in which case it'd be silly to block > the improvements on the grounds that it alone does not help. Yes, I found this bottleneck when I did experiments with increasing NUM_XLOGINSERT_LOCKS [1]. For this patch to be more valuable, there should be more parallel xlog inserters. That is why I initially paired this patch with a patch that reduces contention on WALInsertLocks ("0002-Several attempts to lock WALInsertLock", last version at [2]). Certainly, the largest bottleneck is WALWriteLock around writing buffers and especially fsync-ing them after the write. But this intermediate bottleneck of WALBufMappingLock is also worth removing. [1] https://postgr.es/m/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru [2] https://postgr.es/m/c31158a3-7c26-4b26-90df-2df8f7bbe736%40postgrespro.ru ------- regards Yura Sokolov aka funny-falcon
Good day, 14.03.2025 17:30, Tomas Vondra wrote: > Hi, > > I've briefly looked at this patch this week, and done a bit of testing. > I don't have any comments about the correctness - it does seem correct > to me and I haven't noticed any crashes/issues, but I'm not familiar > with the WALBufMappingLock enough to have insightful opinions. > > I have however decided to do a bit of benchmarking, to better understand > the possible benefits of the change. I happen to have access to an Azure > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that > can do ~1.5GB/s. > > The benchmark script (attached) uses the workload mentioned by Andres > some time ago [1] > > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); > > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated > results look like this (this is throughput): > > | 8 | 64 | 1024 > clients | master patched | master patched | master patched > --------------------------------------------------------------------- > 1 | 11864 12035 | 7419 7345 | 968 940 > 4 | 26311 26919 | 12414 12308 | 1304 1293 > 8 | 38742 39651 | 14316 14539 | 1348 1348 > 16 | 57299 59917 | 15405 15871 | 1304 1279 > 32 | 74857 82598 | 17589 17126 | 1233 1233 > 48 | 87596 95495 | 18616 18160 | 1199 1227 > 64 | 89982 97715 | 19033 18910 | 1196 1221 > 96 | 92853 103448 | 19694 19706 | 1190 1210 > 128 | 95392 103324 | 20085 19873 | 1188 1213 > 160 | 94933 102236 | 20227 20323 | 1180 1214 > 196 | 95933 103341 | 20448 20513 | 1188 1199 > > To put this into a perspective, this throughput relative to master: > > clients | 8 64 1024 > ---------------------------------- > 1 | 101% 99% 97% > 4 | 102% 99% 99% > 8 | 102% 102% 100% > 16 | 105% 103% 98% > 32 | 110% 97% 100% > 48 | 109% 98% 102% > 64 | 109% 99% 102% > 96 | 111% 100% 102% > 128 | 108% 99% 102% > 160 | 108% 100% 103% > 196 | 108% 100% 101% > > That does not seem like a huge improvement :-( Yes, there's 1-10% > speedup for the small (8K) size, but for larger 
chunks it's a wash. > > Looking at the pgbench progress, I noticed stuff like this: > > ... > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed > ... > > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny > things (it's a virtual SSD, I'm not sure what model is that behind the > curtains). So I decided to try running the benchmark on tmpfs, to get > the storage out of the way and get the "best case" results. 
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the > output above), and the comparison looks like this: > > | 8 | 64 | 1024 > clients | master patched | master patched | master patched > ---------|---------------------|--------------------|---------------- > 1 | 32449 32032 | 19289 20344 | 3108 3081 > 4 | 68779 69256 | 24585 29912 | 2915 3449 > 8 | 79787 100655 | 28217 39217 | 3182 4086 > 16 | 113024 148968 | 42969 62083 | 5134 5712 > 32 | 125884 170678 | 44256 71183 | 4910 5447 > 48 | 125571 166695 | 44693 76411 | 4717 5215 > 64 | 122096 160470 | 42749 83754 | 4631 5103 > 96 | 120170 154145 | 42696 86529 | 4556 5020 > 128 | 119204 152977 | 40880 88163 | 4529 5047 > 160 | 116081 152708 | 42263 88066 | 4512 5000 > 196 | 115364 152455 | 40765 88602 | 4505 4952 > > and the comparison to master: > > clients 8 64 1024 > ----------------------------------------- > 1 99% 105% 99% > 4 101% 122% 118% > 8 126% 139% 128% > 16 132% 144% 111% > 32 136% 161% 111% > 48 133% 171% 111% > 64 131% 196% 110% > 96 128% 203% 110% > 128 128% 216% 111% > 160 132% 208% 111% > 196 132% 217% 110% > > Yes, with tmpfs the impact looks much more significant. For 8K the > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. > > > That being said, I wonder how big is the impact for practical workloads. > ISTM this workload is pretty narrow / extreme, it'd be much easier if we > had an example of a more realistic workload, benefiting from this. Of > course, it may be the case that there are multiple related bottlenecks, > and we'd need to fix all of them - in which case it'd be silly to block > the improvements on the grounds that it alone does not help. > > Another thought is that this is testing the "good case". Can anyone > think of a workload that would be made worse by the patch? I've made similar benchmark on system with two Xeon Gold 5220R with two Samsung SSD 970 PRO 1TB mirrored by md. 
Configuration changes:

wal_sync_method = open_datasync
full_page_writes = off
synchronous_commit = off
checkpoint_timeout = 1d
max_connections = 1000
max_wal_size = 4GB
min_wal_size = 640MB

I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB and 1GB) and record size (1kB, 8kB and 64kB). (I didn't bench 1MB record size, since I don't believe it is critical for performance.)

Here are the results for 64MB segment size and 1GB wal_buffers:

+---------+---------+------------+--------------+----------+
| recsize | clients | master_tps | nowalbuf_tps | rel_perf |
+---------+---------+------------+--------------+----------+
|       1 |       1 |    47991.0 |      46995.0 |     0.98 |
|       1 |       4 |   171930.0 |     171166.0 |      1.0 |
|       1 |      16 |   491240.0 |     485132.0 |     0.99 |
|       1 |      64 |   514590.0 |     515534.0 |      1.0 |
|       1 |     128 |   547222.0 |     543543.0 |     0.99 |
|       1 |     256 |   543353.0 |     540802.0 |      1.0 |
|       8 |       1 |    40976.0 |      41603.0 |     1.02 |
|       8 |       4 |    89003.0 |      92008.0 |     1.03 |
|       8 |      16 |    90457.0 |      92282.0 |     1.02 |
|       8 |      64 |    89293.0 |      92022.0 |     1.03 |
|       8 |     128 |    92687.0 |      92768.0 |      1.0 |
|       8 |     256 |    91874.0 |      91665.0 |      1.0 |
|      64 |       1 |    11829.0 |      12031.0 |     1.02 |
|      64 |       4 |    11959.0 |      12832.0 |     1.07 |
|      64 |      16 |    11331.0 |      13417.0 |     1.18 |
|      64 |      64 |    11108.0 |      13588.0 |     1.22 |
|      64 |     128 |    11089.0 |      13648.0 |     1.23 |
|      64 |     256 |    10381.0 |      13542.0 |      1.3 |
+---------+---------+------------+--------------+----------+

Numbers for all configurations are in the attached 'improvements.out'. It shows that removing WALBufMappingLock almost always doesn't harm performance and usually gives a measurable gain. (Numbers are averages of the 4 middle runs out of 6, i.e. I threw away the minimum and maximum tps of the 6 runs and took the average of the remaining ones.) An sqlite database with all results is also attached. It also contains results for the patch "Several attempts to lock WALInsertLock" (named "attempts") and the cumulative patch ("nowalbuf-attempts"). Surprisingly, "Several attempts" causes a measurable impact in some configurations with hundreds of clients. 
So, there're more bottlenecks ahead )) Yes, it is still not a "real-world" benchmark. But it at least shows the patch is harmless. -- regards Yura Sokolov aka funny-falcon
Attachment
On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 14.03.2025 17:30, Tomas Vondra wrote: > > Hi, > > > > I've briefly looked at this patch this week, and done a bit of testing. > > I don't have any comments about the correctness - it does seem correct > > to me and I haven't noticed any crashes/issues, but I'm not familiar > > with the WALBufMappingLock enough to have insightful opinions. > > > > I have however decided to do a bit of benchmarking, to better understand > > the possible benefits of the change. I happen to have access to an Azure > > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that > > can do ~1.5GB/s. > > > > The benchmark script (attached) uses the workload mentioned by Andres > > some time ago [1] > > > > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); > > > > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated > > results look like this (this is throughput): > > > > | 8 | 64 | 1024 > > clients | master patched | master patched | master patched > > --------------------------------------------------------------------- > > 1 | 11864 12035 | 7419 7345 | 968 940 > > 4 | 26311 26919 | 12414 12308 | 1304 1293 > > 8 | 38742 39651 | 14316 14539 | 1348 1348 > > 16 | 57299 59917 | 15405 15871 | 1304 1279 > > 32 | 74857 82598 | 17589 17126 | 1233 1233 > > 48 | 87596 95495 | 18616 18160 | 1199 1227 > > 64 | 89982 97715 | 19033 18910 | 1196 1221 > > 96 | 92853 103448 | 19694 19706 | 1190 1210 > > 128 | 95392 103324 | 20085 19873 | 1188 1213 > > 160 | 94933 102236 | 20227 20323 | 1180 1214 > > 196 | 95933 103341 | 20448 20513 | 1188 1199 > > > > To put this into a perspective, this throughput relative to master: > > > > clients | 8 64 1024 > > ---------------------------------- > > 1 | 101% 99% 97% > > 4 | 102% 99% 99% > > 8 | 102% 102% 100% > > 16 | 105% 103% 98% > > 32 | 110% 97% 100% > > 48 | 109% 98% 102% > > 64 | 109% 99% 102% > > 96 | 111% 100% 102% > > 128 | 108% 99% 102% 
> > 160 | 108% 100% 103% > > 196 | 108% 100% 101% > > > > That does not seem like a huge improvement :-( Yes, there's 1-10% > > speedup for the small (8K) size, but for larger chunks it's a wash. > > > > Looking at the pgbench progress, I noticed stuff like this: > > > > ... > > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed > > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed > > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed > > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed > > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed > > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed > > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed > > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed > > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed > > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed > > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed > > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed > > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed > > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed > > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed > > ... > > > > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny > > things (it's a virtual SSD, I'm not sure what model is that behind the > > curtains). So I decided to try running the benchmark on tmpfs, to get > > the storage out of the way and get the "best case" results. 
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> >          |        8        |       64        |      1024
> >  clients | master  patched | master  patched | master  patched
> > ---------|-----------------|-----------------|----------------
> >        1 |  32449    32032 |  19289    20344 |   3108     3081
> >        4 |  68779    69256 |  24585    29912 |   2915     3449
> >        8 |  79787   100655 |  28217    39217 |   3182     4086
> >       16 | 113024   148968 |  42969    62083 |   5134     5712
> >       32 | 125884   170678 |  44256    71183 |   4910     5447
> >       48 | 125571   166695 |  44693    76411 |   4717     5215
> >       64 | 122096   160470 |  42749    83754 |   4631     5103
> >       96 | 120170   154145 |  42696    86529 |   4556     5020
> >      128 | 119204   152977 |  40880    88163 |   4529     5047
> >      160 | 116081   152708 |  42263    88066 |   4512     5000
> >      196 | 115364   152455 |  40765    88602 |   4505     4952
> >
> > and the comparison to master:
> >
> >  clients      8     64   1024
> > -----------------------------
> >        1    99%   105%    99%
> >        4   101%   122%   118%
> >        8   126%   139%   128%
> >       16   132%   144%   111%
> >       32   136%   161%   111%
> >       48   133%   171%   111%
> >       64   131%   196%   110%
> >       96   128%   203%   110%
> >      128   128%   216%   111%
> >      160   132%   208%   111%
> >      196   132%   217%   110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks,
> > and we'd need to fix all of them - in which case it'd be silly to block
> > the improvements on the grounds that this one alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
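[Editor's note: Tomas's benchmark script is attached to his mail and not reproduced in the thread. A driver along these lines could generate such a sweep; the script file name, run duration, and exact pgbench flags below are assumptions for illustration, not taken from the attachment.]

```python
# Hypothetical sketch of a driver for the sweep above; it only builds the
# pgbench command lines for each (clients, size) combination. The custom
# transaction script (assumed to be "emit.sql") would contain the single
# statement:
#   SELECT pg_logical_emit_message(true, 'test', repeat('0', :SIZE));
import shlex

def build_pgbench_cmd(clients: int, size: int,
                      script: str = "emit.sql", duration: int = 30) -> str:
    args = ["pgbench",
            "-n",                  # skip vacuum (no standard pgbench tables)
            "-f", script,          # custom transaction script
            "-c", str(clients),    # number of client connections
            "-j", str(clients),    # one worker thread per client
            "-T", str(duration),   # run time in seconds (assumed value)
            "-P", "1",             # per-second progress lines, as quoted above
            "-D", f"SIZE={size}"]  # exposed to the script as :SIZE
    return shlex.join(args)

if __name__ == "__main__":
    for size in (8192, 65536, 1048576):      # 8K, 64K, 1024K payloads
        for clients in (1, 4, 8, 16, 32, 48, 64, 96, 128, 160, 196):
            print(build_pgbench_cmd(clients, size))
```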
>
> I've made a similar benchmark on a system with two Xeon Gold 5220R and two
> Samsung SSD 970 PRO 1TB mirrored by md.
>
> Configuration changes:
>
>   wal_sync_method = open_datasync
>   full_page_writes = off
>   synchronous_commit = off
>   checkpoint_timeout = 1d
>   max_connections = 1000
>   max_wal_size = 4GB
>   min_wal_size = 640MB
>
> I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB
> and 1GB) and the record size (1kB, 8kB and 64kB).
>
> (I didn't bench the 1MB record size, since I don't believe it is critical
> for performance.)
>
> Here are the results for 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> |       1 |       1 |    47991.0 |      46995.0 |     0.98 |
> |       1 |       4 |   171930.0 |     171166.0 |     1.0  |
> |       1 |      16 |   491240.0 |     485132.0 |     0.99 |
> |       1 |      64 |   514590.0 |     515534.0 |     1.0  |
> |       1 |     128 |   547222.0 |     543543.0 |     0.99 |
> |       1 |     256 |   543353.0 |     540802.0 |     1.0  |
> |       8 |       1 |    40976.0 |      41603.0 |     1.02 |
> |       8 |       4 |    89003.0 |      92008.0 |     1.03 |
> |       8 |      16 |    90457.0 |      92282.0 |     1.02 |
> |       8 |      64 |    89293.0 |      92022.0 |     1.03 |
> |       8 |     128 |    92687.0 |      92768.0 |     1.0  |
> |       8 |     256 |    91874.0 |      91665.0 |     1.0  |
> |      64 |       1 |    11829.0 |      12031.0 |     1.02 |
> |      64 |       4 |    11959.0 |      12832.0 |     1.07 |
> |      64 |      16 |    11331.0 |      13417.0 |     1.18 |
> |      64 |      64 |    11108.0 |      13588.0 |     1.22 |
> |      64 |     128 |    11089.0 |      13648.0 |     1.23 |
> |      64 |     256 |    10381.0 |      13542.0 |     1.3  |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'. It
> shows that removing WALBufMappingLock almost never harms performance and
> usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I threw away
> the minimum and maximum tps of the 6 runs and averaged the rest.)
>
> An sqlite database with all results is also attached. It also contains
> results for the patch "Several attempts to lock WALInsertLock" (named
> "attempts") and the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" causes a measurable impact in some
> configurations with hundreds of clients. So there are more bottlenecks
> ahead ))
>
> Yes, it is still not a "real-world" benchmark. But it at least shows the
> patch is harmless.

Thank you for your experiments. Your results show up to a 30% speedup on
real hardware, not tmpfs. While this is still a corner case, I think that's
quite a result for a pretty local optimization.

At small connection counts there are cases both above and below 1.0. I
think this is due to statistical noise: if we calculate the average tps
ratio across the different experiments, it is still above 1.0 even for low
client counts.

sqlite> select clients, avg(ratio)
   ...> from (select walseg, walbuf, recsize, clients,
   ...>              (avg(tps) filter (where branch = 'nowalbuf')) /
   ...>              (avg(tps) filter (where branch = 'master')) as ratio
   ...>       from results
   ...>       where branch in ('master', 'nowalbuf')
   ...>       group by walseg, walbuf, recsize, clients) x
   ...> group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if there are no objections.
I think the second one ("Several attempts") still needs more work, as there
are regressions.

------
Regards,
Alexander Korotkov
Supabase
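[Editor's note: Yura's aggregation rule quoted above — drop the best and worst of six runs and average the middle four — is a simple trimmed mean. A minimal illustration, with invented tps values:]

```python
def middle_mean(runs, drop=1):
    """Average after discarding the `drop` lowest and `drop` highest values."""
    kept = sorted(runs)[drop:len(runs) - drop]
    return sum(kept) / len(kept)

# Six hypothetical tps measurements for one configuration; the extreme
# values (12900.0 and 13700.0) are discarded before averaging.
runs = [13390.0, 13480.0, 13510.0, 13545.0, 13700.0, 12900.0]
print(middle_mean(runs))  # -> 13481.25
```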