Thread: Re: Get rid of WALBufMappingLock
On Fri, Feb 7, 2025 at 1:24 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 07.02.2025 14:02, Yura Sokolov writes: > > 07.02.2025 01:26, Alexander Korotkov writes: > >> Hi! > >> > >> On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>> > >>> During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund > >>> used a benchmark which creates WAL records very intensively. While I think > >>> it is not completely fair (1MB log records are really rare), it pushed > >>> me to analyze the write-side waiting of the XLog machinery. > >>> > >>> First I tried to optimize WaitXLogInsertionsToFinish, but without great > >>> success (yet). > >>> > >>> While profiling, I found that a lot of time is spent in the memory clearing > >>> under the global WALBufMappingLock: > >>> > >>> MemSet((char *) NewPage, 0, XLOG_BLCKSZ); > >>> > >>> It is an obvious scalability bottleneck. > >>> > >>> So "challenge was accepted". > >>> > >>> Certainly, a backend should initialize pages without the exclusive lock. But > >>> then how to ensure pages were initialized? In other words, how to > >>> ensure XLogCtl->InitializedUpTo is correct. > >>> > >>> I've tried to play around with WALBufMappingLock, holding it for a short > >>> time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found > >>> WALBufMappingLock is not needed at all. > >>> > >>> Instead of holding the lock, it is better to allow backends to cooperate: > >>> - I bound a ConditionVariable to each xlblocks entry, > >>> - every backend now checks that every required block pointed to by > >>> InitializedUpTo was successfully initialized, or sleeps on its condvar, > >>> - when a backend is sure the block is initialized, it tries to update > >>> InitializedUpTo using the condition variable. > >> > >> Looks reasonable to me, but having a ConditionVariable per xlog buffer > >> seems like overkill to me. Find attached a revision where I've > >> implemented advancing InitializedUpTo without a ConditionVariable. 
> >> After initialization of each buffer there is an attempt to do a CAS for > >> InitializedUpTo in a loop. So, multiple processes will try to advance > >> InitializedUpTo; they could hijack the initiative from each other, but > >> there is always a leader which will finish the work. > >> > >> There is only one ConditionVariable to wait for InitializedUpTo being advanced. > >> > >> I didn't benchmark my version, just checked that tests passed. > > > > Good day, Alexander. > > It seems I was mistaken twice. > > > I've got mixed but quite close results for both approaches (single or many > > ConditionVariables) on the notebook. Since I have no access to a larger > > machine, I can't prove "many" is way better (or discover it is worse). > > 1. "many condvars" (my variant) is strictly worse with num locks = 64 and > when pg_logical_emit_message emits just 1kB instead of 1MB. > > Therefore, "single condvar" is strictly better. > > > Given the patch after cleanup looks a bit smaller and clearer, I agree to keep > > just a single condition variable. > > > > The cleaned version is attached. > > > > I've changed the condition for broadcast a bit ("less" instead of "not equal"): > > - the buffer's border may already have gone into the future, > > - and then another backend will reach a not-yet-initialized buffer and will > > broadcast. > > 2. I've inserted an abort if the "buffer's border went into the future", and wasn't able to trigger it. > > So I returned the update-loop's body to your variant. Good, thank you. I think the 0001 patch is generally good, but needs some further polishing, e.g. more comments explaining how it works. Regarding the 0002 patch, it looks generally reasonable. But are 2 attempts always optimal? Are there cases of regression, or cases when more attempts are even better? Could we have some self-adjusting mechanism there, like what we have for spinlocks? ------ Regards, Alexander Korotkov Supabase
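The CAS-based advance loop described above can be sketched roughly as follows (a minimal, hypothetical model in plain C11 atomics — NBUFFERS, the index arithmetic, and the function name are illustrative stand-ins, not the actual xlog.c code):

```c
#include <stdatomic.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192
#define NBUFFERS    4

/* Hypothetical stand-ins for the XLogCtl fields discussed above. */
static _Atomic uint64_t InitializedUpTo;
static _Atomic uint64_t xlblocks[NBUFFERS]; /* end-LSN of each initialized page */

/*
 * After initializing a buffer, any backend may try to advance
 * InitializedUpTo past all consecutively initialized buffers.  Backends
 * can hijack the initiative from each other via CAS, but some backend
 * always completes the advance.
 */
static void
advance_initialized_upto(void)
{
	uint64_t	upto = atomic_load(&InitializedUpTo);

	for (;;)
	{
		int			idx = (int) ((upto / XLOG_BLCKSZ) % NBUFFERS);
		uint64_t	bound = atomic_load(&xlblocks[idx]);

		/* next page not initialized up to its expected border: stop */
		if (bound != upto + XLOG_BLCKSZ)
			break;

		/*
		 * Try to advance.  On failure another backend moved it; the CAS
		 * reloads the current value into upto, so simply retry.
		 */
		if (atomic_compare_exchange_strong(&InitializedUpTo, &upto, bound))
			upto = bound;
	}
}
```

In the real patch a single ConditionVariable is then broadcast so that waiters re-check the advanced value.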
On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > Good, thank you. I think the 0001 patch is generally good, but needs some > further polishing, e.g. more comments explaining how it works. Two things come to my mind worth rechecking about 0001. 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and XLogCtl->xlblocks always page-aligned? Because the algorithm seems to be sensitive to that. If so, I would propose to explicitly comment that and add corresponding asserts. 2) Check if there are concurrency issues between AdvanceXLInsertBuffer() and switching to the new WAL file. ------ Regards, Alexander Korotkov Supabase
13.02.2025 12:34, Alexander Korotkov writes: > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: >> 08.02.2025 13:07, Alexander Korotkov writes: >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: >>>> Good, thank you. I think the 0001 patch is generally good, but needs some >>>> further polishing, e.g. more comments explaining how it works. >> >> I tried to add more comments. I'm not good at it, so recommendations are welcome. >> >>> Two things come to my mind worth rechecking about 0001. >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and >>> XLogCtl->xlblocks always page-aligned? Because the algorithm seems to be >>> sensitive to that. If so, I would propose to explicitly comment that >>> and add corresponding asserts. >> >> They're certainly page aligned, since they are page borders. >> I added an assert on alignment of InitializeReserved for sanity. >> >>> 2) Check if there are concurrency issues between >>> AdvanceXLInsertBuffer() and switching to the new WAL file. >> >> There are no issues: >> 1. CopyXLogRecordToWAL for isLogSwitch follows the same protocol, i.e. uses >> GetXLogBuffer to zero out the WAL page. >> 2. WALINSERT_SPECIAL_SWITCH forces an exclusive lock on all insertion locks, >> so switching WAL is not concurrent. (Although there is no need for this >> exclusiveness, imho.) > > Good, thank you. I've also revised the commit message and comments. > > But I see another issue with this patch. In the worst case, we do > XLogWrite() by ourselves, and it could potentially error out. > Without the patch, that would cause WALBufMappingLock to be released and > XLogCtl->InitializedUpTo not advanced. With the patch, that would > cause other processes to wait infinitely until we finish the > initialization. > > A possible solution would be to save the position of the page to be > initialized, and set it back to XLogCtl->InitializeReserved on error > (everywhere we do LWLockReleaseAll()). 
We also must check that on > error we only set XLogCtl->InitializeReserved to the past, because > there could be multiple concurrent failures. Also we need to > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. The single place where AdvanceXLInsertBuffer is called outside of a critical section is in XLogBackgroundFlush. All other call stacks will cause a server restart if XLogWrite raises an error inside AdvanceXLInsertBuffer. XLogBackgroundFlush explicitly avoids writing buffers by passing the opportunistic=true parameter. Therefore, an error in XLogWrite will not cause a hang in AdvanceXLInsertBuffer, since the server will shutdown/restart. Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the call to XLogWrite here? ------- regards Yura Sokolov aka funny-falcon
Hi, Yura and Alexander! On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > >>>> Good, thank you. I think 0001 patch is generally good, but needs some > > >>>> further polishing, e.g. more comments explaining how does it work. > > >> > > >> I tried to add more comments. I'm not good at, so recommendations are welcome. > > >> > > >>> Two things comes to my mind worth rechecking about 0001. > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > >>> sensitive to that. If so, I would propose to explicitly comment that > > >>> and add corresponding asserts. > > >> > > >> They're certainly page aligned, since they are page borders. > > >> I added assert on alignment of InitializeReserved for the sanity. > > >> > > >>> 2) Check if there are concurrency issues between > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > >> > > >> There are no issues: > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > >> GetXLogBuffer to zero-out WAL page. > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > >> so switching wal is not concurrent. (Although, there is no need in this > > >> exclusiveness, imho.) > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > But I see another issue with this patch. In the worst case, we do > > > XLogWrite() by ourselves, and it could potentially could error out. 
> > > Without the patch, that would cause WALBufMappingLock to be released and > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > cause other processes to wait infinitely until we finish the > > > initialization. > > > > > > A possible solution would be to save the position of the page to be > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > error we only set XLogCtl->InitializeReserved to the past, because > > > there could be multiple concurrent failures. Also we need to > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > The single place where AdvanceXLInsertBuffer is called outside of a critical > > section is in XLogBackgroundFlush. All other call stacks will cause a server > > restart if XLogWrite raises an error inside AdvanceXLInsertBuffer. > > > > XLogBackgroundFlush explicitly avoids writing buffers by passing > > the opportunistic=true parameter. > > > > Therefore, an error in XLogWrite will not cause a hang in AdvanceXLInsertBuffer > > since the server will shutdown/restart. > > > > Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the call > > to XLogWrite here? > > You're correct. I just reflected this in the next revision of the patch. I've looked at the patchset v6. The overall goal of avoiding locking looks good. Both patches look right to me. The only thing I'm slightly concerned about is whether the patchset could demonstrate performance differences in any real workload. I appreciate it as a beautiful optimization anyway. I have the following proposals to change the code and comments: For patch 0001: - Struct XLBlocks contains just one pg_atomic_uint64 member. Is it still needed as a struct? These changes make up a significant volume of changes to the patch while being a noop. Maybe it was inherited from v1 and is not needed anymore. - Furthermore, when xlblocks became a struct, comments like: > and xlblocks values certainly do. 
mentions of xlblocks values need to be changed to xlblocks.bound. This could be avoided by changing xlblocks back from type XLBlocks * to pg_atomic_uint64 *. - It's worth commenting InitializedUpTo/InitializedUpToCondVar in more detail than: + * It is updated to successfully initialized buffer's identities, perhaps + * waiting on conditional variable bound to buffer. "perhaps waiting" could also be in the style "maybe/even while AAA waits BBB". "lock-free with cooperation with" -> "lock-free accompanied by changes to..." - The comment inside typedef struct XLogCtlData: /* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */ needs to be reverted to /* 1st byte ptr-s + XLOG_BLCKSZ */ - Commented-out code to clean up in the final patch: //ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar); - In AdvanceXLInsertBuffer(), npages is initialised to 0 but is not increased anymore. The block under > if (XLOG_DEBUG && npages > 0) became unreachable. (InvalidXLogRecPtr + 1) is essentially 0+1, and IMO this semantically calls for adding #define FirstValidXLogRecPtr 1 Typo in the commit message: %s/usign/using/g For patch 0002: I think Yura's explanation from above in this thread needs to get a place in the commit message and in a comment to this: > int attempts = 2; Comments around "try another lock next time" could be modified to reflect that we repeat twice. Kind regards, Pavel Borisov Supabase
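For context on 0002's two-attempt scheme that the `int attempts = 2;` comment would document, a hedged sketch (hypothetical names and toy atomic flags — the real patch conditionally acquires LWLocks from the NUM_XLOGINSERT_LOCKS array and then falls back to a blocking acquire):

```c
#include <stdatomic.h>

#define NUM_XLOGINSERT_LOCKS 8	/* must be a power of 2 for the & mask */

/* Toy stand-ins for the insertion locks: 0 = free, 1 = held. */
static _Atomic int locks[NUM_XLOGINSERT_LOCKS];

/*
 * Try up to two different insertion locks conditionally before giving
 * up.  Returns the index of the acquired lock, or -1 (at which point
 * the real code would block on a lock instead).
 */
static int
try_insert_lock(unsigned start)
{
	int			attempts = 2;
	unsigned	idx = start & (NUM_XLOGINSERT_LOCKS - 1);

	while (attempts-- > 0)
	{
		/* conditional acquire: succeed only if the lock was free */
		if (atomic_exchange(&locks[idx], 1) == 0)
			return (int) idx;
		idx = (idx + 1) & (NUM_XLOGINSERT_LOCKS - 1);
	}
	return -1;
}
```

The power-of-2 requirement exists so that the wrap-around can be a cheap bitmask rather than a modulo.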
Hi, Alexander! On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi, Pavel! > > On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > > On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > >>>> Good, thank you. I think 0001 patch is generally good, but needs some > > > > >>>> further polishing, e.g. more comments explaining how does it work. > > > > >> > > > > >> I tried to add more comments. I'm not good at, so recommendations are welcome. > > > > >> > > > > >>> Two things comes to my mind worth rechecking about 0001. > > > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > > > >>> sensitive to that. If so, I would propose to explicitly comment that > > > > >>> and add corresponding asserts. > > > > >> > > > > >> They're certainly page aligned, since they are page borders. > > > > >> I added assert on alignment of InitializeReserved for the sanity. > > > > >> > > > > >>> 2) Check if there are concurrency issues between > > > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > > > >> > > > > >> There are no issues: > > > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > > > >> GetXLogBuffer to zero-out WAL page. > > > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > > > >> so switching wal is not concurrent. (Although, there is no need in this > > > > >> exclusiveness, imho.) > > > > > > > > > > Good, thank you. 
I've also revised commit message and comments. > > > > > > > > > > But I see another issue with this patch. In the worst case, we do > > > > > XLogWrite() by ourselves, and it could potentially could error out. > > > > > Without patch, that would cause WALBufMappingLock be released and > > > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > > > cause other processes infinitely waiting till we finish the > > > > > initialization. > > > > > > > > > > Possible solution would be to save position of the page to be > > > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > > > error we only set XLogCtl->InitializeReserved to the past, because > > > > > there could be multiple concurrent failures. Also we need to > > > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > > > section is in XLogBackgroundFlush. All other call stacks will issue server > > > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > > > opportunistic=true parameter. > > > > > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > > > since server will shutdown/restart. > > > > > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > > > > to XLogWrite here? > > > > > > You're correct. I just reflected this in the next revision of the patch. > > > > I've looked at the patchset v6. > > Oh, sorry, I really did wrong. I've done git format-patch for wrong > local branch for v5 and v6. Patches I've sent for v5 and v6 are > actually the same as my v1. This is really pity. Please, find the > right version of patchset attached. I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it landed in v7. 
Other changes do not concern code behavior. The things from my previous review that still apply to v7: For 0001: Proposed comment change: "lock-free with cooperation with" -> "lock-free accompanied by changes to..." (or another variant) I propose a new define: #define FirstValidXLogRecPtr 1 While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code, that has no semantic meaning, and it's better to avoid using direct arithmetic to derive the meaning of FirstValidXLogRecPtr from InvalidXLogRecPtr. For 0002, both comment proposals from my message applied to v6 apply to v7 as well [1] https://www.postgresql.org/message-id/d6799557-e352-42c8-80cc-ed36e3b8893c%40postgrespro.ru Regards, Pavel Borisov Supabase
Hi! On Fri, Feb 14, 2025 at 4:11 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 14.02.2025 17:09, Yura Sokolov пишет: > > 14.02.2025 13:24, Alexander Korotkov пишет: > >> On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >>> On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>> On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >>>>> On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>>>> > >>>>>> On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>>>>>> 13.02.2025 12:34, Alexander Korotkov пишет: > >>>>>>>> On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > >>>>>>>>> 08.02.2025 13:07, Alexander Korotkov пишет: > >>>>>>>>>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > >>>>>>>>>>> Good, thank you. I think 0001 patch is generally good, but needs some > >>>>>>>>>>> further polishing, e.g. more comments explaining how does it work. > >>>>>>>>> > >>>>>>>>> I tried to add more comments. I'm not good at, so recommendations are welcome. > >>>>>>>>> > >>>>>>>>>> Two things comes to my mind worth rechecking about 0001. > >>>>>>>>>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > >>>>>>>>>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > >>>>>>>>>> sensitive to that. If so, I would propose to explicitly comment that > >>>>>>>>>> and add corresponding asserts. > >>>>>>>>> > >>>>>>>>> They're certainly page aligned, since they are page borders. > >>>>>>>>> I added assert on alignment of InitializeReserved for the sanity. > >>>>>>>>> > >>>>>>>>>> 2) Check if there are concurrency issues between > >>>>>>>>>> AdvanceXLInsertBuffer() and switching to the new WAL file. > >>>>>>>>> > >>>>>>>>> There are no issues: > >>>>>>>>> 1. 
CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > >>>>>>>>> GetXLogBuffer to zero-out WAL page. > >>>>>>>>> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > >>>>>>>>> so switching wal is not concurrent. (Although, there is no need in this > >>>>>>>>> exclusiveness, imho.) > >>>>>>>> > >>>>>>>> Good, thank you. I've also revised commit message and comments. > >>>>>>>> > >>>>>>>> But I see another issue with this patch. In the worst case, we do > >>>>>>>> XLogWrite() by ourselves, and it could potentially could error out. > >>>>>>>> Without patch, that would cause WALBufMappingLock be released and > >>>>>>>> XLogCtl->InitializedUpTo not advanced. With the patch, that would > >>>>>>>> cause other processes infinitely waiting till we finish the > >>>>>>>> initialization. > >>>>>>>> > >>>>>>>> Possible solution would be to save position of the page to be > >>>>>>>> initialized, and set it back to XLogCtl->InitializeReserved on error > >>>>>>>> (everywhere we do LWLockReleaseAll()). We also must check that on > >>>>>>>> error we only set XLogCtl->InitializeReserved to the past, because > >>>>>>>> there could be multiple concurrent failures. Also we need to > >>>>>>>> broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > >>>>>>> > >>>>>>> The single place where AdvanceXLInsertBuffer is called outside of critical > >>>>>>> section is in XLogBackgroundFlush. All other call stacks will issue server > >>>>>>> restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > >>>>>>> > >>>>>>> XLogBackgroundFlush explicitely avoids writing buffers by passing > >>>>>>> opportunistic=true parameter. > >>>>>>> > >>>>>>> Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > >>>>>>> since server will shutdown/restart. > >>>>>>> > >>>>>>> Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > >>>>>>> to XLogWrite here? > >>>>>> > >>>>>> You're correct. 
I just reflected this in the next revision of the patch. > >>>>> > >>>>> I've looked at the patchset v6. > >>>> > >>>> Oh, sorry, I really did it wrong. I've done git format-patch for the wrong > >>>> local branch for v5 and v6. The patches I've sent for v5 and v6 are > >>>> actually the same as my v1. This is a real pity. Please find the > >>>> right version of the patchset attached. > >>> > >>> I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it > >>> landed in v7. > >>> > >>> Other changes do not concern code behavior. The things from my > >>> previous review that still apply to v7: > >>> > >>> For 0001: > >>> > >>> Proposed comment change: > >>> "lock-free with cooperation with" -> "lock-free accompanied by changes > >>> to..." (or another variant) > >> > >> Good catch. I've rephrased this comment even more. > >> > >>> I propose a new define: > >>> #define FirstValidXLogRecPtr 1 > >>> While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code, > >>> that has no semantic meaning, and it's better to avoid using direct > >>> arithmetic to derive the meaning of FirstValidXLogRecPtr from > >>> InvalidXLogRecPtr. > >> > >> Makes sense, but I'm not sure if this change is required at all. I've > >> reverted this to the state of master, and everything seems to work. > >> > >>> For 0002, both comment proposals from my message applied to v6 apply > >>> to v7 as well > >> > >> Thank you for pointing that out. For now, I'm concentrating on improvements to > >> 0001. Probably Yura could work on your notes for 0002. > > > > I wrote a good commit message for 0002 with calculated probabilities and > > a simple Ruby program which calculates them, to explain the choice of 2 > > conditional attempts. (At least I hope the message is good.) And added > > a simple comment before `int attempts = 2;` > > > > Also I simplified 0002 a bit to look prettier (i.e. without goto), and > > added a static assert on NUM_XLOGINSERT_LOCKS being a power of 2. 
> > > > (0001 patch is the same as for v8) > Oops, forgot to add the StaticAssert into v9-0002. Thank you. I'm planning to push 0001 if there are no objections. And I'm planning to do more review/revision of 0002. ------ Regards, Alexander Korotkov Supabase
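Yura's Ruby program itself isn't quoted in the thread; as a rough stand-in for the probability math behind "2 conditional attempts" (my assumption: two independent uniform probes, which ignores the sequential probing the patch actually does, so this is only an approximation):

```c
/*
 * Back-of-the-envelope model (an assumption, not the thread's actual
 * Ruby program): with n_locks insertion locks of which n_held are
 * currently busy, a single conditional acquire fails with probability
 * n_held/n_locks, and two independent attempts both fail with roughly
 * the square of that.  Adding a second attempt thus sharply reduces
 * the chance of falling back to a blocking acquire.
 */
static double
p_both_attempts_fail(int n_locks, int n_held)
{
	double		p = (double) n_held / (double) n_locks;

	return p * p;
}
```

For example, with half of 8 locks busy, one attempt fails 50% of the time but two attempts both fail only about 25% of the time.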
Hi! I spotted a typo in v10: + /* + * Page at nextidx wasn't initialized yet, so we cann't move + * InitializedUpto further. It will be moved by backend which + * will initialize nextidx. + */ cann't - > can't moved by backend -> moved by the backend -- Best regards, Kirill Reshke
Hey.
I find the “Get rid of WALBufMappingLock” commit message misleading, 'cos the lock is actually being replaced by a CV.
Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead?
Victor Yegorov
Hi, Victor! On Mon, 17 Feb 2025 at 12:47, Victor Yegorov <vyegorov@gmail.com> wrote: > > Hey. > > I find “Get rid of WALBufMappingLock" commit message misleading, 'cos Lock it's being replaced by CV, actually. > > Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead? The patch replaces WALBufMappingLock with a lockless algorithm based on atomic variables and a CV. Mentioning only the CV in the header covers only a part of the implementation. Also, the header should reflect what is done on the whole rather than the implementation details. So I'd rather see a header like "Replace WALBufMappingLock by lockless algorithm" or "Initialize WAL buffers concurrently without using WALBufMappingLock" or something like that. Kind regards, Pavel Borisov Supabase
On Mon, 17 Feb 2025 at 13:20, Pavel Borisov <pashkin.elfe@gmail.com> wrote: > > Hi, Victor! > > On Mon, 17 Feb 2025 at 12:47, Victor Yegorov <vyegorov@gmail.com> wrote: > > > > Hey. > > > > I find “Get rid of WALBufMappingLock" commit message misleading, 'cos Lock it's being replaced by CV, actually. > > > > Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead? > > The patch replaces WALBufMappingLock with a lockless algorithm based > on atomic variables and a CV. Mentioning only the CV in the header covers only a > part of the implementation. Also, the header should reflect what is > done on the whole rather than the implementation details. So I'd rather see > a header like "Replace WALBufMappingLock by lockless algorithm" or > "Initialize WAL buffers concurrently without using WALBufMappingLock" > or something like that. Update: I see the patch is already committed, so we're late with the naming proposals. I don't see a problem with the existing commit message, TBH. Kind regards, Pavel Borisov
On Mon, Feb 17, 2025 at 11:44 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > Oops, I sent the wrong patch as a fix. > The right one is attached. > > Pavel I've spotted a failure on the buildfarm. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03 I can't quickly guess the reason. I'm going to revert the patch for now; then we can investigate. ------ Regards, Alexander Korotkov Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes: > I've spotted the failure on the buildfarm. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03 > I can't quickly guess the reason. I'm going to revert patch for now, > then we investigate This timeout failure on hachi looks suspicious as well: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 Might be relevant that they are both aarch64? regards, tom lane
On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote: > > This timeout failure on hachi looks suspicious as well: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 > > > > Might be relevant that they are both aarch64? > > Just logged into the host. The logs of the timed out run are still > around, and the last information I can see is from lastcommand.log, > which seems to have frozen in time when the timeout has begun its > vacuuming work: > ok 73 + index_including_gist 353 ms > # parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse > > gokiburi is on the same host, and it is currently frozen in time when > trying to fetch a WAL buffer. One of the stack traces: > #2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0, > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at > latch.c:1571 > #3 WaitEventSetWait (set=0xd34ce0, timeout=-1, > occurred_events=occurred_events@entry=0xffffffffadd8, > nevents=nevents@entry=1, wait_event_info=<optimized out>, > wait_event_info@entry=134217781) at latch.c:1519 > #4 0x000000000084e964 in WaitLatch (latch=<optimized out>, > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, > wait_event_info=wait_event_info@entry=134217781) at latch.c:538 > #5 0x000000000085d2f8 in ConditionVariableTimedSleep > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at > condition_variable.c:163 > #6 0x000000000085d1ec in ConditionVariableSleep > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98 > #7 0x000000000055f4f4 in AdvanceXLInsertBuffer > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at > xlog.c:2224 > #8 0x0000000000568398 in GetXLogBuffer (ptr=ptr@entry=112064880, > tli=tli@entry=1) at xlog.c:1710 > #9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80, > isLogSwitch=false, rdata=0xcc49b0 
<hdr_rdt>, StartPos=<optimized out>, > EndPos=<optimized out>, tli=1) at xlog.c:1245 > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>, > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928 > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020', > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523 > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950, > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>, > parent=<optimized out>, > > So, yes, something looks really wrong with this patch. Sounds > plausible to me that some other buildfarm animals could be stuck > without their owners knowing about it. It's proving to be a good idea > to force a timeout value in the configuration file of these animals.. Tom, Michael, thank you for the information. This patch will be better tested before next attempt. ------ Regards, Alexander Korotkov Supabase
On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote: > > > This timeout failure on hachi looks suspicious as well: > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03 > > > > > > Might be relevant that they are both aarch64? > > > > Just logged into the host. The logs of the timed out run are still > > around, and the last information I can see is from lastcommand.log, > > which seems to have frozen in time when the timeout has begun its > > vacuuming work: > > ok 73 + index_including_gist 353 ms > > # parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse > > > > gokiburi is on the same host, and it is currently frozen in time when > > trying to fetch a WAL buffer. One of the stack traces: > > #2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0, > > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at > > latch.c:1571 > > #3 WaitEventSetWait (set=0xd34ce0, timeout=-1, > > occurred_events=occurred_events@entry=0xffffffffadd8, > > nevents=nevents@entry=1, wait_event_info=<optimized out>, > > wait_event_info@entry=134217781) at latch.c:1519 > > #4 0x000000000084e964 in WaitLatch (latch=<optimized out>, > > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, > > wait_event_info=wait_event_info@entry=134217781) at latch.c:538 > > #5 0x000000000085d2f8 in ConditionVariableTimedSleep > > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at > > condition_variable.c:163 > > #6 0x000000000085d1ec in ConditionVariableSleep > > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98 > > #7 0x000000000055f4f4 in AdvanceXLInsertBuffer > > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at > > xlog.c:2224 > > #8 0x0000000000568398 in GetXLogBuffer 
(ptr=ptr@entry=112064880, > > tli=tli@entry=1) at xlog.c:1710 > > #9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80, > > isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>, > > EndPos=<optimized out>, tli=1) at xlog.c:1245 > > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>, > > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized > > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928 > > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020', > > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523 > > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950, > > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>, > > parent=<optimized out>, > > > > So, yes, something looks really wrong with this patch. Sounds > > plausible to me that some other buildfarm animals could be stuck > > without their owners knowing about it. It's proving to be a good idea > > to force a timeout value in the configuration file of these animals.. > > Tom, Michael, thank you for the information. > This patch will be better tested before the next attempt. It seems that I managed to reproduce the issue on my Raspberry PI 4. After running our test suite in a loop for 2 days I found one timeout. I have a hypothesis on why it might happen. We don't have protection against two backends in parallel getting ReservedPtr mapped to a single XLog buffer. I've talked to Yura off-list about that. He pointed out that XLogWrite() should issue a PANIC in that case, which we didn't observe. However, I'm not sure this analysis is complete. One way or another, we need protection against this situation anyway. The updated patch is attached. Now, after acquiring ReservedPtr, it waits till OldPageRqstPtr gets initialized. Additionally, I had to implement a more accurate calculation of OldPageRqstPtr. I'm running tests with the new patch on my Raspberry in a loop. Let's see how it goes. ------ Regards, Alexander Korotkov Supabase
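The reservation step under discussion can be sketched as follows (a hypothetical simplification: claiming pages by atomic fetch-add guarantees unique reservations; the separate fix described above — waiting until OldPageRqstPtr's buffer is written out before reusing it — is not shown):

```c
#include <stdatomic.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192

/* Hypothetical stand-in for XLogCtl->InitializeReserved. */
static _Atomic uint64_t InitializeReserved;

/*
 * Claim the start LSN of the next WAL page to initialize.  fetch_add
 * hands out each page to exactly one backend, so two backends can never
 * reserve the same page; what still must be guarded against is two
 * reservations mapping to the same buffer slot before the old page
 * there has been written out.
 */
static uint64_t
reserve_next_page(void)
{
	return atomic_fetch_add(&InitializeReserved, (uint64_t) XLOG_BLCKSZ);
}
```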
Attachment
On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: > It seems that I managed to reproduce the issue on my Raspberry PI 4. > After running our test suite in a loop for 2 days I found one timeout. Hmm. It's surprising to not see a higher occurrence. My buildfarm host has caught that on its first run after the patch, for two different animals which are both on the same machine. > One way or another, we need protection against this situation any way. > The updated patch is attached. Now, after acquiring ReservedPtr it > waits till OldPageRqstPtr gets initialized. Additionally I've to > implement more accurate calculation of OldPageRqstPtr. I run tests > with new patch on my Raspberry in a loop. Let's see how it goes. Perhaps you'd prefer that I do more tests with your patch? This is time-consuming for you. This is not a review of the internals of the patch, and I cannot give you access to the host, but if my stuff is the only place where we have good reproducibility of the issue, I'm OK to grab some time and run a couple of checks to avoid another freeze of the buildfarm. -- Michael
Attachment
> On 25 Feb 2025, at 20:19, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi! One little piece of code looks suspicious to me. But I was not raising a concern because I see similar code everywhere in the codebase. But now Kirill asked me to explain what is going on, and I cannot. This seems to be relevant… so. + while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) // Assume ConditionVariableBroadcast() happened here, but before next line + ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); + ConditionVariableCancelSleep(); Won’t this sleep wait forever? I see about 20 other occurrences of similar code, so, perhaps, everything is fine. But I would greatly appreciate a few pointers on why it works. Best regards, Andrey Borodin.
Hi, Michael! On Wed, Feb 26, 2025 at 3:04 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: > > It seems that I managed to reproduce the issue on my Raspberry PI 4. > > After running our test suite in a loop for 2 days I found one timeout. > > Hmm. It's surprising to not see a higher occurence. My buildfarm > host has caught that on its first run after the patch, for two > different animals which are both on the same machine. > > > One way or another, we need protection against this situation any way. > > The updated patch is attached. Now, after acquiring ReservedPtr it > > waits till OldPageRqstPtr gets initialized. Additionally I've to > > implement more accurate calculation of OldPageRqstPtr. I run tests > > with new patch on my Raspberry in a loop. Let's see how it goes. > > Perhaps you'd prefer that I do more tests with your patch? This is > time-consuming for you. This is not a review of the internals of the > patch, and I cannot give you access to the host, but if my stuff is > the only place where we have a good reproducibility of the issue, I'm > OK to grab some time and run a couple of checks to avoid again a > freeze of the buildfarm. Thank you for offering to help. An updated version of the patch is attached (I've added one memory barrier there just in case). I would appreciate it if you could run it on batta, hachi, or similar hardware. ------ Regards, Alexander Korotkov Supabase
Attachment
On Wed, Feb 26, 2025 at 01:48:47PM +0200, Alexander Korotkov wrote: > Thank you for offering the help. Updated version of patch is attached > (I've added one memory barrier there just in case). I would > appreciate if you could run it on batta, hachi or similar hardware. Doing a revert of the revert done in 3fb58625d18f proves that reproducing the error is not really difficult. I've done a make installcheck-world USE_MODULE_DB=1 -j N without assertions, and the point that saw a failure quickly is one of the tests of pgbench: PANIC: could not find WAL buffer for 0/19D366 This one happened for the test "concurrent OID generation" and CREATE TYPE. Of course, as it is a race condition, it is random, but it's taking me only a couple of minutes to see the original issue on my buildfarm host. With assertion failures enabled, same story, and same failure from the pgbench TAP test. Saying that, I have also done similar tests with your v12 for a couple of hours and this looks stable under installcheck-world. I can see that you've reworked the surroundings of InitializedFrom quite a bit in this one. If you apply that once again at some point, the buildfarm will be the judge in the long term, but I am rather confident in saying that the situation looks better here, at least. One thing I would consider doing if you want to gain confidence is tests like the one I saw causing problems with pgbench, with DDL patterns stressing specific paths like this CREATE TYPE case. -- Michael
Attachment
26.02.2025 11:52, Andrey Borodin wrote: >> On 25 Feb 2025, at 20:19, Alexander Korotkov <aekorotkov@gmail.com> wrote: >> >> > > > Hi! > > One little piece of code looks suspicious to me. But I was not raising a concern because I see similar code everywhere in the codebase. But now Kirill asked me to explain what is going on and I cannot. > > This seems to be relevant… so. > > + while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo)) > // Assume ConditionVariableBroadcast() happened here, but before next line > + ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT); > + ConditionVariableCancelSleep(); > > Won’t this sleep wait forever? Because ConditionVariableSleep() doesn't sleep the first time it is called. It just performs ConditionVariablePrepareToSleep() and immediately returns. So the actual condition of the `while` loop is checked at least twice before going to sleep. > > I see about 20 other occurrences of similar code, so, perhaps, everything is fine. But I would greatly appreciate a few pointers on why it works. ------- regards Yura Sokolov aka funny-falcon
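The first-call behavior described above can be modeled outside PostgreSQL as follows. This is a hypothetical sketch in plain C, not the actual condition_variable.c code; the names CVModel, cv_sleep_model and wait_for_init are illustrative. It shows why the wait condition is re-tested before any real sleep can happen, which closes the race Andrey asks about:

```c
#include <stdbool.h>

/* Hypothetical model of the protocol behind ConditionVariableSleep():
 * the first call only registers the caller on the wait list
 * (ConditionVariablePrepareToSleep) and returns without sleeping, so
 * the caller re-tests its wait condition before any real sleep. */
typedef struct
{
	bool	prepared;	/* has PrepareToSleep already run? */
	int		sleeps;		/* how many real sleeps happened */
} CVModel;

static void
cv_sleep_model(CVModel *cv)
{
	if (!cv->prepared)
	{
		cv->prepared = true;	/* ConditionVariablePrepareToSleep() */
		return;					/* no sleep on the first call */
	}
	cv->sleeps++;				/* a real WaitLatch() would block here */
}

/* Caller pattern from the patch:
 * while (upto >= InitializedUpTo) ConditionVariableSleep(...); */
int
wait_for_init(CVModel *cv, long *initialized_up_to, long upto)
{
	int		iterations = 0;

	while (upto >= *initialized_up_to)	/* condition re-checked each pass */
	{
		iterations++;
		cv_sleep_model(cv);
		if (cv->sleeps > 0)
			*initialized_up_to = upto + 1;	/* simulate the broadcast */
	}
	return iterations;
}
```

A broadcast that fires between the condition check and the first "sleep" call is therefore not lost: the first call only registers the waiter, and the loop checks the condition again before blocking for real.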
26.02.2025 14:48, Alexander Korotkov пишет: > Hi, Michael! > > On Wed, Feb 26, 2025 at 3:04 AM Michael Paquier <michael@paquier.xyz> wrote: >> >> On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote: >>> It seems that I managed to reproduce the issue on my Raspberry PI 4. >>> After running our test suite in a loop for 2 days I found one timeout. >> >> Hmm. It's surprising to not see a higher occurence. My buildfarm >> host has caught that on its first run after the patch, for two >> different animals which are both on the same machine. >> >>> One way or another, we need protection against this situation any way. >>> The updated patch is attached. Now, after acquiring ReservedPtr it >>> waits till OldPageRqstPtr gets initialized. Additionally I've to >>> implement more accurate calculation of OldPageRqstPtr. I run tests >>> with new patch on my Raspberry in a loop. Let's see how it goes. >> >> Perhaps you'd prefer that I do more tests with your patch? This is >> time-consuming for you. This is not a review of the internals of the >> patch, and I cannot give you access to the host, but if my stuff is >> the only place where we have a good reproducibility of the issue, I'm >> OK to grab some time and run a couple of checks to avoid again a >> freeze of the buildfarm. > > Thank you for offering the help. Updated version of patch is attached > (I've added one memory barrier there just in case). I would > appreciate if you could run it on batta, hachi or similar hardware. Good day, Alexander. Checked your additions to patch. They're clear and robust. ------- regards Yura Sokolov aka funny-falcon
On 2025-Feb-28, Michael Paquier wrote: > Saying that, I have also done similar tests with your v12 for a couple > of hours and this looks stable under installcheck-world. I can see > that you've reworked quite a bit the surroundings of InitializedFrom > in this one. If you apply that once again at some point, the > buildfarm will be judge in the long-term, but I am rather confident by > saying that the situation looks better here, at least. Heh, no amount of testing can prove lack of bugs; but for sure "it looks different now, so it must be correct" must be the weakest proof of correctness I've heard of! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) https://postgr.es/m/200606261359.k5QDxE2p004593@auth-smtp.hol.gr
On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote: > On 2025-Feb-28, Michael Paquier wrote: >> Saying that, I have also done similar tests with your v12 for a couple >> of hours and this looks stable under installcheck-world. I can see >> that you've reworked quite a bit the surroundings of InitializedFrom >> in this one. If you apply that once again at some point, the >> buildfarm will be judge in the long-term, but I am rather confident by >> saying that the situation looks better here, at least. > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > different now, so it must be correct" must be the weakest proof of > correctness I've heard of! Err, okay. I did use the word "stable" with tests rather than "correct", and I implied upthread that I did not check the correctness nor the internals of the patch. If my words held the meaning you are implying, well, my apologies for the confusion, I guess. I only tested the patch and it was stable while I've noticed a few diffs with the previous version, but I did *not* check its internals at all, nor do I mean that I endorse its logic. I hope that's clear now. -- Michael
Attachment
On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2025-Feb-28, Michael Paquier wrote: > > > Saying that, I have also done similar tests with your v12 for a couple > > of hours and this looks stable under installcheck-world. I can see > > that you've reworked quite a bit the surroundings of InitializedFrom > > in this one. If you apply that once again at some point, the > > buildfarm will be judge in the long-term, but I am rather confident by > > saying that the situation looks better here, at least. > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > different now, so it must be correct" must be the weakest proof of > correctness I've heard of! Michael just volunteered to help Yura and me with testing. He wasn't meant to be a reviewer. And he reported that the tests look much more stable now. I think he is absolutely right about this. ------ Regards, Alexander Korotkov Supabase
On Fri, Feb 28, 2025 at 3:44 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote: > > On 2025-Feb-28, Michael Paquier wrote: > >> Saying that, I have also done similar tests with your v12 for a couple > >> of hours and this looks stable under installcheck-world. I can see > >> that you've reworked quite a bit the surroundings of InitializedFrom > >> in this one. If you apply that once again at some point, the > >> buildfarm will be judge in the long-term, but I am rather confident by > >> saying that the situation looks better here, at least. > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > different now, so it must be correct" must be the weakest proof of > > correctness I've heard of! > > Err, okay. I did use the word "stable" with tests rather than > "correct", and I implied upthread that I did not check the correctness > nor the internals of the patch. If my words held the meaning you > are implying, well, my apologies for the confusion, I guess. I only > tested the patch and it was stable while I've noticed a few diffs with > the previous version, but I did *not* check its internals at all, nor > do I mean that I endorse its logic. I hope that's clear now. Got it. Michael, thank you very much for your help. ------ Regards, Alexander Korotkov Supabase
On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2025-Feb-28, Michael Paquier wrote: > > > > > Saying that, I have also done similar tests with your v12 for a couple > > > of hours and this looks stable under installcheck-world. I can see > > > that you've reworked quite a bit the surroundings of InitializedFrom > > > in this one. If you apply that once again at some point, the > > > buildfarm will be judge in the long-term, but I am rather confident by > > > saying that the situation looks better here, at least. > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > different now, so it must be correct" must be the weakest proof of > > correctness I've heard of! > > Michael just volunteered to help Yura and me with testing. He wan't > intended to be reviewer. And he reported that tests looks much more > stable now. I think he is absolutely correct with this. Nevertheless, I don't think the bug is gone in v12. I managed to reproduce it on my local Raspberry Pi 4. The attached version of the patch fixes the bug for me. It adds memory barriers surrounding pg_atomic_compare_exchange_u64(). That is certainly not right, given that this function should already provide full memory barrier semantics. But my investigation shows it doesn't. I'm going to start a separate thread about this. Also, the new version of the patch contains a fix for a potential integer overflow during the OldPageRqstPtr computation, sent to me off-list by Yura. ------ Regards, Alexander Korotkov Supabase
Attachment
On Fri, Mar 7, 2025 at 5:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > On 2025-Feb-28, Michael Paquier wrote: > > > > > > > Saying that, I have also done similar tests with your v12 for a couple > > > > of hours and this looks stable under installcheck-world. I can see > > > > that you've reworked quite a bit the surroundings of InitializedFrom > > > > in this one. If you apply that once again at some point, the > > > > buildfarm will be judge in the long-term, but I am rather confident by > > > > saying that the situation looks better here, at least. > > > > > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks > > > different now, so it must be correct" must be the weakest proof of > > > correctness I've heard of! > > > > Michael just volunteered to help Yura and me with testing. He wan't > > intended to be reviewer. And he reported that tests looks much more > > stable now. I think he is absolutely correct with this. > > Nevertheless, I don't think the bug has gone in v12. I managed to > reproduce it on my local Raspberry PI 4. The attached version of > patch fixes the bug for me. It adds memory barriers surrounding > pg_atomic_compare_exchange_u64(). That certainly not right given this > function should already provide full memory barrier semantics. But my > investigation shows it doesn't. I'm going to start a separate thread > about this. > > Also, new version of patch contains fix of potential integer overflow > during OldPageRqstPtr computation sent off-list my me by Yura. So, as we finally clarified, CAS doesn't guarantee a full memory barrier on failure. Also, it's not clear when barriers are guaranteed on success. On ARM without an LSE implementation, a read barrier is provided before the change of value and a write barrier after the change of value. 
So, it appears that both explicit barriers I've added are required. This revision also comes with a formal proof of the algorithm. ------ Regards, Alexander Korotkov Supabase
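The ordering subtlety above can be illustrated with C11 atomics, where the failure path of a compare-exchange may legally use a weaker memory order. The following is a hedged sketch of the barrier placement only, not the actual xlog.c code: pg_atomic_compare_exchange_u64() is modeled with stdatomic, and the function name is illustrative. Using relaxed orders for both paths and explicit fences makes the required barriers visible instead of relying on the CAS itself:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of advancing a shared "initialized up to" position.  A failed
 * compare-exchange need not act as a full barrier (and on success the
 * ordering relative to surrounding loads/stores can be weaker than a
 * full barrier on some architectures), hence the explicit fences that
 * model the barriers added in the patch. */
bool
advance_initialized_up_to(_Atomic uint64_t *pos, uint64_t oldval, uint64_t newval)
{
	bool	ok;

	atomic_thread_fence(memory_order_seq_cst);	/* barrier before the CAS */
	ok = atomic_compare_exchange_strong_explicit(pos, &oldval, newval,
												 memory_order_relaxed,	/* success */
												 memory_order_relaxed);	/* failure */
	atomic_thread_fence(memory_order_seq_cst);	/* barrier after the CAS */
	return ok;
}
```

With the fences pulled out of the CAS like this, the "barrier on failure" question disappears: both outcomes are bracketed by full fences.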
Attachment
Hi, I've briefly looked at this patch this week, and done a bit of testing. I don't have any comments about the correctness - it does seem correct to me and I haven't noticed any crashes/issues, but I'm not familiar with the WALBufMappingLock enough to have insightful opinions. I have however decided to do a bit of benchmarking, to better understand the possible benefits of the change. I happen to have access to an Azure machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that can do ~1.5GB/s. The benchmark script (attached) uses the workload mentioned by Andres some time ago [1] SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated results look like this (this is throughput): | 8 | 64 | 1024 clients | master patched | master patched | master patched --------------------------------------------------------------------- 1 | 11864 12035 | 7419 7345 | 968 940 4 | 26311 26919 | 12414 12308 | 1304 1293 8 | 38742 39651 | 14316 14539 | 1348 1348 16 | 57299 59917 | 15405 15871 | 1304 1279 32 | 74857 82598 | 17589 17126 | 1233 1233 48 | 87596 95495 | 18616 18160 | 1199 1227 64 | 89982 97715 | 19033 18910 | 1196 1221 96 | 92853 103448 | 19694 19706 | 1190 1210 128 | 95392 103324 | 20085 19873 | 1188 1213 160 | 94933 102236 | 20227 20323 | 1180 1214 196 | 95933 103341 | 20448 20513 | 1188 1199 To put this into a perspective, this throughput relative to master: clients | 8 64 1024 ---------------------------------- 1 | 101% 99% 97% 4 | 102% 99% 99% 8 | 102% 102% 100% 16 | 105% 103% 98% 32 | 110% 97% 100% 48 | 109% 98% 102% 64 | 109% 99% 102% 96 | 111% 100% 102% 128 | 108% 99% 102% 160 | 108% 100% 103% 196 | 108% 100% 101% That does not seem like a huge improvement :-( Yes, there's 1-10% speedup for the small (8K) size, but for larger chunks it's a wash. Looking at the pgbench progress, I noticed stuff like this: ... 
progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed ... i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny things (it's a virtual SSD, I'm not sure what model is that behind the curtains). So I decided to try running the benchmark on tmpfs, to get the storage out of the way and get the "best case" results. 
This makes the pgbench progress perfectly "smooth" (no jumps like in the output above), and the comparison looks like this: | 8 | 64 | 1024 clients | master patched | master patched | master patched ---------|---------------------|--------------------|---------------- 1 | 32449 32032 | 19289 20344 | 3108 3081 4 | 68779 69256 | 24585 29912 | 2915 3449 8 | 79787 100655 | 28217 39217 | 3182 4086 16 | 113024 148968 | 42969 62083 | 5134 5712 32 | 125884 170678 | 44256 71183 | 4910 5447 48 | 125571 166695 | 44693 76411 | 4717 5215 64 | 122096 160470 | 42749 83754 | 4631 5103 96 | 120170 154145 | 42696 86529 | 4556 5020 128 | 119204 152977 | 40880 88163 | 4529 5047 160 | 116081 152708 | 42263 88066 | 4512 5000 196 | 115364 152455 | 40765 88602 | 4505 4952 and the comparison to master: clients 8 64 1024 ----------------------------------------- 1 99% 105% 99% 4 101% 122% 118% 8 126% 139% 128% 16 132% 144% 111% 32 136% 161% 111% 48 133% 171% 111% 64 131% 196% 110% 96 128% 203% 110% 128 128% 216% 111% 160 132% 208% 111% 196 132% 217% 110% Yes, with tmpfs the impact looks much more significant. For 8K the speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. That being said, I wonder how big is the impact for practical workloads. ISTM this workload is pretty narrow / extreme, it'd be much easier if we had an example of a more realistic workload, benefiting from this. Of course, it may be the case that there are multiple related bottlenecks, and we'd need to fix all of them - in which case it'd be silly to block the improvements on the grounds that it alone does not help. Another thought is that this is testing the "good case". Can anyone think of a workload that would be made worse by the patch? regards -- Tomas Vondra
Attachment
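The workload above (pg_logical_emit_message() driven by pgbench) can be reproduced with a driver script along these lines. This is a hedged sketch, not the attached script: the file name, duration, database name, and the guard on pgbench availability are illustrative choices.

```shell
#!/bin/sh
set -e
# Emit WAL records of a given payload size from N concurrent clients.
SIZE=${SIZE:-8192}          # payload bytes: 8192, 65536 or 1048576
CLIENTS=${CLIENTS:-64}

cat > emit.sql <<EOF
SELECT pg_logical_emit_message(true, 'test', repeat('0', ${SIZE}));
EOF

# -n skips vacuum (no pgbench tables needed), -P 1 prints the
# per-second progress lines shown above.
if command -v pgbench >/dev/null 2>&1; then
    pgbench -n -f emit.sql -c "$CLIENTS" -j "$CLIENTS" -T 30 -P 1 postgres
fi
```

Sweeping SIZE over 8192/65536/1048576 and CLIENTS over 1..196 would reproduce the grid of results quoted above.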
Good day, Tomas 14.03.2025 17:30, Tomas Vondra wrote: > > Yes, with tmpfs the impact looks much more significant. For 8K the > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. > > > That being said, I wonder how big is the impact for practical workloads. > ISTM this workload is pretty narrow / extreme, it'd be much easier if we > had an example of a more realistic workload, benefiting from this. Of > course, it may be the case that there are multiple related bottlenecks, > and we'd need to fix all of them - in which case it'd be silly to block > the improvements on the grounds that it alone does not help. Yes, I found this bottleneck when I did experiments with increasing NUM_XLOGINSERT_LOCKS [1]. For this patch to be more valuable, there should be more parallel xlog inserters. That is why I initially paired this patch with a patch that reduces contention on WALInsertLocks ("0002-Several attempts to lock WALInsertLock", last version at [2]). Certainly, the largest bottleneck is WALWriteLock around writing buffers and especially fsync-ing them after the write. But this intermediate bottleneck of WALBufMappingLock is also worth removing. [1] https://postgr.es/m/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru [2] https://postgr.es/m/c31158a3-7c26-4b26-90df-2df8f7bbe736%40postgrespro.ru ------- regards Yura Sokolov aka funny-falcon
Good day, 14.03.2025 17:30, Tomas Vondra wrote: > Hi, > > I've briefly looked at this patch this week, and done a bit of testing. > I don't have any comments about the correctness - it does seem correct > to me and I haven't noticed any crashes/issues, but I'm not familiar > with the WALBufMappingLock enough to have insightful opinions. > > I have however decided to do a bit of benchmarking, to better understand > the possible benefits of the change. I happen to have access to an Azure > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that > can do ~1.5GB/s. > > The benchmark script (attached) uses the workload mentioned by Andres > some time ago [1] > > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); > > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated > results look like this (this is throughput): > > | 8 | 64 | 1024 > clients | master patched | master patched | master patched > --------------------------------------------------------------------- > 1 | 11864 12035 | 7419 7345 | 968 940 > 4 | 26311 26919 | 12414 12308 | 1304 1293 > 8 | 38742 39651 | 14316 14539 | 1348 1348 > 16 | 57299 59917 | 15405 15871 | 1304 1279 > 32 | 74857 82598 | 17589 17126 | 1233 1233 > 48 | 87596 95495 | 18616 18160 | 1199 1227 > 64 | 89982 97715 | 19033 18910 | 1196 1221 > 96 | 92853 103448 | 19694 19706 | 1190 1210 > 128 | 95392 103324 | 20085 19873 | 1188 1213 > 160 | 94933 102236 | 20227 20323 | 1180 1214 > 196 | 95933 103341 | 20448 20513 | 1188 1199 > > To put this into a perspective, this throughput relative to master: > > clients | 8 64 1024 > ---------------------------------- > 1 | 101% 99% 97% > 4 | 102% 99% 99% > 8 | 102% 102% 100% > 16 | 105% 103% 98% > 32 | 110% 97% 100% > 48 | 109% 98% 102% > 64 | 109% 99% 102% > 96 | 111% 100% 102% > 128 | 108% 99% 102% > 160 | 108% 100% 103% > 196 | 108% 100% 101% > > That does not seem like a huge improvement :-( Yes, there's 1-10% > speedup for the small (8K) size, but for larger 
chunks it's a wash. > > Looking at the pgbench progress, I noticed stuff like this: > > ... > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed > ... > > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny > things (it's a virtual SSD, I'm not sure what model is that behind the > curtains). So I decided to try running the benchmark on tmpfs, to get > the storage out of the way and get the "best case" results. 
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the > output above), and the comparison looks like this: > > | 8 | 64 | 1024 > clients | master patched | master patched | master patched > ---------|---------------------|--------------------|---------------- > 1 | 32449 32032 | 19289 20344 | 3108 3081 > 4 | 68779 69256 | 24585 29912 | 2915 3449 > 8 | 79787 100655 | 28217 39217 | 3182 4086 > 16 | 113024 148968 | 42969 62083 | 5134 5712 > 32 | 125884 170678 | 44256 71183 | 4910 5447 > 48 | 125571 166695 | 44693 76411 | 4717 5215 > 64 | 122096 160470 | 42749 83754 | 4631 5103 > 96 | 120170 154145 | 42696 86529 | 4556 5020 > 128 | 119204 152977 | 40880 88163 | 4529 5047 > 160 | 116081 152708 | 42263 88066 | 4512 5000 > 196 | 115364 152455 | 40765 88602 | 4505 4952 > > and the comparison to master: > > clients 8 64 1024 > ----------------------------------------- > 1 99% 105% 99% > 4 101% 122% 118% > 8 126% 139% 128% > 16 132% 144% 111% > 32 136% 161% 111% > 48 133% 171% 111% > 64 131% 196% 110% > 96 128% 203% 110% > 128 128% 216% 111% > 160 132% 208% 111% > 196 132% 217% 110% > > Yes, with tmpfs the impact looks much more significant. For 8K the > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x. > > > That being said, I wonder how big is the impact for practical workloads. > ISTM this workload is pretty narrow / extreme, it'd be much easier if we > had an example of a more realistic workload, benefiting from this. Of > course, it may be the case that there are multiple related bottlenecks, > and we'd need to fix all of them - in which case it'd be silly to block > the improvements on the grounds that it alone does not help. > > Another thought is that this is testing the "good case". Can anyone > think of a workload that would be made worse by the patch? I've made similar benchmark on system with two Xeon Gold 5220R with two Samsung SSD 970 PRO 1TB mirrored by md. 
Configuration changes:

wal_sync_method = open_datasync
full_page_writes = off
synchronous_commit = off
checkpoint_timeout = 1d
max_connections = 1000
max_wal_size = 4GB
min_wal_size = 640MB

I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB and 1GB) and record size (1kB, 8kB and 64kB). (I didn't bench 1MB record size, since I don't believe it is critical for performance.)

Here are the results for 64MB segment size and 1GB wal_buffers:

+---------+---------+------------+--------------+----------+
| recsize | clients | master_tps | nowalbuf_tps | rel_perf |
+---------+---------+------------+--------------+----------+
|       1 |       1 |    47991.0 |      46995.0 |     0.98 |
|       1 |       4 |   171930.0 |     171166.0 |      1.0 |
|       1 |      16 |   491240.0 |     485132.0 |     0.99 |
|       1 |      64 |   514590.0 |     515534.0 |      1.0 |
|       1 |     128 |   547222.0 |     543543.0 |     0.99 |
|       1 |     256 |   543353.0 |     540802.0 |      1.0 |
|       8 |       1 |    40976.0 |      41603.0 |     1.02 |
|       8 |       4 |    89003.0 |      92008.0 |     1.03 |
|       8 |      16 |    90457.0 |      92282.0 |     1.02 |
|       8 |      64 |    89293.0 |      92022.0 |     1.03 |
|       8 |     128 |    92687.0 |      92768.0 |      1.0 |
|       8 |     256 |    91874.0 |      91665.0 |      1.0 |
|      64 |       1 |    11829.0 |      12031.0 |     1.02 |
|      64 |       4 |    11959.0 |      12832.0 |     1.07 |
|      64 |      16 |    11331.0 |      13417.0 |     1.18 |
|      64 |      64 |    11108.0 |      13588.0 |     1.22 |
|      64 |     128 |    11089.0 |      13648.0 |     1.23 |
|      64 |     256 |    10381.0 |      13542.0 |      1.3 |
+---------+---------+------------+--------------+----------+

Numbers for all configurations are in the attached 'improvements.out'. It shows that removing WALBufMappingLock almost always doesn't harm performance and usually gives a measurable gain. (Numbers are averages of the 4 middle runs out of 6, i.e. I threw away the minimum and maximum tps of the 6 runs and took the average of the remaining ones.) An sqlite database with all results is also attached. It also contains results for the patch "Several attempts to lock WALInsertLock" (named "attempts") and the cumulative patch ("nowalbuf-attempts"). Surprisingly, "Several attempts" causes a measurable impact in some configurations with hundreds of clients. 
So, there're more bottlenecks ahead )) Yes, it is still not a "real-world" benchmark. But it at least shows the patch is harmless. -- regards Yura Sokolov aka funny-falcon
Attachment
On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > 14.03.2025 17:30, Tomas Vondra wrote: > > Hi, > > > > I've briefly looked at this patch this week, and done a bit of testing. > > I don't have any comments about the correctness - it does seem correct > > to me and I haven't noticed any crashes/issues, but I'm not familiar > > with the WALBufMappingLock enough to have insightful opinions. > > > > I have however decided to do a bit of benchmarking, to better understand > > the possible benefits of the change. I happen to have access to an Azure > > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that > > can do ~1.5GB/s. > > > > The benchmark script (attached) uses the workload mentioned by Andres > > some time ago [1] > > > > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); > > > > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated > > results look like this (this is throughput): > > > > | 8 | 64 | 1024 > > clients | master patched | master patched | master patched > > --------------------------------------------------------------------- > > 1 | 11864 12035 | 7419 7345 | 968 940 > > 4 | 26311 26919 | 12414 12308 | 1304 1293 > > 8 | 38742 39651 | 14316 14539 | 1348 1348 > > 16 | 57299 59917 | 15405 15871 | 1304 1279 > > 32 | 74857 82598 | 17589 17126 | 1233 1233 > > 48 | 87596 95495 | 18616 18160 | 1199 1227 > > 64 | 89982 97715 | 19033 18910 | 1196 1221 > > 96 | 92853 103448 | 19694 19706 | 1190 1210 > > 128 | 95392 103324 | 20085 19873 | 1188 1213 > > 160 | 94933 102236 | 20227 20323 | 1180 1214 > > 196 | 95933 103341 | 20448 20513 | 1188 1199 > > > > To put this into a perspective, this throughput relative to master: > > > > clients | 8 64 1024 > > ---------------------------------- > > 1 | 101% 99% 97% > > 4 | 102% 99% 99% > > 8 | 102% 102% 100% > > 16 | 105% 103% 98% > > 32 | 110% 97% 100% > > 48 | 109% 98% 102% > > 64 | 109% 99% 102% > > 96 | 111% 100% 102% > > 128 | 108% 99% 102% 
> > 160 | 108% 100% 103% > > 196 | 108% 100% 101% > > > > That does not seem like a huge improvement :-( Yes, there's 1-10% > > speedup for the small (8K) size, but for larger chunks it's a wash. > > > > Looking at the pgbench progress, I noticed stuff like this: > > > > ... > > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed > > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed > > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed > > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed > > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed > > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed > > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed > > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed > > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed > > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed > > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed > > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed > > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed > > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed > > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed > > ... > > > > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny > > things (it's a virtual SSD, I'm not sure what model is that behind the > > curtains). So I decided to try running the benchmark on tmpfs, to get > > the storage out of the way and get the "best case" results. 
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> >          |        8        |       64        |      1024
> >  clients | master  patched | master  patched | master  patched
> > ---------|-----------------|-----------------|----------------
> >        1 |  32449    32032 |  19289    20344 |   3108     3081
> >        4 |  68779    69256 |  24585    29912 |   2915     3449
> >        8 |  79787   100655 |  28217    39217 |   3182     4086
> >       16 | 113024   148968 |  42969    62083 |   5134     5712
> >       32 | 125884   170678 |  44256    71183 |   4910     5447
> >       48 | 125571   166695 |  44693    76411 |   4717     5215
> >       64 | 122096   160470 |  42749    83754 |   4631     5103
> >       96 | 120170   154145 |  42696    86529 |   4556     5020
> >      128 | 119204   152977 |  40880    88163 |   4529     5047
> >      160 | 116081   152708 |  42263    88066 |   4512     5000
> >      196 | 115364   152455 |  40765    88602 |   4505     4952
> >
> > and the comparison to master:
> >
> >  clients      8     64   1024
> > -----------------------------
> >        1    99%   105%    99%
> >        4   101%   122%   118%
> >        8   126%   139%   128%
> >       16   132%   144%   111%
> >       32   136%   161%   111%
> >       48   133%   171%   111%
> >       64   131%   196%   110%
> >       96   128%   203%   110%
> >      128   128%   216%   111%
> >      160   132%   208%   111%
> >      196   132%   217%   110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks,
> > and we'd need to fix all of them - in which case it'd be silly to block
> > the improvements on the grounds that this one alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
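[Editor's note: Tomas's benchmark script is attached to his mail and not reproduced in the thread. A driver along these lines could generate such a sweep; the script file name, run duration, and exact pgbench flags below are assumptions for illustration, not taken from the attachment.]

```python
# Hypothetical sketch of a driver for the sweep above; it only builds the
# pgbench command lines for each (clients, size) combination. The custom
# transaction script (assumed to be "emit.sql") would contain the single
# statement:
#   SELECT pg_logical_emit_message(true, 'test', repeat('0', :SIZE));
import shlex

def build_pgbench_cmd(clients: int, size: int,
                      script: str = "emit.sql", duration: int = 30) -> str:
    args = ["pgbench",
            "-n",                  # skip vacuum (no standard pgbench tables)
            "-f", script,          # custom transaction script
            "-c", str(clients),    # number of client connections
            "-j", str(clients),    # one worker thread per client
            "-T", str(duration),   # run time in seconds (assumed value)
            "-P", "1",             # per-second progress lines, as quoted above
            "-D", f"SIZE={size}"]  # exposed to the script as :SIZE
    return shlex.join(args)

if __name__ == "__main__":
    for size in (8192, 65536, 1048576):      # 8K, 64K, 1024K payloads
        for clients in (1, 4, 8, 16, 32, 48, 64, 96, 128, 160, 196):
            print(build_pgbench_cmd(clients, size))
```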
>
> I've made a similar benchmark on a system with two Xeon Gold 5220R and two
> Samsung SSD 970 PRO 1TB mirrored by md.
>
> Configuration changes:
>
>   wal_sync_method = open_datasync
>   full_page_writes = off
>   synchronous_commit = off
>   checkpoint_timeout = 1d
>   max_connections = 1000
>   max_wal_size = 4GB
>   min_wal_size = 640MB
>
> I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB
> and 1GB) and the record size (1kB, 8kB and 64kB).
>
> (I didn't bench the 1MB record size, since I don't believe it is critical
> for performance.)
>
> Here are the results for 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> |       1 |       1 |    47991.0 |      46995.0 |     0.98 |
> |       1 |       4 |   171930.0 |     171166.0 |     1.0  |
> |       1 |      16 |   491240.0 |     485132.0 |     0.99 |
> |       1 |      64 |   514590.0 |     515534.0 |     1.0  |
> |       1 |     128 |   547222.0 |     543543.0 |     0.99 |
> |       1 |     256 |   543353.0 |     540802.0 |     1.0  |
> |       8 |       1 |    40976.0 |      41603.0 |     1.02 |
> |       8 |       4 |    89003.0 |      92008.0 |     1.03 |
> |       8 |      16 |    90457.0 |      92282.0 |     1.02 |
> |       8 |      64 |    89293.0 |      92022.0 |     1.03 |
> |       8 |     128 |    92687.0 |      92768.0 |     1.0  |
> |       8 |     256 |    91874.0 |      91665.0 |     1.0  |
> |      64 |       1 |    11829.0 |      12031.0 |     1.02 |
> |      64 |       4 |    11959.0 |      12832.0 |     1.07 |
> |      64 |      16 |    11331.0 |      13417.0 |     1.18 |
> |      64 |      64 |    11108.0 |      13588.0 |     1.22 |
> |      64 |     128 |    11089.0 |      13648.0 |     1.23 |
> |      64 |     256 |    10381.0 |      13542.0 |     1.3  |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'. It
> shows that removing WALBufMappingLock almost never harms performance and
> usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I threw away
> the minimum and maximum tps of the 6 runs and averaged the rest.)
>
> An sqlite database with all results is also attached. It also contains
> results for the patch "Several attempts to lock WALInsertLock" (named
> "attempts") and the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" causes a measurable impact in some
> configurations with hundreds of clients. So there are more bottlenecks
> ahead ))
>
> Yes, it is still not a "real-world" benchmark. But it at least shows the
> patch is harmless.

Thank you for your experiments. Your results show up to a 30% speedup on
real hardware, not tmpfs. While this is still a corner case, I think that's
quite a result for a pretty local optimization.

At small connection counts there are cases both above and below 1.0. I
think this is due to statistical noise: if we calculate the average tps
ratio across the different experiments, it is still above 1.0 even for low
client counts.

sqlite> select clients, avg(ratio)
   ...> from (select walseg, walbuf, recsize, clients,
   ...>              (avg(tps) filter (where branch = 'nowalbuf')) /
   ...>              (avg(tps) filter (where branch = 'master')) as ratio
   ...>       from results
   ...>       where branch in ('master', 'nowalbuf')
   ...>       group by walseg, walbuf, recsize, clients) x
   ...> group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if there are no objections.
I think the second one ("Several attempts") still needs more work, as there
are regressions.

------
Regards,
Alexander Korotkov
Supabase
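[Editor's note: Yura's aggregation rule quoted above — drop the best and worst of six runs and average the middle four — is a simple trimmed mean. A minimal illustration, with invented tps values:]

```python
def middle_mean(runs, drop=1):
    """Average after discarding the `drop` lowest and `drop` highest values."""
    kept = sorted(runs)[drop:len(runs) - drop]
    return sum(kept) / len(kept)

# Six hypothetical tps measurements for one configuration; the extreme
# values (12900.0 and 13700.0) are discarded before averaging.
runs = [13390.0, 13480.0, 13510.0, 13545.0, 13700.0, 12900.0]
print(middle_mean(runs))  # -> 13481.25
```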