Thread: pg_atomic_compare_exchange_*() and memory barriers
Hi all,
While investigating a bug in the patch to get rid of WALBufMappingLock, I found that surrounding pg_atomic_compare_exchange_u64() with explicit memory barriers fixes the problem for me. That doesn't feel right because, according to the comments, both pg_atomic_compare_exchange_u32() and pg_atomic_compare_exchange_u64() should provide full memory barrier semantics. So, I decided to investigate this further.
In my case, the implementation of pg_atomic_compare_exchange_u64() is based on __atomic_compare_exchange_n().
static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
uint64 *expected, uint64 newval)
{
AssertPointerAlignment(expected, 8);
return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
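(Aside, for readers who don't use this wrapper often: on failure the builtin writes the value it actually observed back into *expected, so callers normally retry in a loop. A hedged usage sketch, with a hypothetical counter variable:)

/* hypothetical counter; pg_atomic_init_u64(&var, 0) done elsewhere */
uint64		expected = pg_atomic_read_u64(&var);

while (!pg_atomic_compare_exchange_u64(&var, &expected, expected + 1))
{
	/* on failure, 'expected' now holds the observed value; just retry */
}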
According to the docs, __ATOMIC_SEQ_CST enforces total ordering with *all other __ATOMIC_SEQ_CST operations*. It only speaks of other __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. This sounds quite far from our comment promising full barrier semantics, which, in my understanding, means the equivalent of both pg_read_barrier()/pg_write_barrier(), i.e. ordering against *all other reads/writes*.
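(For context, "full barrier semantics" elsewhere in PostgreSQL boils down to the barrier macros, which the generic GCC implementation in src/include/port/atomics/generic-gcc.h defines roughly as follows -- a sketch from memory, quoted only to show what the comment promises:)

/* roughly what the generic GCC fallback provides (sketch, not verbatim) */
#define pg_memory_barrier_impl()	__atomic_thread_fence(__ATOMIC_SEQ_CST)
#define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
#define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)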
I've found that __atomic_compare_exchange_n() ends up calling __aarch64_cas8_acq_rel, which disassembles as follows.
0000000000000860 <__aarch64_cas8_acq_rel>:
860: d503245f bti c
864: 90000110 adrp x16, 20000 <__data_start>
868: 3940a210 ldrb w16, [x16, #40]
86c: 34000070 cbz w16, 878 <__aarch64_cas8_acq_rel+0x18>
870: c8e0fc41 casal x0, x1, [x2]
874: d65f03c0 ret
878: aa0003f0 mov x16, x0
87c: c85ffc40 ldaxr x0, [x2]
880: eb10001f cmp x0, x16
884: 54000061 b.ne 890 <__aarch64_cas8_acq_rel+0x30> // b.any
888: c811fc41 stlxr w17, x1, [x2]
88c: 35ffff91 cbnz w17, 87c <__aarch64_cas8_acq_rel+0x1c>
890: d65f03c0 ret
This function first checks whether LSE instructions are present. If so, the LSE instruction casal is used. If not (as in my case), a loop of ldaxr/stlxr is used instead. The documentation [2] says ldaxr and stlxr each provide a one-way barrier: reads/writes after ldaxr will be observed after it, and reads/writes before stlxr will be observed before it. So a pair of these instructions provides a full memory barrier. However, if we don't observe the expected value, only ldaxr gets executed. So an unsuccessful pg_atomic_compare_exchange_u64() attempt leaves us with only a one-way barrier, and that is what caused the bug in my case. In contrast, __sync_val_compare_and_swap() uses __aarch64_cas8_sync, which is much the same as __aarch64_cas8_acq_rel but has an explicit memory barrier at the end.
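(For anyone who wants to reproduce the comparison, a stand-alone probe along these lines can be compiled with gcc -O2 -S on aarch64 and the emitted code, or the libgcc helpers it calls, inspected. The file and function names are made up for illustration:)

/* cas_probe.c -- illustrative only; compile with: gcc -O2 -S cas_probe.c */
#include <stdint.h>
#include <stdbool.h>

bool
cas_builtin(uint64_t *ptr, uint64_t *expected, uint64_t newval)
{
	/* same memory orders as pg_atomic_compare_exchange_u64_impl() above */
	return __atomic_compare_exchange_n(ptr, expected, newval, false,
									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

uint64_t
cas_legacy(uint64_t *ptr, uint64_t oldval, uint64_t newval)
{
	/*
	 * Legacy __sync builtin; per the above it ends up in __aarch64_cas8_sync,
	 * which issues a trailing full barrier.
	 */
	return __sync_val_compare_and_swap(ptr, oldval, newval);
}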
I have a couple of ideas on how to fix this problem.
1. Provide full barrier semantics for pg_atomic_compare_exchange_*(). In particular, for __atomic_compare_exchange_n(), we should probably use __ATOMIC_ACQ_REL for success and run an explicit memory barrier in the case of failure (see the sketch after this list).
2. Document that pg_atomic_compare_exchange_*() doesn't necessarily provide a memory barrier on failure. But then we need to carefully check if existing use-cases need explicit memory barriers now.
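(To make option 1 concrete, here is a minimal sketch of what the GCC implementation could look like. This is not an actual patch; whether the explicit barrier should be pg_memory_barrier() or something weaker, and whether __ATOMIC_ACQ_REL is really preferable to __ATOMIC_SEQ_CST on success, is exactly what needs discussion:)

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
									uint64 *expected, uint64 newval)
{
	bool		ret;

	AssertPointerAlignment(expected, 8);
	ret = __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
									  __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
	if (!ret)
	{
		/* on failure, fall back to an explicit full barrier */
		__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* i.e. pg_memory_barrier() */
	}
	return ret;
}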
Any thoughts?
2. https://developer.arm.com/documentation/100941/0101/Barriers
------
Regards,
Alexander Korotkov
Supabase
Hi,

On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:
> While investigating a bug in the patch to get rid of WALBufMappingLock, I found that the surrounding pg_atomic_compare_exchange_u64() fixes the problem for me.

That sounds more likely to be due to slowing down things enough to make a race less likely to be hit. Or a compiler bug.

> That doesn't feel right because, according to the comments, both pg_atomic_compare_exchange_u32() and pg_atomic_compare_exchange_u64() should provide full memory barrier semantics. So, I decided to investigate this further.
>
> In my case, the implementation of pg_atomic_compare_exchange_u64() is based on __atomic_compare_exchange_n().
>
> static inline bool
> pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
>                                     uint64 *expected, uint64 newval)
> {
>     AssertPointerAlignment(expected, 8);
>     return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
>                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> }
>
> According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all other __ATOMIC_SEQ_CST operations*. It only says about other __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. This sounds quite far from our comment, promising full barrier semantics, which, in my understanding, means the equivalent of both pg_read_barrier()/pg_write_barrier(), which should barrier *all other read/writes*.

A memory barrier in one process/thread always needs to be paired with a barrier in another process/thread. It's not enough to have a memory barrier on one side but not the other.

__ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those it's more clearly formulated that they include acquire/release semantics. The standard formulation is a bit more complicated IIRC, but here's the cppreference.com simplification:

https://en.cppreference.com/w/c/atomic/memory_order
> A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order (see Sequentially-consistent ordering below).

> This function first checks if LSE instructions are present. If so, instruction LSE casal is used. If not (in my case), the loop of ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr provides one-way barriers. Read/writes after ldaxr will be observed after, and read/writes before stlxr will be observed before. So, a pair of these instructions provides a full memory barrier. However, if we don't observe the expected value, only ldaxr gets executed. So, an unsuccessful pg_atomic_compare_exchange_u64() attempt will leave us with a one-way barrier, and that caused a bug in my case.

That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both success_memorder *and* failure_memorder.

What compiler & version is this?

> I have a couple of ideas on how to fix this problem.
> 1. Provide full barrier semantics for pg_atomic_compare_exchange_*(). In particular, for __atomic_compare_exchange_n(), we should probably use __ATOMIC_ACQ_REL for success and run an explicit memory barrier in the case of failure.

I don't follow - __ATOMIC_SEQ_CST is *stronger* than __ATOMIC_ACQ_REL.

Greetings,

Andres Freund
Hi, Andres. Thank you for reply. On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > problem for me. > > That sounds more likely to be due to slowing down things enough to make a race > less likely to be hit. Or a compiler bug. > > > > That doesn't feel right because, according to the > > comments, both pg_atomic_compare_exchange_u32() and > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > semantics. So, I decided to investigate this further. > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > on __atomic_compare_exchange_n(). > > > > static inline bool > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > uint64 *expected, uint64 newval) > > { > > AssertPointerAlignment(expected, 8); > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > } > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > other __ATOMIC_SEQ_CST operations*. It only says about other > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > This sounds quite far from our comment, promising full barrier semantics, > > which, in my understanding, means the equivalent of > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > read/writes*. > > A memory barrier in one process/thread always needs be paired with a barrier > in another process/thread. It's not enough to have a memory barrier on one > side but not the other. Yes, there surely should be a memory barrier on another side. But does __ATOMIC_SEQ_CST works as a barrier for the regular read/write operations on the same side? > __ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those > it's more clearly formulated that that include acquire/release semantics. The > standard formulation is a bit more complicated IIRC, but here's the > cppreference.com simplification: > > https://en.cppreference.com/w/c/atomic/memory_order > > A load operation with this memory order performs an acquire operation, a > > store performs a release operation, and read-modify-write performs both an > > acquire operation and a release operation, plus a single total order exists > > in which all threads observe all modifications in the same order (see > > Sequentially-consistent ordering below). Thank you. This is not yet clear for me. I probably need to meditate more on this :) > > > This function first checks if LSE instructions are present. If so, > > instruction LSE casal is used. If not (in my case), the loop of > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr > > provides one-way barriers. Read/writes after ldaxr will be observed after, > > and read/writes before stlxr will be observed before. So, a pair of these > > instructions provides a full memory barrier. However, if we don't observe > > the expected value, only ldaxr gets executed. So, an unsuccessful > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way > > barrier, and that caused a bug in my case. > > That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both > success_memorder *and* failure_memorder. > > What compiler & version is this? I've tried and got the same results with two compilers. 
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2) ------ Regards, Alexander Korotkov Supabase
Hi, On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote: > On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > > problem for me. > > > > That sounds more likely to be due to slowing down things enough to make a race > > less likely to be hit. Or a compiler bug. > > > > > > > That doesn't feel right because, according to the > > > comments, both pg_atomic_compare_exchange_u32() and > > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > > semantics. So, I decided to investigate this further. > > > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > > on __atomic_compare_exchange_n(). > > > > > > static inline bool > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > > uint64 *expected, uint64 newval) > > > { > > > AssertPointerAlignment(expected, 8); > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > > } > > > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > > other __ATOMIC_SEQ_CST operations*. It only says about other > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > > This sounds quite far from our comment, promising full barrier semantics, > > > which, in my understanding, means the equivalent of > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > > read/writes*. > > > > A memory barrier in one process/thread always needs be paired with a barrier > > in another process/thread. It's not enough to have a memory barrier on one > > side but not the other. > > Yes, there surely should be a memory barrier on another side. But > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write > operations on the same side? Yes, if it's paired with another barrier on the other side. The regular read/write operations are basically protected transitively, due to a) An acquire barrier preventing non-atomic reads/writes in the same thread from appearing to have been moved to before the barrier b) A release barrier preventing non-atomic reads/writes in the same thread from appearing to have been moved to after the barrier c) The other thread being guaranteed a) and b) for the other threads' non-atomic reads/writes depending on the type of barrier d) The atomic value itself being guaranteed to be, well, atomic > > > This function first checks if LSE instructions are present. If so, > > > instruction LSE casal is used. If not (in my case), the loop of > > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr > > > provides one-way barriers. Read/writes after ldaxr will be observed after, > > > and read/writes before stlxr will be observed before. So, a pair of these > > > instructions provides a full memory barrier. However, if we don't observe > > > the expected value, only ldaxr gets executed. So, an unsuccessful > > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way > > > barrier, and that caused a bug in my case. > > > > That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both > > success_memorder *and* failure_memorder. > > > > What compiler & version is this? > > I've tried and got the same results with two compilers. 
> gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
> Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2)

Thinking more about it I wonder if the behaviour of not doing a release in case the atomic fails *might* arguably actually be correct - if the compare-exchange fails, there's nothing that the non-atomic values could be ordered against.

What is the access pattern and the observed problems with it that made you look at the disassembly?

Greetings,

Andres Freund
Hi, Alexander and Andres! On Fri, 7 Mar 2025 at 20:40, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Hi, Andres. > > Thank you for reply. > > On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > > problem for me. > > > > That sounds more likely to be due to slowing down things enough to make a race > > less likely to be hit. Or a compiler bug. > > > > > > > That doesn't feel right because, according to the > > > comments, both pg_atomic_compare_exchange_u32() and > > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > > semantics. So, I decided to investigate this further. > > > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > > on __atomic_compare_exchange_n(). > > > > > > static inline bool > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > > uint64 *expected, uint64 newval) > > > { > > > AssertPointerAlignment(expected, 8); > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > > } > > > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > > other __ATOMIC_SEQ_CST operations*. It only says about other > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > > This sounds quite far from our comment, promising full barrier semantics, > > > which, in my understanding, means the equivalent of > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > > read/writes*. > > > > A memory barrier in one process/thread always needs be paired with a barrier > > in another process/thread. It's not enough to have a memory barrier on one > > side but not the other. > > Yes, there surely should be a memory barrier on another side. But > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write > operations on the same side? According to a reference posted by Andres [1]: "Within a thread of execution, accesses (reads and writes) through volatile lvalues cannot be reordered past observable side-effects (including other volatile accesses) that are separated by a sequence point within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization." Also: "as soon as atomic operations that are not tagged memory_order_seq_cst enter the picture, the sequential consistency is lost" So I think Alexander is right in the original post. The order of everything not marked as ATOMIC_SEQ_CST (atomic or non-atomic operations) compared to atomic operations marked by ATOMIC_SEQ_CST is not warranted. > > __ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those > > it's more clearly formulated that that include acquire/release semantics. 
The > > standard formulation is a bit more complicated IIRC, but here's the > > cppreference.com simplification: > > > > https://en.cppreference.com/w/c/atomic/memory_order > > > A load operation with this memory order performs an acquire operation, a > > > store performs a release operation, and read-modify-write performs both an > > > acquire operation and a release operation, plus a single total order exists > > > in which all threads observe all modifications in the same order (see > > > Sequentially-consistent ordering below). > > > Thank you. This is not yet clear for me. I probably need to meditate > more on this :) I think __ATOMIC_ACQ_REL is needed for success, and __ATOMIC_ACQUIRE might be sufficient for failure as failure of __atomic_compare_exchange_n contains no modification of atomic variable, so we want to see all modifications from the concurrent operations, but will not enforce them to see any "our" change. [2] Maybe I'm too optimistic here and having a full memory barrier is better. Thank you very much for considering this issue! I think the right operation of atomic primitives is both very very important and also hard to check thoroughly. I think the test that could reproduce the reordering of atomic and non-atomic operations could be very useful in regression set. [1] https://en.cppreference.com/w/c/atomic/memory_order#Sequentially-consistent_ordering [2] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html Best regards, Pavel Borisov Supabase
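(Spelled out, the success/failure memory-order combination Pavel suggests above would look roughly like this in the implementation -- a sketch for discussion, not a patch posted to this thread:)

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
									uint64 *expected, uint64 newval)
{
	AssertPointerAlignment(expected, 8);
	/* acquire+release on success, acquire only on failure */
	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
									   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}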
On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote:
> > On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:
> > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I
> > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the
> > > > problem for me.
> > >
> > > That sounds more likely to be due to slowing down things enough to make a race
> > > less likely to be hit. Or a compiler bug.
> > >
> > >
> > > > That doesn't feel right because, according to the
> > > > comments, both pg_atomic_compare_exchange_u32() and
> > > > pg_atomic_compare_exchange_u64() should provide full memory barrier
> > > > semantics. So, I decided to investigate this further.
> > > >
> > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based
> > > > on __atomic_compare_exchange_n().
> > > >
> > > > static inline bool
> > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
> > > > uint64 *expected, uint64 newval)
> > > > {
> > > > AssertPointerAlignment(expected, 8);
> > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
> > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> > > > }
> > > >
> > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all
> > > > other __ATOMIC_SEQ_CST operations*. It only says about other
> > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
> > > > This sounds quite far from our comment, promising full barrier semantics,
> > > > which, in my understanding, means the equivalent of
> > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other
> > > > read/writes*.
> > >
> > > A memory barrier in one process/thread always needs be paired with a barrier
> > > in another process/thread. It's not enough to have a memory barrier on one
> > > side but not the other.
> >
> > Yes, there surely should be a memory barrier on another side. But
> > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write
> > operations on the same side?
>
> Yes, if it's paired with another barrier on the other side. The regular
> read/write operations are basically protected transitively, due to
>
> a) An acquire barrier preventing non-atomic reads/writes in the same thread
> from appearing to have been moved to before the barrier
>
> b) A release barrier preventing non-atomic reads/writes in the same thread
> from appearing to have been moved to after the barrier
>
> c) The other thread being guaranteed a) and b) for the other threads'
> non-atomic reads/writes depending on the type of barrier
>
> d) The atomic value itself being guaranteed to be, well, atomic
Yes, looks good to me.
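As an illustration of a)-d), here is the classic acquire/release message-passing pattern they describe, as a minimal sketch using GCC builtins (this is not code from the patch, and the names are made up):

#include <stdint.h>
#include <stdbool.h>

static uint64_t payload;	/* ordinary, non-atomic data */
static uint64_t flag;		/* atomic flag both sides agree on */

/* writer: the release half of the pairing */
void
writer(void)
{
	payload = 42;									/* non-atomic write */
	__atomic_store_n(&flag, 1, __ATOMIC_RELEASE);	/* b): payload can't appear to move past this */
}

/* reader: the acquire half of the pairing */
bool
reader(uint64_t *out)
{
	if (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 1)	/* a): later reads can't appear to move before this */
	{
		*out = payload;		/* guaranteed to see 42 once the flag is seen */
		return true;
	}
	return false;
}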
> > > > This function first checks if LSE instructions are present. If so,
> > > > instruction LSE casal is used. If not (in my case), the loop of
> > > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr
> > > > provides one-way barriers. Read/writes after ldaxr will be observed after,
> > > > and read/writes before stlxr will be observed before. So, a pair of these
> > > > instructions provides a full memory barrier. However, if we don't observe
> > > > the expected value, only ldaxr gets executed. So, an unsuccessful
> > > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way
> > > > barrier, and that caused a bug in my case.
> > >
> > > That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both
> > > success_memorder *and* failure_memorder.
> > >
> > > What compiler & version is this?
> >
> > I've tried and got the same results with two compilers.
> > gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
> > Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2)
>
> Thinking more about it I wonder if the behaviour of not doing a release in
> case the atomic fails *might* arguably actually be correct - if the
> compare-exchange fails, there's nothing that the non-atomic values could be
> ordered against.
>
> What is the access pattern and the observed problems with it that made you
> look at the disassembly?
Check this code.
l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
/*
 * Try to advance XLogCtl->InitializedUpTo.
 *
 * If the CAS operation failed, then some of the previous pages are not
 * initialized yet, and this backend gives up.
 *
 * Since the initializer of the next page might give up on advancing
 * InitializedUpTo, this backend has to keep attempting to advance until it
 * finds a page "in the past" or a concurrent backend has succeeded at
 * advancing.  When we finish advancing XLogCtl->InitializedUpTo, we
 * notify all the waiters via XLogCtl->InitializedUpToCondVar.
 */
l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr))
{
NewPageBeginPtr = NewPageEndPtr;
NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
{
/*
 * The page at nextidx wasn't initialized yet, so we can't move
 * InitializedUpTo further.  It will be moved by the backend that
 * initializes nextidx.
 */
ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
break;
}
}
Consider the following execution order with process 1 (p1) and process 2 (p2).
1. p1 executes l1
2. p2 executes l2 with success
3. p1 executes l2 with failure
4. p2 executes l3, but doesn't see the result of step 1, because 3 didn't provide enough of a memory barrier
------
Regards,
Alexander Korotkov
Supabase
On Fri, Mar 7, 2025 at 7:15 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote: > > > On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > > > > problem for me. > > > > > > > > That sounds more likely to be due to slowing down things enough to make a race > > > > less likely to be hit. Or a compiler bug. > > > > > > > > > > > > > That doesn't feel right because, according to the > > > > > comments, both pg_atomic_compare_exchange_u32() and > > > > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > > > > semantics. So, I decided to investigate this further. > > > > > > > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > > > > on __atomic_compare_exchange_n(). > > > > > > > > > > static inline bool > > > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > > > > uint64 *expected, uint64 newval) > > > > > { > > > > > AssertPointerAlignment(expected, 8); > > > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > > > > } > > > > > > > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > > > > other __ATOMIC_SEQ_CST operations*. It only says about other > > > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > > > > This sounds quite far from our comment, promising full barrier semantics, > > > > > which, in my understanding, means the equivalent of > > > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > > > > read/writes*. > > > > > > > > A memory barrier in one process/thread always needs be paired with a barrier > > > > in another process/thread. It's not enough to have a memory barrier on one > > > > side but not the other. > > > > > > Yes, there surely should be a memory barrier on another side. But > > > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write > > > operations on the same side? > > > > Yes, if it's paired with another barrier on the other side. The regular > > read/write operations are basically protected transitively, due to > > > > a) An acquire barrier preventing non-atomic reads/writes in the same thread > > from appearing to have been moved to before the barrier > > > > b) A release barrier preventing non-atomic reads/writes in the same thread > > from appearing to have been moved to after the barrier > > > > c) The other thread being guaranteed a) and b) for the other threads' > > non-atomic reads/writes depending on the type of barrier > > > > d) The atomic value itself being guaranteed to be, well, atomic > > Yes, looks good to me. > > > > > > This function first checks if LSE instructions are present. If so, > > > > > instruction LSE casal is used. If not (in my case), the loop of > > > > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr > > > > > provides one-way barriers. Read/writes after ldaxr will be observed after, > > > > > and read/writes before stlxr will be observed before. So, a pair of these > > > > > instructions provides a full memory barrier. 
However, if we don't observe > > > > > the expected value, only ldaxr gets executed. So, an unsuccessful > > > > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way > > > > > barrier, and that caused a bug in my case. > > > > > > > > That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both > > > > success_memorder *and* failure_memorder. > > > > > > > > What compiler & version is this? > > > > > > I've tried and got the same results with two compilers. > > > gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 > > > Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2) > > > > Thinking more about it I wonder if the behaviour of not doing a release in > > case the atomic fails *might* arguably actually be correct - if the > > compare-exchange fails, there's nothing that the non-atomic values could be > > ordered against. > > > > What is the access pattern and the observed problems with it that made you > > look at the disassembly? > > Check this code. > > l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); > > /* > * Try to advance XLogCtl->InitializedUpTo. > * > * If the CAS operation failed, then some of previous pages are not > * initialized yet, and this backend gives up. > * > * Since initializer of next page might give up on advancing of > * InitializedUpTo, this backend have to attempt advancing until it > * find page "in the past" or concurrent backend succeeded at > * advancing. When we finish advancing XLogCtl->InitializedUpTo, we > * notify all the waiters with XLogCtl->InitializedUpToCondVar. > */ > l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr)) > { > NewPageBeginPtr = NewPageEndPtr; > NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; > nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); > > l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr) > { > /* > * Page at nextidx wasn't initialized yet, so we cann't move > * InitializedUpto further. It will be moved by backend which > * will initialize nextidx. > */ > ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); > break; > } > } > > Consider the following execution order with process 1 (p1) and process 2 (p2). > > 1. p1 executes l1 > 2. p2 executes l2 with success > 3. p1 executes l2 with failure > 4. p2 execute l3, but doesn't see the results of step 1, because 3 didn't provide enough of memory barrier Sorry, I messed this up. The correct sequence is following. 1. p1 executes l1 2. p1 executes l2 with failure 3. p2 executes l2 with success 4. p2 execute l3, but doesn't see the results of step 1, because 3 didn't provide enough of memory barrier Note that step 2 would be successful if it would be after step 3. ------ Regards, Alexander Korotkov Supabase
Hi,

On 2025-03-07 21:08:34 +0400, Pavel Borisov wrote:
> On Fri, 7 Mar 2025 at 20:40, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Yes, there surely should be a memory barrier on another side. But does __ATOMIC_SEQ_CST works as a barrier for the regular read/write operations on the same side?
>
> According to a reference posted by Andres [1]:
> "Within a thread of execution, accesses (reads and writes) through volatile lvalues cannot be reordered past observable side-effects (including other volatile accesses) that are separated by a sequence point within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization."

How is volatile relevant here?

> Also: "as soon as atomic operations that are not tagged memory_order_seq_cst enter the picture, the sequential consistency is lost"

Sequential consistency is lost, but that does *NOT* mean that acquire/release guarantees that are also guaranteed by ATOMIC_SEQ_CST are lost.

Greetings,

Andres Freund
Hi,

On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote:
> On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote:
> > What is the access pattern and the observed problems with it that made you look at the disassembly?
>
> Check this code.
>
> l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
>
> /*
>  * Try to advance XLogCtl->InitializedUpTo.
>  *
>  * If the CAS operation failed, then some of previous pages are not
>  * initialized yet, and this backend gives up.
>  *
>  * Since initializer of next page might give up on advancing of
>  * InitializedUpTo, this backend have to attempt advancing until it
>  * find page "in the past" or concurrent backend succeeded at
>  * advancing. When we finish advancing XLogCtl->InitializedUpTo, we
>  * notify all the waiters with XLogCtl->InitializedUpToCondVar.
>  */
> l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo,
>                                           &NewPageBeginPtr, NewPageEndPtr))
> {
>     NewPageBeginPtr = NewPageEndPtr;
>     NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
>     nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
>
> l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
>     {
>         /*
>          * Page at nextidx wasn't initialized yet, so we cann't move
>          * InitializedUpto further. It will be moved by backend which
>          * will initialize nextidx.
>          */
>         ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
>         break;
>     }
> }
>
> Consider the following execution order with process 1 (p1) and process 2 (p2).

On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote:
> Sorry, I messed this up.
> The correct sequence is following.
>
> 1. p1 executes l1
> 2. p1 executes l2 with failure
> 3. p2 executes l2 with success
> 4. p2 execute l3, but doesn't see the results of step 1, because 3 didn't provide enough of memory barrier

Did you mean because 2) didn't provide enough of a memory barrier? Because 3) does, right?

You could get in exactly same the situation if the p1 is scheduled out by the OS after step 1, no?

Greetings,

Andres Freund
On Fri, Mar 7, 2025 at 7:38 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote: > > On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote: > > > What is the access pattern and the observed problems with it that made you > > > look at the disassembly? > > > > Check this code. > > > > l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); > > > /* > > * Try to advance XLogCtl->InitializedUpTo. > > * > > * If the CAS operation failed, then some of previous pages are not > > * initialized yet, and this backend gives up. > > * > > * Since initializer of next page might give up on advancing of > > * InitializedUpTo, this backend have to attempt advancing until it > > * find page "in the past" or concurrent backend succeeded at > > * advancing. When we finish advancing XLogCtl->InitializedUpTo, we > > * notify all the waiters with XLogCtl->InitializedUpToCondVar. > > */ > > l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, > > &NewPageBeginPtr, NewPageEndPtr)) > > { > > NewPageBeginPtr = NewPageEndPtr; > > NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; > > nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); > > > > l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != > > NewPageEndPtr) > > { > > /* > > * Page at nextidx wasn't initialized yet, so we cann't move > > * InitializedUpto further. It will be moved by backend > > which > > * will initialize nextidx. > > */ > > > > ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); > > break; > > } > > } > > > > Consider the following execution order with process 1 (p1) and process 2 > > (p2). > > On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote: > > Sorry, I messed this up. > > The correct sequence is following. > > > > 1. p1 executes l1 > > 2. p1 executes l2 with failure > > 3. p2 executes l2 with success > > 4. p2 execute l3, but doesn't see the results of step 1, because 3 > > didn't provide enough of memory barrier > > Did you mean because 2) didn't provide enough of a memory barrier? Because 3) > does, right? Yes, exactly. > You could get in exactly same the situation if the p1 is scheduled out by the > OS after step 1, no? No. In that case, p1 will execute l2 with success. p1 executes l2 with failure only because it goes before p2 executes l2. ------ Regards, Alexander Korotkov Supabase
On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote: > > On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote: > > > > While investigating a bug in the patch to get rid of WALBufMappingLock, I > > > > found that the surrounding pg_atomic_compare_exchange_u64() fixes the > > > > problem for me. > > > > > > That sounds more likely to be due to slowing down things enough to make a race > > > less likely to be hit. Or a compiler bug. > > > > > > > > > > That doesn't feel right because, according to the > > > > comments, both pg_atomic_compare_exchange_u32() and > > > > pg_atomic_compare_exchange_u64() should provide full memory barrier > > > > semantics. So, I decided to investigate this further. > > > > > > > > In my case, the implementation of pg_atomic_compare_exchange_u64() is based > > > > on __atomic_compare_exchange_n(). > > > > > > > > static inline bool > > > > pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, > > > > uint64 *expected, uint64 newval) > > > > { > > > > AssertPointerAlignment(expected, 8); > > > > return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, > > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); > > > > } > > > > > > > > According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all > > > > other __ATOMIC_SEQ_CST operations*. It only says about other > > > > __ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes. > > > > This sounds quite far from our comment, promising full barrier semantics, > > > > which, in my understanding, means the equivalent of > > > > both pg_read_barrier()/pg_write_barrier(), which should barrier *all other > > > > read/writes*. > > > > > > A memory barrier in one process/thread always needs be paired with a barrier > > > in another process/thread. It's not enough to have a memory barrier on one > > > side but not the other. > > > > Yes, there surely should be a memory barrier on another side. But > > does __ATOMIC_SEQ_CST works as a barrier for the regular read/write > > operations on the same side? > > Yes, if it's paired with another barrier on the other side. The regular > read/write operations are basically protected transitively, due to > > a) An acquire barrier preventing non-atomic reads/writes in the same thread > from appearing to have been moved to before the barrier > > b) A release barrier preventing non-atomic reads/writes in the same thread > from appearing to have been moved to after the barrier > > c) The other thread being guaranteed a) and b) for the other threads' > non-atomic reads/writes depending on the type of barrier > > d) The atomic value itself being guaranteed to be, well, atomic > > > > > > This function first checks if LSE instructions are present. If so, > > > > instruction LSE casal is used. If not (in my case), the loop of > > > > ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr > > > > provides one-way barriers. Read/writes after ldaxr will be observed after, > > > > and read/writes before stlxr will be observed before. So, a pair of these > > > > instructions provides a full memory barrier. However, if we don't observe > > > > the expected value, only ldaxr gets executed. So, an unsuccessful > > > > pg_atomic_compare_exchange_u64() attempt will leave us with a one-way > > > > barrier, and that caused a bug in my case. > > > > > > That has to be a compiler bug. 
We specify __ATOMIC_SEQ_CST for both > > > success_memorder *and* failure_memorder. > > > > > > What compiler & version is this? > > > > I've tried and got the same results with two compilers. > > gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 > > Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2) > > Thinking more about it I wonder if the behaviour of not doing a release in > case the atomic fails *might* arguably actually be correct - if the > compare-exchange fails, there's nothing that the non-atomic values could be > ordered against. There is another successful compare-exchange executed by a concurrent process to get ordered against. ------ Regards, Alexander Korotkov Supabase
On 2025-03-07 19:44:20 +0200, Alexander Korotkov wrote: > On Fri, Mar 7, 2025 at 7:38 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote: > > > On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote: > > > > What is the access pattern and the observed problems with it that made you > > > > look at the disassembly? > > > > > > Check this code. > > > > > > l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); > > > > > /* > > > * Try to advance XLogCtl->InitializedUpTo. > > > * > > > * If the CAS operation failed, then some of previous pages are not > > > * initialized yet, and this backend gives up. > > > * > > > * Since initializer of next page might give up on advancing of > > > * InitializedUpTo, this backend have to attempt advancing until it > > > * find page "in the past" or concurrent backend succeeded at > > > * advancing. When we finish advancing XLogCtl->InitializedUpTo, we > > > * notify all the waiters with XLogCtl->InitializedUpToCondVar. > > > */ > > > l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, > > > &NewPageBeginPtr, NewPageEndPtr)) > > > { > > > NewPageBeginPtr = NewPageEndPtr; > > > NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; > > > nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); > > > > > > l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != > > > NewPageEndPtr) > > > { > > > /* > > > * Page at nextidx wasn't initialized yet, so we cann't move > > > * InitializedUpto further. It will be moved by backend > > > which > > > * will initialize nextidx. > > > */ > > > > > > ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); > > > break; > > > } > > > } > > > > > > Consider the following execution order with process 1 (p1) and process 2 > > > (p2). > > > > On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote: > > > Sorry, I messed this up. > > > The correct sequence is following. > > > > > > 1. p1 executes l1 > > > 2. p1 executes l2 with failure > > > 3. p2 executes l2 with success > > > 4. p2 execute l3, but doesn't see the results of step 1, because 3 > > > didn't provide enough of memory barrier > > > > Did you mean because 2) didn't provide enough of a memory barrier? Because 3) > > does, right? > > Yes, exactly. > > > You could get in exactly same the situation if the p1 is scheduled out by the > > OS after step 1, no? > > No. In that case, p1 will execute l2 with success. p1 executes l2 > with failure only because it goes before p2 executes l2. In my scenario p1 will not execute l2 because p2 gets scheduled before it can do so. So p1 cant yet execute l2 before p2 executes l2. Greetings, Andres Freund
On Fri, Mar 7, 2025 at 7:46 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-03-07 19:44:20 +0200, Alexander Korotkov wrote: > > On Fri, Mar 7, 2025 at 7:38 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote: > > > > On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote: > > > > > What is the access pattern and the observed problems with it that made you > > > > > look at the disassembly? > > > > > > > > Check this code. > > > > > > > > l1: pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr); > > > > > > > /* > > > > * Try to advance XLogCtl->InitializedUpTo. > > > > * > > > > * If the CAS operation failed, then some of previous pages are not > > > > * initialized yet, and this backend gives up. > > > > * > > > > * Since initializer of next page might give up on advancing of > > > > * InitializedUpTo, this backend have to attempt advancing until it > > > > * find page "in the past" or concurrent backend succeeded at > > > > * advancing. When we finish advancing XLogCtl->InitializedUpTo, we > > > > * notify all the waiters with XLogCtl->InitializedUpToCondVar. > > > > */ > > > > l2: while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, > > > > &NewPageBeginPtr, NewPageEndPtr)) > > > > { > > > > NewPageBeginPtr = NewPageEndPtr; > > > > NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ; > > > > nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr); > > > > > > > > l3: if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != > > > > NewPageEndPtr) > > > > { > > > > /* > > > > * Page at nextidx wasn't initialized yet, so we cann't move > > > > * InitializedUpto further. It will be moved by backend > > > > which > > > > * will initialize nextidx. > > > > */ > > > > > > > > ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar); > > > > break; > > > > } > > > > } > > > > > > > > Consider the following execution order with process 1 (p1) and process 2 > > > > (p2). > > > > > > On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote: > > > > Sorry, I messed this up. > > > > The correct sequence is following. > > > > > > > > 1. p1 executes l1 > > > > 2. p1 executes l2 with failure > > > > 3. p2 executes l2 with success > > > > 4. p2 execute l3, but doesn't see the results of step 1, because 3 > > > > didn't provide enough of memory barrier > > > > > > Did you mean because 2) didn't provide enough of a memory barrier? Because 3) > > > does, right? > > > > Yes, exactly. > > > > > You could get in exactly same the situation if the p1 is scheduled out by the > > > OS after step 1, no? > > > > No. In that case, p1 will execute l2 with success. p1 executes l2 > > with failure only because it goes before p2 executes l2. > > In my scenario p1 will not execute l2 because p2 gets scheduled before it can > do so. So p1 cant yet execute l2 before p2 executes l2. In my scenario p1.l2 will be successful if executed after p2.l2 and failed if executed before p2.l2. Imagine initial value is 0. p1 tries to change 1=>2, while p2 tries to change 0 => 1. ------ Regards, Alexander Korotkov Supabase
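(For completeness, here is that scenario -- initial value 0, p1 attempting 1=>2, p2 attempting 0=>1 -- reduced to a stand-alone litmus test. It is only a sketch: plain GCC builtins and pthreads instead of the PostgreSQL wrappers, illustrative variable names, and the reordering is timing-dependent, so a single run proves nothing:)

/* cas_failure_litmus.c -- illustrative sketch of the p1/p2 scenario above */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t xlblock;			/* stands in for xlblocks[nextidx] */
static uint64_t initialized_upto;	/* stands in for InitializedUpTo, initially 0 */

static void *
p1(void *arg)
{
	uint64_t	expected = 1;

	/* l1: relaxed write, like pg_atomic_write_u64() */
	__atomic_store_n(&xlblock, 42, __ATOMIC_RELAXED);

	/*
	 * l2: CAS 1 => 2; fails if it runs before p2's CAS.  The question is
	 * whether this failed CAS still releases the write to xlblock.
	 */
	__atomic_compare_exchange_n(&initialized_upto, &expected, 2, false,
								__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return NULL;
}

static void *
p2(void *arg)
{
	uint64_t	expected = 0;

	/* l2: CAS 0 => 1; succeeds if it runs before p1's CAS */
	if (__atomic_compare_exchange_n(&initialized_upto, &expected, 1, false,
									__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
	{
		/* l3: may this still read 0 even though p1 already did l1 and l2? */
		uint64_t	seen = __atomic_load_n(&xlblock, __ATOMIC_RELAXED);

		printf("p2 saw xlblock = %llu\n", (unsigned long long) seen);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	t1, t2;

	pthread_create(&t1, NULL, p1, NULL);
	pthread_create(&t2, NULL, p2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}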