Thread: locked reads for atomics
Moving this to a new thread and adding it to the January commitfest. On Thu, Nov 09, 2023 at 03:27:33PM -0600, Nathan Bossart wrote: > On Tue, Nov 07, 2023 at 04:58:16PM -0800, Andres Freund wrote: >> However, even if there's likely some other implied memory barrier that we >> could piggyback on, the patch much simpler to understand if it doesn't change >> coherency rules. There's no way the overhead could matter. > > I wonder if it's worth providing a set of "locked read" functions. Those > could just do a compare/exchange with 0 in the generic implementation. For > patches like this one where the overhead really shouldn't matter, I'd > encourage folks to use those to make it easy to reason about correctness. Concretely, like this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
>I wonder if it's worth providing a set of "locked read" functions.
Most out-of-order machines include “read acquire” and “write release” which are pretty close to what you’re suggesting. With the current routines, we only have “read relaxed” and “write relaxed”. I think implementing acquire/release semantics is a very good idea,
I would also like to clarify the properties of atomics. One very important question: Are atomics also volatile? If so, the compiler has very limited ability to move them around. If not, it is difficult to tell when or where they will take place unless the surrounding code is peppered with barriers.
On Fri, Nov 10, 2023 at 09:49:06PM +0000, John Morris wrote: > Most out-of-order machines include “read acquire” and “write release” > which are pretty close to what you’re suggesting. With the current > routines, we only have “read relaxed” and “write relaxed”. I think > implementing acquire/release semantics is a very good idea, We do have both pg_atomic_write_u32() and pg_atomic_unlocked_write_u32() (see commit b0779ab), but AFAICT those only differ in the fallback/spinlock implementations. I suppose there could be an unlocked 64-bit write on platforms that have 8-byte single-copy atomicity but still need to use the fallback/spinlock implementation for some reason, but that might be a bit of a stretch, and the use-cases might be few and far between... > I would also like to clarify the properties of atomics. One very > important question: Are atomics also volatile? The PostgreSQL atomics support appears to ensure they are volatile. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote: > Moving this to a new thread and adding it to the January commitfest. > > On Thu, Nov 09, 2023 at 03:27:33PM -0600, Nathan Bossart wrote: > > On Tue, Nov 07, 2023 at 04:58:16PM -0800, Andres Freund wrote: > >> However, even if there's likely some other implied memory barrier that we > >> could piggyback on, the patch much simpler to understand if it doesn't change > >> coherency rules. There's no way the overhead could matter. > > > > I wonder if it's worth providing a set of "locked read" functions. Those > > could just do a compare/exchange with 0 in the generic implementation. For > > patches like this one where the overhead really shouldn't matter, I'd > > encourage folks to use those to make it easy to reason about correctness. > > Concretely, like this. I don't think "locked" is a good name - there's no locking. I think that'll deter their use, because it will make it sound more expensive than they are. pg_atomic_read_membarrier_u32()? > @@ -228,7 +228,8 @@ pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val) > * The read is guaranteed to return a value as it has been written by this or > * another process at some point in the past. There's however no cache > * coherency interaction guaranteeing the value hasn't since been written to > - * again. > + * again. Consider using pg_atomic_locked_read_u32() unless you have a strong > + * reason (e.g., performance) to use unlocked reads. I think that's too strong an endorsement. Often there will be no difference in difficulty analysing correctness, because the barrier pairing / the interaction with the surrounding code needs to be analyzed just as much. > * No barrier semantics. > */ > @@ -239,6 +240,24 @@ pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr) > return pg_atomic_read_u32_impl(ptr); > } > > +/* > + * pg_atomic_read_u32 - locked read from atomic variable. Un-updated name... > + * This read is guaranteed to read the current value, It doesn't guarantee that *at all*. What it guarantees is solely that the current CPU won't be doing something that could lead to reading an outdated value. To actually ensure the value is up2date, the modifying side also needs to have used a form of barrier (in the form of fetch_add, compare_exchange, etc or an explicit barrier). > +#ifndef PG_HAVE_ATOMIC_LOCKED_READ_U32 > +#define PG_HAVE_ATOMIC_LOCKED_READ_U32 > +static inline uint32 > +pg_atomic_locked_read_u32_impl(volatile pg_atomic_uint32 *ptr) > +{ > + uint32 old = 0; > + > + /* > + * In the generic implementation, locked reads are implemented as a > + * compare/exchange with 0. That'll fail or succeed, but always return the > + * most up-to-date value. It might also store a 0, but only if the > + * previous value was also a zero, i.e., harmless. > + */ > + pg_atomic_compare_exchange_u32_impl(ptr, &old, 0); > + > + return old; > +} > +#endif I suspect implementing it with an atomic fetch_add of 0 would be faster... Greetings, Andres Freund
Hi, On 2023-11-10 21:49:06 +0000, John Morris wrote: > >I wonder if it's worth providing a set of "locked read" functions. > > Most out-of-order machines include “read acquire” and “write release” which > are pretty close to what you’re suggesting. Is that really true? It's IA64 lingo. X86 doesn't have them, while arm has more granular barriers, they don't neatly map onto acquire/release either. I don't think an acquire here would actually be equivalent to making this a full barrier - an acquire barrier allows moving reads or stores from *before* the barrier to be moved after the barrier. It just prevents the opposite. And for proper use of acquire/release semantics we'd need to pair operations much more closely. Right now we often rely on another preceding memory barrier to ensure correct ordering, having to use paired operations everywhere would lead to slower code. I thoroughly dislike how strongly C++11/C11 prefer paired atomics *on the same address* over "global" fences. It often leads to substantially slower code. And they don't at all map neatly on hardware, where largely barrier semantics are *not* tied to individual addresses. And the fence specification is just about unreadable (although I think they did fix some of the worst issues). Greetings, Andres Freund
On Fri, Nov 10, 2023 at 03:11:50PM -0800, Andres Freund wrote: > On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote: >> + * This read is guaranteed to read the current value, > > It doesn't guarantee that *at all*. What it guarantees is solely that the > current CPU won't be doing something that could lead to reading an outdated > value. To actually ensure the value is up2date, the modifying side also needs > to have used a form of barrier (in the form of fetch_add, compare_exchange, > etc or an explicit barrier). Okay, I think I was missing that this doesn't work along with pg_atomic_write_u32() because that doesn't have any barrier semantics (probably because the spinlock version does). IIUC you'd want to use pg_atomic_exchange_u32() to write the value instead, which seems to really just be another compare/exchange under the hood. Speaking of the spinlock implementation of pg_atomic_write_u32(), I've been staring at this comment for a while and can't make sense of it: void pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val) { /* * One might think that an unlocked write doesn't need to acquire the * spinlock, but one would be wrong. Even an unlocked write has to cause a * concurrent pg_atomic_compare_exchange_u32() (et al) to fail. */ SpinLockAcquire((slock_t *) &ptr->sema); ptr->value = val; SpinLockRelease((slock_t *) &ptr->sema); } It refers to "unlocked writes," but this isn't pg_atomic_unlocked_write_u32_impl(). The original thread for this comment [0] doesn't offer any hints, either. Does "unlocked" mean something different here, such as "write without any barrier semantics?" [0] https://postgr.es/m/14947.1475690465%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-10 20:38:13 -0600, Nathan Bossart wrote: > On Fri, Nov 10, 2023 at 03:11:50PM -0800, Andres Freund wrote: > > On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote: > >> + * This read is guaranteed to read the current value, > > > > It doesn't guarantee that *at all*. What it guarantees is solely that the > > current CPU won't be doing something that could lead to reading an outdated > > value. To actually ensure the value is up2date, the modifying side also needs > > to have used a form of barrier (in the form of fetch_add, compare_exchange, > > etc or an explicit barrier). > > Okay, I think I was missing that this doesn't work along with > pg_atomic_write_u32() because that doesn't have any barrier semantics > (probably because the spinlock version does). IIUC you'd want to use > pg_atomic_exchange_u32() to write the value instead, which seems to really > just be another compare/exchange under the hood. Yes. We should optimize pg_atomic_exchange_u32() one of these days - it can be done *far* faster than a cmpxchg. When I was adding the atomic abstraction there was concern with utilizing too many different atomic instructions. I didn't really agree back then, but these days I really don't see a reason to not use a few more intrinsics. > Speaking of the spinlock implementation of pg_atomic_write_u32(), I've been > staring at this comment for a while and can't make sense of it: > > void > pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val) > { > /* > * One might think that an unlocked write doesn't need to acquire the > * spinlock, but one would be wrong. Even an unlocked write has to cause a > * concurrent pg_atomic_compare_exchange_u32() (et al) to fail. > */ > SpinLockAcquire((slock_t *) &ptr->sema); > ptr->value = val; > SpinLockRelease((slock_t *) &ptr->sema); > } > > It refers to "unlocked writes," but this isn't > pg_atomic_unlocked_write_u32_impl(). The original thread for this comment > [0] doesn't offer any hints, either. Does "unlocked" mean something > different here, such as "write without any barrier semantics?" It's just about not using the spinlock. If we were to *not* use a spinlock here, we'd break pg_atomic_compare_exchange_u32(), because the spinlock-implementation of pg_atomic_compare_exchange_u32() needs to actually be able to rely on no concurrent changes to the value to happen. Greetings, Andres Freund
On Fri, Nov 10, 2023 at 06:48:39PM -0800, Andres Freund wrote: > Yes. We should optimize pg_atomic_exchange_u32() one of these days - it can be > done *far* faster than a cmpxchg. When I was adding the atomic abstraction > there was concern with utilizing too many different atomic instructions. I > didn't really agree back then, but these days I really don't see a reason to > not use a few more intrinsics. I might give this a try, if for no other reason than it'd force me to improve my mental model of this stuff. :) >> It refers to "unlocked writes," but this isn't >> pg_atomic_unlocked_write_u32_impl(). The original thread for this comment >> [0] doesn't offer any hints, either. Does "unlocked" mean something >> different here, such as "write without any barrier semantics?" > > It's just about not using the spinlock. If we were to *not* use a spinlock > here, we'd break pg_atomic_compare_exchange_u32(), because the > spinlock-implementation of pg_atomic_compare_exchange_u32() needs to actually > be able to rely on no concurrent changes to the value to happen. Thanks for clarifying. I thought it might've been hinting at something beyond the compare/exchange implications. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Here's a v2 of the patch set in which I've attempted to address all feedback. I've also added a pg_write_membarrier_u* pair of functions that provide an easy way to write to an atomic variable with full barrier semantics. In the generic implementation, these are just aliases for an atomic exchange. 0002 demonstrates how these functions might be used to eliminate the arch_lck spinlock, which is only ever used for one boolean variable. My hope is that the membarrier functions make eliminating spinlocks for non-performance-sensitive code easy to reason about. (We might be able to use a pg_atomic_flag instead for 0002, but that code seems intended for a slightly different use-case and has more complicated barrier semantics.) -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
> On Nov 28, 2023, at 05:00, Nathan Bossart <nathandbossart@gmail.com> wrote: > > External Email > > Here's a v2 of the patch set in which I've attempted to address all > feedback. I've also added a pg_write_membarrier_u* pair of functions that > provide an easy way to write to an atomic variable with full barrier > semantics. In the generic implementation, these are just aliases for an > atomic exchange. > > 0002 demonstrates how these functions might be used to eliminate the > arch_lck spinlock, which is only ever used for one boolean variable. My > hope is that the membarrier functions make eliminating spinlocks for > non-performance-sensitive code easy to reason about. > > (We might be able to use a pg_atomic_flag instead for 0002, but that code > seems intended for a slightly different use-case and has more complicated > barrier semantics.) > > -- > Nathan Bossart > Amazon Web Services: https://aws.amazon.com Hi Nathan, The patch looks good to me. The patch adds two pairs of atomic functions that provide full-barrier semantics to atomic read/write operations. The patchalso includes an example of how this new functions can be used to replace spin locks. The patch applies cleanly to HEAD. “make check-world” also runs cleanly with no error. I am moving it to Ready for Committers. Regards, Yong
On Tue, Nov 28, 2023 at 2:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Here's a v2 of the patch set in which I've attempted to address all > feedback. I've also added a pg_write_membarrier_u* pair of functions that There's some immediate use for reads/writes with barrier semantics - https://www.postgresql.org/message-id/CALj2ACXrePj4E6ocKr-%2Bb%3DrjT-8yeMmHnEeWQP1bc-WXETfTVw%40mail.gmail.com. Any plan for taking this forward? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, 2024-02-22 at 12:58 +0530, Bharath Rupireddy wrote: > There's some immediate use for reads/writes with barrier semantics - Is this mainly a convenience for safety/readability? Or is it faster in some cases than doing an atomic access with separate memory barriers? Regards, Jeff Davis
On Thu, Feb 22, 2024 at 11:53:50AM -0800, Jeff Davis wrote: > On Thu, 2024-02-22 at 12:58 +0530, Bharath Rupireddy wrote: >> There's some immediate use for reads/writes with barrier semantics - > > Is this mainly a convenience for safety/readability? Or is it faster in > some cases than doing an atomic access with separate memory barriers? The former. Besides the 0002 patch tracked here, there's at least one other patch [0] that could probably use these new functions. The idea is to provide an easy way to remove spinlocks, etc. and use atomics for less performance-sensitive stuff. The implementations are intended to be relatively inexpensive and might continue to improve in the future, but the functions are primarily meant to help reason about correctness. I don't mind prioritizing these patches, especially since there now seems to be multiple patches waiting on it. IIRC I was worried about not having enough support for this change, but I might now have it. [0] https://commitfest.postgresql.org/47/4330/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, 2024-02-23 at 10:17 -0600, Nathan Bossart wrote: > The idea is > to provide an easy way to remove spinlocks, etc. and use atomics for > less > performance-sensitive stuff. The implementations are intended to be > relatively inexpensive and might continue to improve in the future, > but the > functions are primarily meant to help reason about correctness. To be clear: x = pg_atomic_[read|write]_membarrier_u64(&v); is semantically equivalent to: pg_memory_barrier(); x = pg_atomic_[read|write]_u64(&v); pg_memory_barrier(); ? If so, that does seem more convenient. Regards, Jeff Davis
On Fri, Feb 23, 2024 at 10:25:00AM -0800, Jeff Davis wrote: > To be clear: > > x = pg_atomic_[read|write]_membarrier_u64(&v); > > is semantically equivalent to: > > pg_memory_barrier(); > x = pg_atomic_[read|write]_u64(&v); > pg_memory_barrier(); > > ? > > If so, that does seem more convenient. I think that's about right. The upthread feedback from Andres [0] provides some additional context. [0] https://postgr.es/m/20231110231150.fjm77gup2i7xu6hc%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Here is a v3 of the patch set with the first draft of the commit messages. There are no code differences between v2 and v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 2024-02-23 10:25:00 -0800, Jeff Davis wrote: > On Fri, 2024-02-23 at 10:17 -0600, Nathan Bossart wrote: > > The idea is > > to provide an easy way to remove spinlocks, etc. and use atomics for > > less > > performance-sensitive stuff. The implementations are intended to be > > relatively inexpensive and might continue to improve in the future, > > but the > > functions are primarily meant to help reason about correctness. > > To be clear: > > x = pg_atomic_[read|write]_membarrier_u64(&v); > > is semantically equivalent to: > > pg_memory_barrier(); > x = pg_atomic_[read|write]_u64(&v); > pg_memory_barrier(); > ? > > If so, that does seem more convenient. Kinda. Semantically I think that's correct, however it doesn't commonly make sense to have both those memory barriers, so you wouldn't really write code like that and thus comparing on the basis of convenience doesn't quite seem right. Rather than convenience, I think performance and simplicity are better arguments. If you're going to execute a read and then a memory barrier, it's going to be faster to just do a single atomic operation. And it's a bit simpler to analyze on which "side" of the read/write the barrier is needed. Greetings, Andres Freund
Hi, On 2024-02-23 14:58:12 -0600, Nathan Bossart wrote: > +/* > + * pg_atomic_write_membarrier_u32 - write with barrier semantics. > + * > + * The write is guaranteed to succeed as a whole, i.e., it's not possible to > + * observe a partial write for any reader. Note that this correctly interacts > + * with both pg_atomic_compare_exchange_u32() and > + * pg_atomic_read_membarrier_u32(). While this may be less performant than > + * pg_atomic_write_u32() and pg_atomic_unlocked_write_u32(), it may be easier > + * to reason about correctness with this function in less performance-sensitive > + * code. > + * > + * Full barrier semantics. > + */ The callout to pg_atomic_unlocked_write_u32() is wrong. The reason to use pg_atomic_unlocked_write_u32() is for variables where we do not ever want to fall back to spinlocks/semaphores, because the underlying variable isn't actually shared. In those cases using the other variants is a bug. The only use of pg_atomic_unlocked_write_u32() is temp-table buffers which share the data structure with the shared buffers case. Greetings, Andres Freund
On Fri, Feb 23, 2024 at 05:34:49PM -0800, Andres Freund wrote: > On 2024-02-23 14:58:12 -0600, Nathan Bossart wrote: >> +/* >> + * pg_atomic_write_membarrier_u32 - write with barrier semantics. >> + * >> + * The write is guaranteed to succeed as a whole, i.e., it's not possible to >> + * observe a partial write for any reader. Note that this correctly interacts >> + * with both pg_atomic_compare_exchange_u32() and >> + * pg_atomic_read_membarrier_u32(). While this may be less performant than >> + * pg_atomic_write_u32() and pg_atomic_unlocked_write_u32(), it may be easier >> + * to reason about correctness with this function in less performance-sensitive >> + * code. >> + * >> + * Full barrier semantics. >> + */ > > The callout to pg_atomic_unlocked_write_u32() is wrong. The reason to use > pg_atomic_unlocked_write_u32() is for variables where we do not ever want to > fall back to spinlocks/semaphores, because the underlying variable isn't > actually shared. In those cases using the other variants is a bug. The only > use of pg_atomic_unlocked_write_u32() is temp-table buffers which share the > data structure with the shared buffers case. I removed the reference to pg_atomic_unlocked_write_u32(). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Committed. Thank you for reviewing! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com