Thread: better atomics
On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Partially that only works because we sprinkle 'volatile's on lots of > places. That can actually hurt performance... it'd usually be something > like: > #define pg_atomic_load(uint32) (*(volatile uint32 *)(atomic)) > > Even if not needed in some places because a variable is already > volatile, it's very helpful in pointing out what happens and where you > need to be careful. That's fine with me. >> > * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val) >> > * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval) >> > * uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add) >> > * uint32 pg_atomic_fetch_sub_u32(uint32 *ptr, uint32 add) >> > * uint32 pg_atomic_fetch_and_u32(uint32 *ptr, uint32 add) >> > * uint32 pg_atomic_fetch_or_u32(uint32 *ptr, uint32 add) >> >> Do we really need all of those? Compare-and-swap is clearly needed, >> and fetch-and-add is also very useful. I'm not sure about the rest. >> Not that I necessarily object to having them, but it will be a lot >> more work. > > Well, _sub can clearly be implemented with _add generically. I find code > using _sub much easier to read than _add(-whatever), but that's maybe > just me. Yeah, I wouldn't bother with _sub. > But _and, _or are really useful because they can be used to atomically > clear and set flag bits. Until we have some concrete application that requires this functionality for adequate performance, I'd be inclined not to bother.I think we have bigger fish to fry, and (to extendmy nautical metaphor) trying to get this working everywhere might amount to trying to boil the ocean. >> > * u64 variants of the above >> > * bool pg_atomic_test_set(void *ptr) >> > * void pg_atomic_clear(void *ptr) >> >> What are the intended semantics here? > > It's basically TAS() without defining whether setting a value that needs > to be set is a 1 or a 0. Gcc provides this and I think we should make > our spinlock implementation use it if there is no special cased > implementation available. I'll reserve judgement until I see the patch. >> More general question: how do we move the ball down the field in this >> area? I'm willing to do some of the legwork, but I doubt I can do all >> of it, and I really think we need to make some progress here soon, as >> it seems that you and I and Ants are all running into the same >> problems in slightly different ways. What's our strategy for getting >> something done here? > > That's a pretty good question. > > I am willing to write the gcc implementation, plus the generic wrappers > and provide the infrastructure to override it per-platform. I won't be > able to do anything about non-linux, non-gcc platforms in that timeframe > though. > > I was thinking of something like: > include/storage/atomic.h > include/storage/atomic-gcc.h > include/storage/atomic-msvc.h > include/storage/atomic-acc-ia64.h > where atomic.h first has a list of #ifdefs including the specialized > files and then lots of #ifndef providing generic variants if not > already provided by the platorm specific file. Seems like we might want to put some of this in one of the various directories called "port", or maybe a new subdirectory of one of them.This basically sounds sane, but I'm unwilling to committo dropping support for all obscure platforms we currently support unless there are about 500 people +1ing the idea, so I think we need to think about what happens when we don't have platform support for these primitives. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-16 14:30:34 -0400, Robert Haas wrote: > > But _and, _or are really useful because they can be used to atomically > > clear and set flag bits. > > Until we have some concrete application that requires this > functionality for adequate performance, I'd be inclined not to bother. > I think we have bigger fish to fry, and (to extend my nautical > metaphor) trying to get this working everywhere might amount to trying > to boil the ocean. Well, I actually have a very, very preliminary patch using them in the course of removing the spinlocks in PinBuffer() (via LockBufHdr()). I think it's also going to be easier to add all required atomics at once than having to go over several platforms in independent feature patches adding atomics one by one. > > I was thinking of something like: > > include/storage/atomic.h > > include/storage/atomic-gcc.h > > include/storage/atomic-msvc.h > > include/storage/atomic-acc-ia64.h > > where atomic.h first has a list of #ifdefs including the specialized > > files and then lots of #ifndef providing generic variants if not > > already provided by the platorm specific file. > Seems like we might want to put some of this in one of the various > directories called "port", or maybe a new subdirectory of one of them. Hm. I'd have to be src/include/port, right? Not sure if putting some files there will be cleaner, but I don't mind it either. > This basically sounds sane, but I'm unwilling to commit to dropping > support for all obscure platforms we currently support unless there > are about 500 people +1ing the idea, so I think we need to think about > what happens when we don't have platform support for these primitives. I think the introduction of the atomics really isn't much dependent on the removal. The only thing I'd touch around platforms in that patch is adding a generic fallback pg_atomic_test_and_set() to s_lock.h and remove the special case usages of __sync_lock_test_and_set() (Arm64, some 32 bit arms). It's the patches that would like to rely on atomics that will have the problem :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-10-16 14:30:34 -0400, Robert Haas wrote: >>> But _and, _or are really useful because they can be used to atomically >>> clear and set flag bits. >> Until we have some concrete application that requires this >> functionality for adequate performance, I'd be inclined not to bother. >> I think we have bigger fish to fry, and (to extend my nautical >> metaphor) trying to get this working everywhere might amount to trying >> to boil the ocean. > Well, I actually have a very, very preliminary patch using them in the > course of removing the spinlocks in PinBuffer() (via LockBufHdr()). I think we need to be very, very wary of making our set of required atomics any larger than it absolutely has to be. The more stuff we require, the closer we're going to be to making PG a program that only runs well on Intel. I am not comforted by the "gcc will provide good implementations of the atomics everywhere" argument, because in fact it won't. See for example the comment associated with our only existing use of gcc atomics: * On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if* not fall back on the SWPB instruction. SWPBdoes not work on ARMv6 or* later, so the compiler builtin is preferred if available. Note also that* the int-width variantof the builtin works on more chips than other widths. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ That's not a track record that should inspire much faith that complete sets of atomics will exist elsewhere. What's more, we don't just need atomics that *work*, we need atomics that are *fast*, else the whole exercise turns into pessimization not improvement. The more atomic ops we depend on, the more likely it is that some of them will involve kernel support on some chips, destroying whatever performance improvement we hoped to get. > ... The only thing I'd touch around platforms in that patch is > adding a generic fallback pg_atomic_test_and_set() to s_lock.h and > remove the special case usages of __sync_lock_test_and_set() (Arm64, > some 32 bit arms). Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be about one line long. The above quote seems to me to be exactly the kind of optimism that is not warranted in this area. regards, tom lane
On 2013-10-16 22:39:07 +0200, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-10-16 14:30:34 -0400, Robert Haas wrote: > >>> But _and, _or are really useful because they can be used to atomically > >>> clear and set flag bits. > > >> Until we have some concrete application that requires this > >> functionality for adequate performance, I'd be inclined not to bother. > >> I think we have bigger fish to fry, and (to extend my nautical > >> metaphor) trying to get this working everywhere might amount to trying > >> to boil the ocean. > > > Well, I actually have a very, very preliminary patch using them in the > > course of removing the spinlocks in PinBuffer() (via LockBufHdr()). > > I think we need to be very, very wary of making our set of required > atomics any larger than it absolutely has to be. The more stuff we > require, the closer we're going to be to making PG a program that only > runs well on Intel. Well, essentially I am advocating to support basically three operations: * compare and swap * atomic add (and by that sub) * test and set The other operations are just porcelain around that. With compare and swap both the others can be easily implemented if neccessary. Note that e.g. linux - running on all platforms we're talking about but VAX - exensively and unconditionally uses atomic operations widely. So I think we don't have to be too afraid about non-intel performance here. > I am not comforted by the "gcc will provide good > implementations of the atomics everywhere" argument, because in fact > it won't. See for example the comment associated with our only existing > use of gcc atomics: > > * On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if > * not fall back on the SWPB instruction. SWPB does not work on ARMv6 or > * later, so the compiler builtin is preferred if available. Note also that > * the int-width variant of the builtin works on more chips than other widths. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > That's not a track record that should inspire much faith that complete > sets of atomics will exist elsewhere. What's more, we don't just need > atomics that *work*, we need atomics that are *fast*, else the whole > exercise turns into pessimization not improvement. The more atomic ops > we depend on, the more likely it is that some of them will involve kernel > support on some chips, destroying whatever performance improvement we > hoped to get. Hm. I can't really follow. We *prefer* to use __sync_lock_test_and_set in contrast to our own open-coded implementation, right? And the comment about some hardware only supporting "int-width" matches that I only want to require u32 atomics support, right? I completely agree that we cannot rely on 8byte math or similar (16byte cmpxchg) to be supported by all platforms. That indeed would require kernel fallbacks on e.g. ARM. > > ... The only thing I'd touch around platforms in that patch is > > adding a generic fallback pg_atomic_test_and_set() to s_lock.h and > > remove the special case usages of __sync_lock_test_and_set() (Arm64, > > some 32 bit arms). > > Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be > about one line long. The above quote seems to me to be exactly the kind > of optimism that is not warranted in this area. Well, I am somewhat hesitant to change s_lock for more platforms than necessary. So I proposed to restructure it in a way that leaves all existing implementations in place that do not already rely on __sync_lock_test_and_set(). There's also SPIN_DELAY btw, which I don't see any widely provided intrinsics for. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I have a related problem, which is that some code I'm currently >> working on vis-a-vis parallelism can run lock-free on platforms with >> atomic 8 bit assignment but needs a spinlock or two elsewhere. So I'd >> want to use pg_atomic_store_u64(), but I'd also need a clean way to >> test, at compile time, whether it's available. > > Yes, definitely. There should be a couple of #defines that declare > whether non-prerequisite options are supported. I'd guess we want at least: > * 8byte math > * 16byte compare_and_swap I'm not terribly excited about relying on 16-byte CAS, but I agree that 8-byte math, at least, is important. I've not been successful in finding any evidence that gcc has preprocessor symbols to tell us about the properties of 8-byte loads and stores. The closest thing that I found is: http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html That page provides intrinsics for 8-byte atomic loads and stores, among other things. But it seems that the only method for determining whether they work on a particular target is to compile a test program and see whether it complains about __atomic_load_n and/or __atomic_store_n being unresolved symbols. I suppose we could add a configure test for that. Yuck. Anyone have a better idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-28 14:10:48 -0400, Robert Haas wrote: > On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> I have a related problem, which is that some code I'm currently > >> working on vis-a-vis parallelism can run lock-free on platforms with > >> atomic 8 bit assignment but needs a spinlock or two elsewhere. So I'd > >> want to use pg_atomic_store_u64(), but I'd also need a clean way to > >> test, at compile time, whether it's available. > > > > Yes, definitely. There should be a couple of #defines that declare > > whether non-prerequisite options are supported. I'd guess we want at least: > > * 8byte math > > * 16byte compare_and_swap > > I'm not terribly excited about relying on 16-byte CAS, but I agree > that 8-byte math, at least, is important. I've not been successful in > finding any evidence that gcc has preprocessor symbols to tell us > about the properties of 8-byte loads and stores. The closest thing > that I found is: I am talking about making 16byte CAS explicitly optional though? I think if code wants to optionally make use of it (e.g. on x86 where it's been available basically forever) that's fine. It just has to be optional. Or are you saying you're simply not interested in 16byte CAS generally? Same thing for 8byte math - there's no chance that that is going to be supported over all platforms anytime soon, so it has to be optional. > http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html > > That page provides intrinsics for 8-byte atomic loads and stores, > among other things. But it seems that the only method for determining > whether they work on a particular target is to compile a test program > and see whether it complains about __atomic_load_n and/or > __atomic_store_n being unresolved symbols. I suppose we could add a > configure test for that. Yuck. Well, that's pretty easy to integrate into configure - and works on crossbuilds. So I think that'd be ok. But I actually think this is going to be a manual thing, atomic 8byte math will fall back to kernel emulation on several targets, so we probably want some defines to explicitly declare it supported on targets where that's not the case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 28, 2013 at 2:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I'm not terribly excited about relying on 16-byte CAS, but I agree >> that 8-byte math, at least, is important. I've not been successful in >> finding any evidence that gcc has preprocessor symbols to tell us >> about the properties of 8-byte loads and stores. The closest thing >> that I found is: > > I am talking about making 16byte CAS explicitly optional though? I think > if code wants to optionally make use of it (e.g. on x86 where it's been > available basically forever) that's fine. It just has to be optional. > Or are you saying you're simply not interested in 16byte CAS generally? I am just not interested in it generally. Relying on too many OS or architecture-specific primitives has several disadvantages. It's going to be a nuisance on more obscure platforms where they may or may not be supported and may or may not work right even if supported. Even once we get things working right everywhere, it'll mean that performance may suffer on non-mainstream platforms. And I think in many cases it may suggest that we're using an architecture-specific quirk to work around a fundamental problem with our algorithms. I'm more or less convinced of the need for 8-byte atomic reads and writes, test-and-set, 8-byte compare-and-swap, and 8-byte fetch-and-add. But beyond that I'm less sold. Most of the academic papers I've read on implementing lock-free or highly-parallel constructs attempt to confine themselves to 8-byte operations with 8-byte compare-and-swap, and I'm a bit disposed to think we ought to try to hew to that as well. I'm not dead set against going further, but I lean against it, for all of the reasons mentioned above. > Same thing for 8byte math - there's no chance that that is going to be > supported over all platforms anytime soon, so it has to be optional. Agreed. >> http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html >> >> That page provides intrinsics for 8-byte atomic loads and stores, >> among other things. But it seems that the only method for determining >> whether they work on a particular target is to compile a test program >> and see whether it complains about __atomic_load_n and/or >> __atomic_store_n being unresolved symbols. I suppose we could add a >> configure test for that. Yuck. > > Well, that's pretty easy to integrate into configure - and works on > crossbuilds. So I think that'd be ok. > > But I actually think this is going to be a manual thing, atomic 8byte > math will fall back to kernel emulation on several targets, so we > probably want some defines to explicitly declare it supported on targets > where that's not the case. Hmm, OK. I had not come across such cases. There are architectures where it Just Works (like x64_64), architectures where you can make it work by using special instructions (like x86), and (presumably) architectures where it doesn't work at all. Places where it works using really slow kernel emulation are yet another category to worry about. Unfortunately, I have not found any good source that describes which architectures fall into which category. Without that, pulling this together seems intimidating, unless we just declare it working for x86_64, disable it everywhere else, and wait for people running on other architectures to complain. I wonder whether it'd be safe to assume that any machine where pointers are 8 bytes has 8-byte atomic loads and stores. I bet there is a counterexample somewhere. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-28 15:02:41 -0400, Robert Haas wrote: > On Mon, Oct 28, 2013 at 2:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> I'm not terribly excited about relying on 16-byte CAS, but I agree > >> that 8-byte math, at least, is important. I've not been successful in > >> finding any evidence that gcc has preprocessor symbols to tell us > >> about the properties of 8-byte loads and stores. The closest thing > >> that I found is: > > > > I am talking about making 16byte CAS explicitly optional though? I think > > if code wants to optionally make use of it (e.g. on x86 where it's been > > available basically forever) that's fine. It just has to be optional. > > Or are you saying you're simply not interested in 16byte CAS generally? > > I am just not interested in it generally. Relying on too many OS or > architecture-specific primitives has several disadvantages. It's > going to be a nuisance on more obscure platforms where they may or may > not be supported and may or may not work right even if supported. > Even once we get things working right everywhere, it'll mean that > performance may suffer on non-mainstream platforms. But it'll suffer against the *increased* performance on modern platforms, it shouldn't suffer noticeably against the previous performance on the older platform if we're doing our homework... > Most of the academic papers I've read on > implementing lock-free or highly-parallel constructs attempt to > confine themselves to 8-byte operations with 8-byte compare-and-swap, > and I'm a bit disposed to think we ought to try to hew to that as > well. I'm not dead set against going further, but I lean against it, > for all of the reasons mentioned above. I think there are quite some algorithms relying on 16byte CAS, that's why I was thinking about it at all. I think it's easier to add support for it in the easier trawl through the compilers, but I won't argue much for it otherwise for now. > > But I actually think this is going to be a manual thing, atomic 8byte > > math will fall back to kernel emulation on several targets, so we > > probably want some defines to explicitly declare it supported on targets > > where that's not the case. > > Hmm, OK. I had not come across such cases. E.g. ARM/linux which we probably cannot ignore. > Places where it works > using really slow kernel emulation are yet another category to worry > about. Unfortunately, I have not found any good source that describes > which architectures fall into which category. Without that, pulling > this together seems intimidating, unless we just declare it working > for x86_64, disable it everywhere else, and wait for people running on > other architectures to complain. Yes, I think whitelisting targs is a sensible approach here. I am pretty sure we can do better than just x86-64, but that's easily doable in an incremental fashion. > I wonder whether it'd be safe to assume that any machine where > pointers are 8 bytes has 8-byte atomic loads and stores. I bet there > is a counterexample somewhere. :-( Sparc64 :(. Btw, could you quickly give some keywords what you're thinking about making lockless? I mainly am thinking about lwlocks and buffer pins so far. Nothing really ambitious. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 28, 2013 at 3:32 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I wonder whether it'd be safe to assume that any machine where >> pointers are 8 bytes has 8-byte atomic loads and stores. I bet there >> is a counterexample somewhere. :-( > > Sparc64 :(. > > Btw, could you quickly give some keywords what you're thinking about > making lockless? > I mainly am thinking about lwlocks and buffer pins so far. Nothing > really ambitious. Well, I was going to use it for some code I'm working on for parallelism, but I just tested the overhead of a spinlock, and it was zero, possibly negative. So I may not have an immediate application. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28.10.2013 21:32, Andres Freund wrote: > On 2013-10-28 15:02:41 -0400, Robert Haas wrote: >> Most of the academic papers I've read on >> implementing lock-free or highly-parallel constructs attempt to >> confine themselves to 8-byte operations with 8-byte compare-and-swap, >> and I'm a bit disposed to think we ought to try to hew to that as >> well. I'm not dead set against going further, but I lean against it, >> for all of the reasons mentioned above. > > I think there are quite some algorithms relying on 16byte CAS, that's > why I was thinking about it at all. I think it's easier to add support > for it in the easier trawl through the compilers, but I won't argue much > for it otherwise for now. Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit platforms that's 16 bytes, but on 32-bit platforms an 8 byte version will suffice. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 28.10.2013 21:32, Andres Freund wrote: >> I think there are quite some algorithms relying on 16byte CAS, that's >> why I was thinking about it at all. I think it's easier to add support >> for it in the easier trawl through the compilers, but I won't argue much >> for it otherwise for now. > Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit > platforms that's 16 bytes, but on 32-bit platforms an 8 byte version > will suffice. You're both just handwaving. How many is "many", and which ones might we actually have enough use for to justify dealing with such a dependency? I don't think we should buy into this without some pretty concrete justification. regards, tom lane
On 2013-10-28 16:06:47 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: > > On 28.10.2013 21:32, Andres Freund wrote: > >> I think there are quite some algorithms relying on 16byte CAS, that's > >> why I was thinking about it at all. I think it's easier to add support > >> for it in the easier trawl through the compilers, but I won't argue much > >> for it otherwise for now. > > > Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit > > platforms that's 16 bytes, but on 32-bit platforms an 8 byte version > > will suffice. > > You're both just handwaving. How many is "many", and which ones might > we actually have enough use for to justify dealing with such a dependency? > I don't think we should buy into this without some pretty concrete > justification. Well, I don't think either of us is suggesting to make it required. But it's going to be painful to go through the list of compilers repeatedly instead of just adding all operations at one. I am willing to do that for several compilers once, but I don't want to do it in each and every feature submission using another atomic op. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-10-28 16:06:47 -0400, Tom Lane wrote: >> You're both just handwaving. How many is "many", and which ones might >> we actually have enough use for to justify dealing with such a dependency? >> I don't think we should buy into this without some pretty concrete >> justification. > Well, I don't think either of us is suggesting to make it required. But > it's going to be painful to go through the list of compilers repeatedly > instead of just adding all operations at one. I am willing to do that > for several compilers once, but I don't want to do it in each and every > feature submission using another atomic op. Basically I'm arguing that that choice is backwards. We should decide first what the list of supported atomics is going to be, and each and every element of that list has to have a convincing concrete argument why we need to support it. Not "we might want this someday". Because when someday comes, that's when we'd be paying the price of finding out which platforms actually support the primitive correctly and with what performance. Or if someday never comes, we're not ahead either. Or if you like: no matter what you do, the first feature submission that makes use of a given atomic op is going to suffer pain. I don't want to still be suffering that pain two or three years out. The shorter the list of allowed atomic ops, the sooner we'll be done with debugging them. regards, tom lane
On 2013-10-28 16:29:35 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-10-28 16:06:47 -0400, Tom Lane wrote: > >> You're both just handwaving. How many is "many", and which ones might > >> we actually have enough use for to justify dealing with such a dependency? > >> I don't think we should buy into this without some pretty concrete > >> justification. > > > Well, I don't think either of us is suggesting to make it required. But > > it's going to be painful to go through the list of compilers repeatedly > > instead of just adding all operations at one. I am willing to do that > > for several compilers once, but I don't want to do it in each and every > > feature submission using another atomic op. > > Basically I'm arguing that that choice is backwards. We should decide > first what the list of supported atomics is going to be, and each and > every element of that list has to have a convincing concrete argument > why we need to support it. The list I have previously suggested was: * pg_atomic_load_u32(uint32 *) * uint32 pg_atomic_store_u32(uint32 *) To be able to write code without spreading volatiles all over while making it very clear that that part of the code is worthy of attention. * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val) * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval) Used in most lockfree algorithms, can be used to provide fallbacks for when any of the other 32bit operations is not implemented for a platform (so it's easier to bring up a platform). E.g. used in in the wait-free LW_SHARED patch. * uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add) Plain math. E.g. used in in the wait-free LW_SHARED patch. * uint32 pg_atomic_fetch_and_u32(uint32 *ptr, uint32 add) * uint32 pg_atomic_fetch_or_u32(uint32 *ptr, uint32 add) Useful for (un-)setting flags atomically. Needed for (my approach to) spinlock-less Pin/UnpinBuffer. * u64 variants of the above lockfree list manipulation need at least pointer width operations of at least compare_exchange(). I *personally* don't have a case for 8byte math yet, but I am pretty sure parallel sort might be interested in it. * bool pg_atomic_test_set(void *ptr) * void pg_atomic_clear(void *ptr) Can be used as the implementation for spinlocks based on compiler intrinsics if no native implementation is existing. Makes it easier to bring up new platforms. Wrappers around the above: * uint32 pg_atomic_fetch_sub_u32(uint32 *ptr, uint32 add) Generic wrapper that imo makes the code easier to read. No per-platform/compiler overhead. * pg_atomic_add_until_u32(uint32 *ptr, uint32 val, uint32 limit) Useful for managing refcounts like pin counts. * pg_atomic_(add|sub|and|or)_fetch_u32() Generic wrappers for more legible code. No per-platform/compiler overhead. * pg_atomic_compare_and_swap_128() Many algorithms are faster (e.g. some lockless hashtables, which'd be great for the buffer mapping) when a double-pointer-width CAS is available, but also work with an pointer-width CAS in a less efficient manner. > Or if you like: no matter what you do, the first feature submission > that makes use of a given atomic op is going to suffer pain. I don't > want to still be suffering that pain two or three years out. The shorter > the list of allowed atomic ops, the sooner we'll be done with debugging > them. Yea. As, partially, even evidenced by this discussion ;). Which would you like to see go? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, (Coming back to this now) On 2013-10-28 21:55:22 +0100, Andres Freund wrote: > The list I have previously suggested was: > > * pg_atomic_load_u32(uint32 *) > * uint32 pg_atomic_store_u32(uint32 *) > > To be able to write code without spreading volatiles all over while > making it very clear that that part of the code is worthy of attention. > > * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val) > * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval) So what I've previously proposed did use plain types for the to-be-manipulated atomic integers. I am not sure about that anymore though, C11 includes support for atomic operations, but it assumes that the to-be-manipulated variable is of a special "atomic" type. While I am not propose that we try to emulate C11's API completely (basically impossible as it uses type agnostic functions, it also exposes too much), it seems to make sense to allow PG's atomics to be implemented by C11's. The API is described at: http://en.cppreference.com/w/c/atomic The fundamental difference to what I've suggested so far is that the atomic variable isn't a plain integer/type but a distinct datatype that can *only* be manipulated via the atomic operators. That has the nice property that it cannot be accidentally manipulated via plain operators. E.g. it wouldn't be uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_); but uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_); where, for now, atomic_uint32 is something like: typedef struct pg_atomic_uint32 {volatile uint32 value; } pg_atomic_uint32; a future C11 implementation would just typedef C11's respective atomic type to pg_atomic_uint32. Comments? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 5, 2013 at 10:51 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > (Coming back to this now) > > On 2013-10-28 21:55:22 +0100, Andres Freund wrote: >> The list I have previously suggested was: >> >> * pg_atomic_load_u32(uint32 *) >> * uint32 pg_atomic_store_u32(uint32 *) >> >> To be able to write code without spreading volatiles all over while >> making it very clear that that part of the code is worthy of attention. >> >> * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val) >> * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval) > > So what I've previously proposed did use plain types for the > to-be-manipulated atomic integers. I am not sure about that anymore > though, C11 includes support for atomic operations, but it assumes that > the to-be-manipulated variable is of a special "atomic" type. While I am > not propose that we try to emulate C11's API completely (basically > impossible as it uses type agnostic functions, it also exposes too > much), it seems to make sense to allow PG's atomics to be implemented by > C11's. > > The API is described at: http://en.cppreference.com/w/c/atomic > > The fundamental difference to what I've suggested so far is that the > atomic variable isn't a plain integer/type but a distinct datatype that > can *only* be manipulated via the atomic operators. That has the nice > property that it cannot be accidentally manipulated via plain operators. > > E.g. it wouldn't be > uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_); > but > uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_); > > where, for now, atomic_uint32 is something like: > typedef struct pg_atomic_uint32 > { > volatile uint32 value; > } pg_atomic_uint32; > a future C11 implementation would just typedef C11's respective atomic > type to pg_atomic_uint32. > > Comments? Sadly, I don't have a clear feeling for how well or poorly that's going to work out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 5, 2013 at 5:51 PM, Andres Freund <andres@2ndquadrant.com> wrote: > So what I've previously proposed did use plain types for the > to-be-manipulated atomic integers. I am not sure about that anymore > though, C11 includes support for atomic operations, but it assumes that > the to-be-manipulated variable is of a special "atomic" type. While I am > not propose that we try to emulate C11's API completely (basically > impossible as it uses type agnostic functions, it also exposes too > much), it seems to make sense to allow PG's atomics to be implemented by > C11's. > > The API is described at: http://en.cppreference.com/w/c/atomic > > The fundamental difference to what I've suggested so far is that the > atomic variable isn't a plain integer/type but a distinct datatype that > can *only* be manipulated via the atomic operators. That has the nice > property that it cannot be accidentally manipulated via plain operators. > > E.g. it wouldn't be > uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_); > but > uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_); > > where, for now, atomic_uint32 is something like: > typedef struct pg_atomic_uint32 > { > volatile uint32 value; > } pg_atomic_uint32; > a future C11 implementation would just typedef C11's respective atomic > type to pg_atomic_uint32. > > Comments? C11 atomics need to be initialized through atomic_init(), so a simple typedef will not be correct here. Also, C11 atomics are designed to have the compiler take care of memory barriers - so they might not always be a perfect match to our API's, or any API implementable without compiler support. However I'm mildly supportive on having a separate type for variables accessed by atomics. It can result in some unnecessary code churn, but in general if an atomic access to a variable is added, then all other accesses to it need to be audited for memory barriers, even if they were previously correctly synchronized by a lock. I guess the best approach for deciding would be to try to convert a couple of the existing unlocked accesses to the API and see what the patch looks like. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 2013-11-06 16:48:17 +0200, Ants Aasma wrote: > C11 atomics need to be initialized through atomic_init(), so a simple > typedef will not be correct here. Also, C11 atomics are designed to > have the compiler take care of memory barriers - so they might not > always be a perfect match to our API's, or any API implementable > without compiler support. Yea, I think we'll always need to have our layer above C11 atomics, but it still will be useful to allow them to be used to implement our atomics. FWIW, I find the requirement for atomic_init() really, really annoying. Not that that will change anything ;) > However I'm mildly supportive on having a separate type for variables > accessed by atomics. It can result in some unnecessary code churn, but > in general if an atomic access to a variable is added, then all other > accesses to it need to be audited for memory barriers, even if they > were previously correctly synchronized by a lock. Ok, that's what I am writing right now. > I guess the best approach for deciding would be to try to convert a > couple of the existing unlocked accesses to the API and see what the > patch looks like. I don't think there really exist any interesting ones? I am using my lwlock reimplementation as a testbed so far. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > FWIW, I find the requirement for atomic_init() really, really > annoying. Not that that will change anything ;) Me too, but I guess this allows them to emulate atomics on platforms that don't have native support. Whether that is a good idea is a separate question. >> However I'm mildly supportive on having a separate type for variables >> accessed by atomics. It can result in some unnecessary code churn, but >> in general if an atomic access to a variable is added, then all other >> accesses to it need to be audited for memory barriers, even if they >> were previously correctly synchronized by a lock. > > Ok, that's what I am writing right now. > >> I guess the best approach for deciding would be to try to convert a >> couple of the existing unlocked accesses to the API and see what the >> patch looks like. > > I don't think there really exist any interesting ones? I am using my > lwlock reimplementation as a testbed so far. BufferDesc management in src/backend/storage/buffer/bufmgr.c. The seqlock like thing used to provide consistent reads from PgBackendStatus src/backend/postmaster/pgstat.c (this code looks like it's broken on weakly ordered machines) The unlocked reads that are done from various procarray variables. The XLogCtl accesses in xlog.c. IMO the volatile keyword should be confined to the code actually implementing atomics, locking. The "use volatile pointer to prevent code rearrangement" comment is evil. If there are data races in the code, they need comments pointing this out and probably memory barriers too. If there are no data races, then the volatile declaration is useless and a pessimization. Currently it's more of a catch all "here be dragons" declaration for data structures. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 2013-11-06 08:15:59 -0500, Robert Haas wrote: > > The API is described at: http://en.cppreference.com/w/c/atomic > > > > The fundamental difference to what I've suggested so far is that the > > atomic variable isn't a plain integer/type but a distinct datatype that > > can *only* be manipulated via the atomic operators. That has the nice > > property that it cannot be accidentally manipulated via plain operators. > > > > E.g. it wouldn't be > > uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_); > > but > > uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_); > > > > where, for now, atomic_uint32 is something like: > > typedef struct pg_atomic_uint32 > > { > > volatile uint32 value; > > } pg_atomic_uint32; > > a future C11 implementation would just typedef C11's respective atomic > > type to pg_atomic_uint32. > > > > Comments? > > Sadly, I don't have a clear feeling for how well or poorly that's > going to work out. I've implemented it that way and it imo looks sensible. I dislike the pg_atomic_init_(flag|u32|u64) calls that are forced on us by wanting to be compatible with C11, but I think that doesn't end up being too bad. So, attached is a very first draft of an atomics implementation. There's lots to be done, but this is how I think it should roughly look like and what things we should implement in the beginning. The API should be understandable from looking at src/include/storage/atomics.h I haven't tested the IBM xlc, HPUX IA64 acc, Sunpro fallback at all, but the important part is that those provide the necessary tools to implement everything. Also, even the gcc implementation falls back to compare_and_swap() based implementations for the operations, but that shouldn't matter for now. I haven't looked for other compilers yet. Questions: * Fundamental issues with the API? * should we split of compiler/arch specific parts of into include/ports or such? -0.5 from me. * should we add src/include/storage/atomics subdirectory? +0.5 from me. * Do we really want to continue to cater to compilers not supporting PG_USE_INLINE? I've tried to add support for them, but it does make things considerably more #ifdef-y. Afaik it's only HPUX's acc that has problems, and it seems to work but generate warnings, so maybe that's ok? * Which additional compilers do we want to support? icc (might be gcc compatible)? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-11-06 17:47:45 +0200, Ants Aasma wrote: > On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > FWIW, I find the requirement for atomic_init() really, really > > annoying. Not that that will change anything ;) > > Me too, but I guess this allows them to emulate atomics on platforms > that don't have native support. Whether that is a good idea is a > separate question. Since static initialization is supported that would require quite some dirty hacks, but well, we're talking about compiler makers ;) > > I don't think there really exist any interesting ones? I am using my > > lwlock reimplementation as a testbed so far. > > BufferDesc management in src/backend/storage/buffer/bufmgr.c. > The unlocked reads that are done from various procarray variables. > The XLogCtl accesses in xlog.c. Those are actually only modified under spinlocks afair, so there isn't too much interesting happening. But yes, it'd be better if we used more explicit means to manipulate/query them. > The seqlock like thing used to provide consistent reads from > PgBackendStatus src/backend/postmaster/pgstat.c (this code looks like > it's broken on weakly ordered machines) Whoa. Never noticed that bit. The consequences of races are quite low, but still... > IMO the volatile keyword should be confined to the code actually > implementing atomics, locking. The "use volatile pointer to prevent > code rearrangement" comment is evil. If there are data races in the > code, they need comments pointing this out and probably memory > barriers too. If there are no data races, then the volatile > declaration is useless and a pessimization. Currently it's more of a > catch all "here be dragons" declaration for data structures. Agreed. But I don't think we can replace them all at once. Or well, I won't ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-06 17:24:57 +0100, Andres Freund wrote: > Questions: > * Do we really want to continue to cater to compilers not supporting > PG_USE_INLINE? I've tried to add support for them, but it does make > things considerably more #ifdef-y. > Afaik it's only HPUX's acc that has problems, and it seems to work but > generate warnings, so maybe that's ok? Maybe we can simply silence that specific warning for HPUX when using aC++ like in the attached patch? It's not like somebody really looks at those anyway given how bleaty it is. Or we can just generally remove the "quiet inline" check. I haven't tested the patch since I don't have a HPUX machine but that's the option to use according to http://h21007.www2.hp.com/portal/download/files/unprot/aCxx/Online_Help/options.htm#opt+Wargs Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, Instead of de-supporting platforms that don't have CAS support or providing parallel implementations we could relatively easily build a spinlock based fallback using the already existing requirement for tas(). Something like an array of 16 spinlocks, indexed by a more advanced version of ((char *)(&atomics) >> sizeof(char *)) % 16. The platforms that would fallback aren't that likely to be used under heavy concurrency, so the price for that shouldn't be too high. The only real problem with that would be that we'd need to remove the spinnlock fallback for barriers, but that seems to be pretty much disliked. Thoughts? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 11, 2013 at 1:57 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Instead of de-supporting platforms that don't have CAS support or > providing parallel implementations we could relatively easily build a > spinlock based fallback using the already existing requirement for > tas(). > Something like an array of 16 spinlocks, indexed by a more advanced > version of ((char *)(&atomics) >> sizeof(char *)) % 16. The platforms > that would fallback aren't that likely to be used under heavy > concurrency, so the price for that shouldn't be too high. > > The only real problem with that would be that we'd need to remove the > spinnlock fallback for barriers, but that seems to be pretty much > disliked. I think this is worth considering. I'm not too clear what to do about the barriers problem, though. I feel like we've dug ourselves into a bit of a hole, there, and I'm not sure I understand the issues well enough to dig us back out of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-11-12 13:21:30 -0500, Robert Haas wrote: > > The only real problem with that would be that we'd need to remove the > > spinnlock fallback for barriers, but that seems to be pretty much > > disliked. > > I think this is worth considering. Ok, cool. The prototype patch I have for that is pretty small, so it doesn't look too bad. What currently scares me is the amount of code I have to write that I can't test... I really can't see me being able to provide a patch that doesn't require some buildfarm cycles to really work on all platforms. > I'm not too clear what to do about > the barriers problem, though. I feel like we've dug ourselves into a > bit of a hole, there, and I'm not sure I understand the issues well > enough to dig us back out of it. I think any platform where we aren't able to provide a proper compiler/memory barrier will also have broken spinlock relase semantics (as in missing release memory barrier). So arguably removing the fallback is a good idea anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, This is the first real submission of how I think our atomics architecture should look like. It's definitely a good bit from complete, please view it from that POV! The most important thing I need at this point is feedback whether the API in general looks ok and what should be changed, before I spend the time to implement it nicely everywhere. The API is loosely modeled to match C11's atomics http://en.cppreference.com/w/c/atomic library, at least to the degree that the latter could be used to implement pg's atomics. Please note: * Only gcc's implementation is tested. I'd be really surprised if it compiled on anything else. * I've currently replaced nearly all of s_lock.h to use parts of the atomics API. We might rather use it's contents to implement atomics 'flags' on some of the wierder platforms out there. * I've integrated barrier.h into the atomics code. It's pretty much a required part, and compiler/arch specific so that seems to make sense. * 'autoreconf' needs to be run after applying the patchset. I got some problems with running autoconf 2.63 here, using 2.69 makes the diff to big. Questions: * What do you like/dislike about the API (storage/atomics.h) * decide whether it's ok to rely on inline functions or whether we need to provide out-of-line versions. I think we should just bite the bullet and require support. Some very old arches might need to live with warnings. Patch 01 tries to silence them on HP-UX. * decide whether we want to keep the semaphore fallback for spinlocks. I strongly vote for removing it. The patchset provides compiler based default implementations for atomics, so it's unlikely to be helpful when bringing up a new platform. TODO: * actually test on other platforms that new gcc * We should probably move most of the (documentation) content of s_lock.h somewhere else. * Integrate the spinlocks based fallback for platforms that only provide something like TAS(). Comments? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Fri, Nov 15, 2013 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Questions: > * What do you like/dislike about the API (storage/atomics.h) > * decide whether it's ok to rely on inline functions or whether we need > to provide out-of-line versions. I think we should just bite the > bullet and require support. Some very old arches might need to live with > warnings. Patch 01 tries to silence them on HP-UX. > * decide whether we want to keep the semaphore fallback for > spinlocks. I strongly vote for removing it. The patchset provides > compiler based default implementations for atomics, so it's unlikely > to be helpful when bringing up a new platform. On point #2, I don't personally know of any systems that I care about where inlining isn't supported. However, we've gone to quite a bit of trouble relatively recently to keep things working for platforms where that is the case, so I feel the need for an obligatory disclaimer: I will not commit any patch that moves the wood in that area unless there is massive consensus that this is an acceptable way forward. Similarly for point #3: Heikki not long ago went to the trouble of unbreaking --disable-spinlocks, which suggests that there is at least some interest in continuing to be able to run in that mode. I'm perfectly OK with the decision that we don't care about that any more, but I will not be the one to make that decision, and I think it requires a greater-than-normal level of consensus. On point #1, I dunno. It looks like a lot of rearrangement to me, and I'm not really sure what the final form of it is intended to be. I think it desperately needs a README explaining what the point of all this is and how to add support for a new platform or compiler if yours doesn't work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On point #2, I don't personally know of any systems that I care about > where inlining isn't supported. However, we've gone to quite a bit of > trouble relatively recently to keep things working for platforms where > that is the case, so I feel the need for an obligatory disclaimer: I > will not commit any patch that moves the wood in that area unless > there is massive consensus that this is an acceptable way forward. > Similarly for point #3: Heikki not long ago went to the trouble of > unbreaking --disable-spinlocks, which suggests that there is at least > some interest in continuing to be able to run in that mode. I'm > perfectly OK with the decision that we don't care about that any more, > but I will not be the one to make that decision, and I think it > requires a greater-than-normal level of consensus. Hm. Now that I think about it, isn't Peter proposing to break systems without working "inline" over here? http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net Maybe that patch needs a bit more discussion. I tend to agree that significantly moving our portability goalposts shouldn't be undertaken lightly. On the other hand, it is 2013, and so it seems not unreasonable to reconsider choices last made more than a dozen years ago. Here's an idea that might serve as a useful touchstone for making choices: compiler inadequacies are easier to fix than hardware/OS inadequacies. Even a dozen years ago, people could get gcc running on any platform of interest, and I seriously doubt that's less true today than back then. So if anyone complains "my compiler doesn't support 'const'", we could reasonably reply "so install gcc". On the other hand, if the complaint is "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them to buy a new server. regards, tom lane
On 11/19/13, 9:57 AM, Tom Lane wrote: > Hm. Now that I think about it, isn't Peter proposing to break systems > without working "inline" over here? > http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net No, that's about const, volatile, #, and memcmp. I don't have an informed opinion about requiring inline support (although it would surely be nice).
On 2013-11-19 09:57:20 -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On point #2, I don't personally know of any systems that I care about > > where inlining isn't supported. However, we've gone to quite a bit of > > trouble relatively recently to keep things working for platforms where > > that is the case, so I feel the need for an obligatory disclaimer: I > > will not commit any patch that moves the wood in that area unless > > there is massive consensus that this is an acceptable way forward. > > Similarly for point #3: Heikki not long ago went to the trouble of > > unbreaking --disable-spinlocks, which suggests that there is at least > > some interest in continuing to be able to run in that mode. I'm > > perfectly OK with the decision that we don't care about that any more, > > but I will not be the one to make that decision, and I think it > > requires a greater-than-normal level of consensus. > > Hm. Now that I think about it, isn't Peter proposing to break systems > without working "inline" over here? > http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net Hm. Those don't directly seem to affect our inline tests. Our test is PGAC_C_INLINE which itself uses AC_C_INLINE and then tests that unreferenced inlines don't generate warnings. > Maybe that patch needs a bit more discussion. I tend to agree that > significantly moving our portability goalposts shouldn't be undertaken > lightly. On the other hand, it is 2013, and so it seems not unreasonable > to reconsider choices last made more than a dozen years ago. I don't think there's even platforms out there that don't support const inlines, just some that unnecessarily generate warnings. But since builds on those platforms are already littered with warnings, that doesn't seem something we need to majorly be concerned with. The only animal we have that doesn't support quiet inlines today is HP-UX/ac++, and I think - as in patch 1 in the series - we might be able to simply suppress the warning there. > Here's an idea that might serve as a useful touchstone for making choices: > compiler inadequacies are easier to fix than hardware/OS inadequacies. > Even a dozen years ago, people could get gcc running on any platform of > interest, and I seriously doubt that's less true today than back then. > So if anyone complains "my compiler doesn't support 'const'", we could > reasonably reply "so install gcc". Agreed. > On the other hand, if the complaint is > "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them > to buy a new server. Agreed. I've am even wondering about 32bit CAS since it's not actually that hard to emulate it using spinlocks. Certainly less work than arguing about removing stone-age platforms ;) There's no fundamental issue with continuing to support semaphore based spinlocks I am just wondering about the point of supporting that configuration since it will always yield horrible performance. The only fundamental thing that I don't immediately see how we can support is the spinlock based memory barrier since that introduces a circularity (atomics need barrier, barrier needs spinlocks, spinlock needs atomics). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On point #2, I don't personally know of any systems that I care about >> where inlining isn't supported. However, we've gone to quite a bit of >> trouble relatively recently to keep things working for platforms where >> that is the case, so I feel the need for an obligatory disclaimer: I >> will not commit any patch that moves the wood in that area unless >> there is massive consensus that this is an acceptable way forward. >> Similarly for point #3: Heikki not long ago went to the trouble of >> unbreaking --disable-spinlocks, which suggests that there is at least >> some interest in continuing to be able to run in that mode. I'm >> perfectly OK with the decision that we don't care about that any more, >> but I will not be the one to make that decision, and I think it >> requires a greater-than-normal level of consensus. > > Hm. Now that I think about it, isn't Peter proposing to break systems > without working "inline" over here? > http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net > > Maybe that patch needs a bit more discussion. I tend to agree that > significantly moving our portability goalposts shouldn't be undertaken > lightly. On the other hand, it is 2013, and so it seems not unreasonable > to reconsider choices last made more than a dozen years ago. Sure. inline isn't C89, and to date, we've been quite rigorous about enforcing C89-compliance, so I'm going to try not to be the guy that casually breaks that. On the other hand, I haven't personally seen any systems that don't have inline in a very long time. The first systems I did development on were not even C89; they were K&R, and you had to use old-style function declarations. It's probably safe to say that they didn't support inline, either, though I don't recall trying it. But the last time I used a system like that was probably in the early nineties, and it's been a long time since then. However, I upgrade my hardware about every 2 years, so I'm not the best guy to say what the oldest stuff people still care about is. > Here's an idea that might serve as a useful touchstone for making choices: > compiler inadequacies are easier to fix than hardware/OS inadequacies. > Even a dozen years ago, people could get gcc running on any platform of > interest, and I seriously doubt that's less true today than back then. > So if anyone complains "my compiler doesn't support 'const'", we could > reasonably reply "so install gcc". On the other hand, if the complaint is > "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them > to buy a new server. I generally agree with that principle, but I'd offer the caveat that some people do still have valid reasons for wanting to use non-GCC compilers. I have been told that HP's aCC produces better results on Itanium than gcc, for example. Still, I think we could probably make a list of no more than half-a-dozen compilers that we really care about - the popular open source ones (is there anything we need to care about other than gcc and clang?) and the ones supported by large vendors (IBM's ICC and HP's aCC being the ones that I know about) and adopt features that are supported by everything on that list. Another point is that, some releases ago, we began requiring working 64-bit arithmetic as per commit d15cb38dec01fcc8d882706a3fa493517597c76b. We might want to go through our list of nominally-supported platforms and see whether any of them would fail to compile for that reason. Based on the commit message for the aforementioned commit, such platforms haven't been viable for PostgreSQL since 8.3, so we're not losing anything by canning them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 19, 2013 at 10:16 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> On the other hand, if the complaint is >> "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them >> to buy a new server. > > Agreed. I've am even wondering about 32bit CAS since it's not actually > that hard to emulate it using spinlocks. Certainly less work than > arguing about removing stone-age platforms ;) > > There's no fundamental issue with continuing to support semaphore based > spinlocks I am just wondering about the point of supporting that > configuration since it will always yield horrible performance. > > The only fundamental thing that I don't immediately see how we can > support is the spinlock based memory barrier since that introduces a > circularity (atomics need barrier, barrier needs spinlocks, spinlock > needs atomics). We've been pretty much assuming for a long time that calling a function in another translation unit acts as a compiler barrier. There's a lot of code that isn't actually safe against global optimization; we assume, for example, that memory accesses can't move over an LWLockAcquire(), but that's just using spinlocks internally, and those aren't guaranteed to be compiler barriers, per previous discussion. So one idea for a compiler barrier is just to define a function call pg_compiler_barrier() in a file by itself, and make that the fallback implementation. That will of course fail if someone uses a globally optimizing compiler, but I think it'd be OK to say that if you want to do that, you'd better have a real barrier implementation. Right now, it's probably unsafe regardless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-11-19 09:12:42 -0500, Robert Haas wrote: > On point #1, I dunno. It looks like a lot of rearrangement to me, and > I'm not really sure what the final form of it is intended to be. Understandable. I am not that sure what parts we want to rearange either. We very well could leave barrier.h and s_lock.h untouched and just maintain the atomics stuff in parallel and switch over at some later point. I've mainly switched over s_lock.h to a) get some good coverage of the atomics code b) make sure the api works fine for it. We could add the atomics.h based implementation as a fallback for the cases in which no native implementation is available. > I think it desperately needs a README explaining what the point of all > this is and how to add support for a new platform or compiler if yours > doesn't work. Ok, I'll work on that. It would be useful to get feedback on the "frontend" API in atomics.h though. If people are unhappy with the API it might very well change how we can implement the API on different architectures. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 10:26 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-19 09:12:42 -0500, Robert Haas wrote: >> On point #1, I dunno. It looks like a lot of rearrangement to me, and >> I'm not really sure what the final form of it is intended to be. > > Understandable. I am not that sure what parts we want to rearange > either. We very well could leave barrier.h and s_lock.h untouched and > just maintain the atomics stuff in parallel and switch over at some > later point. > I've mainly switched over s_lock.h to a) get some good coverage of the > atomics code b) make sure the api works fine for it. We could add the > atomics.h based implementation as a fallback for the cases in which no > native implementation is available. > >> I think it desperately needs a README explaining what the point of all >> this is and how to add support for a new platform or compiler if yours >> doesn't work. > > Ok, I'll work on that. It would be useful to get feedback on the > "frontend" API in atomics.h though. If people are unhappy with the API > it might very well change how we can implement the API on different > architectures. Sure, but if people can't understand the API, then it's hard to decide whether to be unhappy with it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Peter Eisentraut <peter_e@gmx.net> writes: > On 11/19/13, 9:57 AM, Tom Lane wrote: >> Hm. Now that I think about it, isn't Peter proposing to break systems >> without working "inline" over here? >> http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net > No, that's about const, volatile, #, and memcmp. Ah, sorry, not enough caffeine absorbed yet. Still, we should stop to think about whether this represents an undesirable move of the portability goalposts. The first three of these are certainly compiler issues, and I personally don't have a problem with blowing off compilers that still haven't managed to implement all of C89 :-(. I'm not clear on which systems had the memcmp issue --- do we have the full story on that? > I don't have an informed opinion about requiring inline support > (although it would surely be nice). inline is C99, and we've generally resisted requiring C99 features. Maybe it's time to move that goalpost, and maybe not. regards, tom lane
On 2013-11-19 10:23:57 -0500, Robert Haas wrote: > > The only fundamental thing that I don't immediately see how we can > > support is the spinlock based memory barrier since that introduces a > > circularity (atomics need barrier, barrier needs spinlocks, spinlock > > needs atomics). > > We've been pretty much assuming for a long time that calling a > function in another translation unit acts as a compiler barrier. > There's a lot of code that isn't actually safe against global > optimization; we assume, for example, that memory accesses can't move > over an LWLockAcquire(), but that's just using spinlocks internally, > and those aren't guaranteed to be compiler barriers, per previous > discussion. So one idea for a compiler barrier is just to define a > function call pg_compiler_barrier() in a file by itself, and make that > the fallback implementation. That will of course fail if someone uses > a globally optimizing compiler, but I think it'd be OK to say that if > you want to do that, you'd better have a real barrier implementation. That works for compiler, but not for memory barriers :/ > Right now, it's probably unsafe regardless. Yes, I have pretty little trust in the current state. Both from the infrastructure perspective (spinlocks, barriers) as from individual pieces of code. To a good part we're probably primarily protected by x86's black magic and the fact that everyone with sufficient concurrency to see problems uses x86. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-19 10:23:57 -0500, Robert Haas wrote: >> > The only fundamental thing that I don't immediately see how we can >> > support is the spinlock based memory barrier since that introduces a >> > circularity (atomics need barrier, barrier needs spinlocks, spinlock >> > needs atomics). >> >> We've been pretty much assuming for a long time that calling a >> function in another translation unit acts as a compiler barrier. >> There's a lot of code that isn't actually safe against global >> optimization; we assume, for example, that memory accesses can't move >> over an LWLockAcquire(), but that's just using spinlocks internally, >> and those aren't guaranteed to be compiler barriers, per previous >> discussion. So one idea for a compiler barrier is just to define a >> function call pg_compiler_barrier() in a file by itself, and make that >> the fallback implementation. That will of course fail if someone uses >> a globally optimizing compiler, but I think it'd be OK to say that if >> you want to do that, you'd better have a real barrier implementation. > > That works for compiler, but not for memory barriers :/ True, but we already assume that a spinlock is a memory barrier minus a compiler barrier. So if you have a working compiler barrier, you ought to be able to fix spinlocks to be memory barriers. And then, if you need a memory barrier for some other purpose, you can always fall back to acquiring and releasing a spinlock. Maybe that's too contorted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > I don't have an informed opinion about requiring inline support > > (although it would surely be nice). > > inline is C99, and we've generally resisted requiring C99 features. > Maybe it's time to move that goalpost, and maybe not. But it's a part of C99 that was very widely implemented before, so even if we don't want to rely on C99 in its entirety, relying on inline support is realistic. I think, independent from atomics, the readability & maintainability win by relying on inline functions instead of long macros, potentially with multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is significant. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > The only animal we have that doesn't support quiet inlines today is > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able > to simply suppress the warning there. Or just not worry about it, if it's only a warning? Or does the warning mean code bloat (lots of useless copies of the inline function)? regards, tom lane
On 2013-11-19 10:37:35 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > The only animal we have that doesn't support quiet inlines today is > > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able > > to simply suppress the warning there. > > Or just not worry about it, if it's only a warning? Or does the warning > mean code bloat (lots of useless copies of the inline function)? I honestly have no idea whether it causes code bloat - I'd be surprised if it did since it detects that they are unused, but I cannot rule it out entirely. The suggested patch - untested since I have no access to HP-UX - just adds +W2177 to the compiler's commandline in template/hpux which supposedly suppressed that warning. I think removing the quiet inline test is a good idea, but that doesn't preclude fixing the warnings at the same time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2013-11-19 10:27:57 -0500, Robert Haas wrote: > > Ok, I'll work on that. It would be useful to get feedback on the > > "frontend" API in atomics.h though. If people are unhappy with the API > > it might very well change how we can implement the API on different > > architectures. > > Sure, but if people can't understand the API, then it's hard to decide > whether to be unhappy with it. Fair point, while there's already an explantory blurb ontop of atomics.h and the semantics of the functions should all be documented, it doesn't explain the API on a high level. I've expanded it a bit The attached patches compile and make check successfully on linux x86, amd64 and msvc x86 and amd64. This time and updated configure is included. The majority of operations still rely on CAS emulation. I've copied the introduction here for easier reading: /*------------------------------------------------------------------------- * * atomics.h * Generic atomic operations support. * * Hardware and compiler dependent functions for manipulating memory * atomically and dealing with cache coherency. Used to implement locking * facilities and other concurrency safe constructs. * * There's three data types which can be manipulated atomically: * * * pg_atomic_flag - supports atomic test/clear operations, useful for * implementing spinlocks and similar constructs. * * * pg_atomic_uint32 - unsigned 32bit integer that can correctly manipulated * by several processes at the same time. * * * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support * can be tested by checking for PG_HAVE_64_BIT_ATOMICS. * * The values stored in theses cannot be accessed using the API provided in * thise file, i.e. it is not possible to write statements like `var += * 10`. Instead, for this example, one would have to use * `pg_atomic_fetch_add_u32(&var, 10)`. * * This restriction has several reasons: Primarily it doesn't allow non-atomic * math to be performed on these values which wouldn't be concurrency * safe. Secondly it allows to implement these types ontop of facilities - * like C11's atomics - that don't provide direct access. Lastly it serves as * a useful hint what code should be looked at more carefully because of * concurrency concerns. * * To be useful they usually will need to be placed in memory shared between * processes or threads, most frequently by embedding them in structs. Be * careful to align atomic variables to their own size! * * Before variables of one these types can be used they needs to be * initialized using the type's initialization function (pg_atomic_init_flag, * pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global * pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These * initializations have to be performed before any (possibly concurrent) * access is possible, otherwise the results are undefined. * * For several mathematical operations two variants exist: One that returns * the old value before the operation was performed, and one that that returns * the new value. *_fetch_<op> variants will return the old value, * *_<op>_fetch the new one. * * * To bring up postgres on a platform/compiler at the very least * either one of * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag() * * pg_atomic_compare_exchange_u32() * * pg_atomic_exchange_u32() * and all of * * pg_compiler_barrier(), pg_memory_barrier()a * need to be implemented. There exist generic, hardware independent, * implementations for several compilers which might be sufficient, although * possibly not optimal, for a new platform. * * To add support for a new architecture review whether the compilers used on * that platform already provide adequate support in the files included * "Compiler specific" section below. If not either add an include for a new * file adding generic support for your compiler there and/or add/update an * architecture specific include in the "Architecture specific" section below. * * Operations that aren't implemented by either architecture or compiler * specific code are implemented ontop of compare_exchange() loops or * spinlocks. * * Use higher level functionality (lwlocks, spinlocks, heavyweight locks) * whenever possible. Writing correct code using these facilities is hard. * * For an introduction to using memory barriers within the PostgreSQL backend, * see src/backend/storage/lmgr/README.barrier * * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * * src/include/storage/atomics.h * *------------------------------------------------------------------------- */ Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Tue, Nov 19, 2013 at 11:38 AM, Andres Freund <andres@2ndquadrant.com> wrote: > /*------------------------------------------------------------------------- > * > * atomics.h > * Generic atomic operations support. > * > * Hardware and compiler dependent functions for manipulating memory > * atomically and dealing with cache coherency. Used to implement locking > * facilities and other concurrency safe constructs. > * > * There's three data types which can be manipulated atomically: > * > * * pg_atomic_flag - supports atomic test/clear operations, useful for > * implementing spinlocks and similar constructs. > * > * * pg_atomic_uint32 - unsigned 32bit integer that can correctly manipulated > * by several processes at the same time. > * > * * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support > * can be tested by checking for PG_HAVE_64_BIT_ATOMICS. > * > * The values stored in theses cannot be accessed using the API provided in > * thise file, i.e. it is not possible to write statements like `var += > * 10`. Instead, for this example, one would have to use > * `pg_atomic_fetch_add_u32(&var, 10)`. > * > * This restriction has several reasons: Primarily it doesn't allow non-atomic > * math to be performed on these values which wouldn't be concurrency > * safe. Secondly it allows to implement these types ontop of facilities - > * like C11's atomics - that don't provide direct access. Lastly it serves as > * a useful hint what code should be looked at more carefully because of > * concurrency concerns. > * > * To be useful they usually will need to be placed in memory shared between > * processes or threads, most frequently by embedding them in structs. Be > * careful to align atomic variables to their own size! What does that mean exactly? > * Before variables of one these types can be used they needs to be > * initialized using the type's initialization function (pg_atomic_init_flag, > * pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global > * pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These > * initializations have to be performed before any (possibly concurrent) > * access is possible, otherwise the results are undefined. > * > * For several mathematical operations two variants exist: One that returns > * the old value before the operation was performed, and one that that returns > * the new value. *_fetch_<op> variants will return the old value, > * *_<op>_fetch the new one. Ugh. Do we really need this much complexity? > * Use higher level functionality (lwlocks, spinlocks, heavyweight locks) > * whenever possible. Writing correct code using these facilities is hard. > * > * For an introduction to using memory barriers within the PostgreSQL backend, > * see src/backend/storage/lmgr/README.barrier The extent to which these primitives themselves provide ordering guarantees should be documented. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-11-19 12:43:44 -0500, Robert Haas wrote: > > * To be useful they usually will need to be placed in memory shared between > > * processes or threads, most frequently by embedding them in structs. Be > > * careful to align atomic variables to their own size! > > What does that mean exactly? The alignment part? A pg_atomic_flag needs to be aligned to sizeof(pg_atomic_flag), pg_atomic_uint32,64s to their sizeof() respectively. When embedding them inside a struct the whole struct needs to be allocated at least at that alignment. Not sure how to describe that concisely. I've added wrapper functions around the implementation that test for alignment to make sure it's easy to spot such mistakes which could end up having ugly results on some platforms. > > * For several mathematical operations two variants exist: One that returns > > * the old value before the operation was performed, and one that that returns > > * the new value. *_fetch_<op> variants will return the old value, > > * *_<op>_fetch the new one. > > Ugh. Do we really need this much complexity? Well, based on my experience it's really what you need when writing code using it. The _<op>_fetch variants are implemented generically using the _fetch_<op> operations, so leaving them out wouldn't win us much. What would you like to remove? > > * Use higher level functionality (lwlocks, spinlocks, heavyweight locks) > > * whenever possible. Writing correct code using these facilities is hard. > > * > > * For an introduction to using memory barriers within the PostgreSQL backend, > > * see src/backend/storage/lmgr/README.barrier > > The extent to which these primitives themselves provide ordering > guarantees should be documented. Every function should have a comment to that regard, although two have an XXX where I am not sure what we want to mandate, specifically whether pg_atomic_test_set_flag() should be a full or acquire barrier and whether pg_atomic_clear_flag should be full or release. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote: > On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > > I don't have an informed opinion about requiring inline support > > > (although it would surely be nice). > > > > inline is C99, and we've generally resisted requiring C99 features. > > Maybe it's time to move that goalpost, and maybe not. > > But it's a part of C99 that was very widely implemented before, so even > if we don't want to rely on C99 in its entirety, relying on inline > support is realistic. > > I think, independent from atomics, the readability & maintainability win > by relying on inline functions instead of long macros, potentially with > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is > significant. Oh, man, my fastgetattr() macro is going to be simplified. All my good work gets rewritten. ;-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote: > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote: > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > > > I don't have an informed opinion about requiring inline support > > > > (although it would surely be nice). > > > > > > inline is C99, and we've generally resisted requiring C99 features. > > > Maybe it's time to move that goalpost, and maybe not. > > > > But it's a part of C99 that was very widely implemented before, so even > > if we don't want to rely on C99 in its entirety, relying on inline > > support is realistic. > > > > I think, independent from atomics, the readability & maintainability win > > by relying on inline functions instead of long macros, potentially with > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is > > significant. > > Oh, man, my fastgetattr() macro is going to be simplified. All my good > work gets rewritten. ;-) That and HeapKeyTest() alone are sufficient reason for this ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote: > On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote: > > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote: > > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > > > > I don't have an informed opinion about requiring inline support > > > > > (although it would surely be nice). > > > > > > > > inline is C99, and we've generally resisted requiring C99 features. > > > > Maybe it's time to move that goalpost, and maybe not. > > > > > > But it's a part of C99 that was very widely implemented before, so even > > > if we don't want to rely on C99 in its entirety, relying on inline > > > support is realistic. > > > > > > I think, independent from atomics, the readability & maintainability win > > > by relying on inline functions instead of long macros, potentially with > > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is > > > significant. > > > > Oh, man, my fastgetattr() macro is going to be simplified. All my good > > work gets rewritten. ;-) > > That and HeapKeyTest() alone are sufficient reason for this ;) Has there been any performance testing on this rewrite to use atomics? If so, can I missed it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote: > On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote: > > On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote: > > > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote: > > > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > > > > > I don't have an informed opinion about requiring inline support > > > > > > (although it would surely be nice). > > > > > > > > > > inline is C99, and we've generally resisted requiring C99 features. > > > > > Maybe it's time to move that goalpost, and maybe not. > > > > > > > > But it's a part of C99 that was very widely implemented before, so even > > > > if we don't want to rely on C99 in its entirety, relying on inline > > > > support is realistic. > > > > > > > > I think, independent from atomics, the readability & maintainability win > > > > by relying on inline functions instead of long macros, potentially with > > > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is > > > > significant. > > > > > > Oh, man, my fastgetattr() macro is going to be simplified. All my good > > > work gets rewritten. ;-) > > > > That and HeapKeyTest() alone are sufficient reason for this ;) > > Has there been any performance testing on this rewrite to use atomics? > If so, can I missed it. Do you mean inline? Or atomics? If the former no, if the latter yes. I've started on it because of http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote: > On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote: > > On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote: > > > On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote: > > > > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote: > > > > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote: > > > > > > > I don't have an informed opinion about requiring inline support > > > > > > > (although it would surely be nice). > > > > > > > > > > > > inline is C99, and we've generally resisted requiring C99 features. > > > > > > Maybe it's time to move that goalpost, and maybe not. > > > > > > > > > > But it's a part of C99 that was very widely implemented before, so even > > > > > if we don't want to rely on C99 in its entirety, relying on inline > > > > > support is realistic. > > > > > > > > > > I think, independent from atomics, the readability & maintainability win > > > > > by relying on inline functions instead of long macros, potentially with > > > > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is > > > > > significant. > > > > > > > > Oh, man, my fastgetattr() macro is going to be simplified. All my good > > > > work gets rewritten. ;-) > > > > > > That and HeapKeyTest() alone are sufficient reason for this ;) > > > > Has there been any performance testing on this rewrite to use atomics? > > If so, can I missed it. > > Do you mean inline? Or atomics? If the former no, if the latter > yes. I've started on it because of > http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de Yes, I was wondering about atomics. I think we know the performance characteristics of inlining. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2013-11-19 17:25:21 -0500, Bruce Momjian wrote: > On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote: > > On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote: > > > On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote: > > Do you mean inline? Or atomics? If the former no, if the latter > > yes. I've started on it because of > > http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de > > Yes, I was wondering about atomics. I think we know the performance > characteristics of inlining. In that case it really depends on what we do with the atomics, providing the abstraction itself shouldn't change performance at all, but the possible scalability wins of using them in some critical paths are pretty huge. From a prototype I have, ontop of the number in the above post, removing the spinlock from PinBuffer and UnpinBuffer() (only the !BM_PIN_COUNT_WAITER path) gives another factor of two on the used machine, although that required limiting MAX_BACKENDS to 0xfffff instead of the current 0x7fffff. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-19 10:37:35 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > The only animal we have that doesn't support quiet inlines today is > > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able > > to simply suppress the warning there. > > Or just not worry about it, if it's only a warning? So, my suggestion on that end is that we remove the requirement for quiet inline now and see if it that has any negative consequences on the buildfarm for a week or so. Imo that's a good idea regardless of us relying on inlining support. Does anyone have anything against that plan? If not, I'll prepare a patch. > Or does the warning > mean code bloat (lots of useless copies of the inline function)? After thinking on that for a bit, yes that's a possible consequence, but it's quite possible that it happens in cases where we don't get the warning too, so I don't think that argument has too much bearing as I don't recall a complaint about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
This patch didn't make it out of the 2013-11 commit fest. You should move it to the next commit fest (probably with an updated patch) before January 15th.
On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-19 10:37:35 -0500, Tom Lane wrote: >> Andres Freund <andres@2ndquadrant.com> writes: >> > The only animal we have that doesn't support quiet inlines today is >> > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able >> > to simply suppress the warning there. >> >> Or just not worry about it, if it's only a warning? > > So, my suggestion on that end is that we remove the requirement for > quiet inline now and see if it that has any negative consequences on the > buildfarm for a week or so. Imo that's a good idea regardless of us > relying on inlining support. > Does anyone have anything against that plan? If not, I'll prepare a > patch. > >> Or does the warning >> mean code bloat (lots of useless copies of the inline function)? > > After thinking on that for a bit, yes that's a possible consequence, but > it's quite possible that it happens in cases where we don't get the > warning too, so I don't think that argument has too much bearing as I > don't recall a complaint about it. But I bet the warning we're getting there is complaining about exactly that thing. It's possible that it means "warning: i've optimized away your unused function into nothing" but I think it's more likely that it means "warning: i just emitted dead code". Indeed, I suspect that this is why that test is written that way in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-12-23 15:04:08 -0500, Robert Haas wrote: > On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-11-19 10:37:35 -0500, Tom Lane wrote: > >> Or does the warning > >> mean code bloat (lots of useless copies of the inline function)? > > > > After thinking on that for a bit, yes that's a possible consequence, but > > it's quite possible that it happens in cases where we don't get the > > warning too, so I don't think that argument has too much bearing as I > > don't recall a complaint about it. > > But I bet the warning we're getting there is complaining about exactly > that thing. It's possible that it means "warning: i've optimized away > your unused function into nothing" but I think it's more likely that > it means "warning: i just emitted dead code". Indeed, I suspect that > this is why that test is written that way in the first place. I don't think that's particularly likely - remember, we're talking about static inline functions here. So the compiler would have to detect that there's an inline function which won't be used and then explicitly emit a function body. That seems like an odd thing to do. So, if there's compilers like that around, imo they have to deal with their own problems. It's not like it will prohibit using postgres. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote: > The attached patches compile and make check successfully on linux x86, > amd64 and msvc x86 and amd64. This time and updated configure is > included. The majority of operations still rely on CAS emulation. Note that the atomics-generic-acc.h file has ® characters in the Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird characters, I think you should make sure they're UTF-8 Regards, Marti
On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote: > On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> The attached patches compile and make check successfully on linux x86, >> amd64 and msvc x86 and amd64. This time and updated configure is >> included. The majority of operations still rely on CAS emulation. > > Note that the atomics-generic-acc.h file has ® characters in the > Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird > characters, I think you should make sure they're UTF-8 I think we're better off avoiding them altogether. What's wrong with (R)? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-01-21 10:20:35 -0500, Robert Haas wrote: > On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote: > > On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> The attached patches compile and make check successfully on linux x86, > >> amd64 and msvc x86 and amd64. This time and updated configure is > >> included. The majority of operations still rely on CAS emulation. > > > > Note that the atomics-generic-acc.h file has ® characters in the > > Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird > > characters, I think you should make sure they're UTF-8 > > I think we're better off avoiding them altogether. What's wrong with (R)? Nothing at all. That was just the copied title from the pdf, it's gone now. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, [sorry for the second copy Robert] Attached is a new version of the atomic operations patch. Lots has changed since the last post: * gcc, msvc work. acc, xlc, sunpro have blindly written support which should be relatively easy to fix up. All try to implement TAS, 32 bit atomic add, 32 bit compare exchange; some do it for 64bit as well. * x86 gcc inline assembly means that we support every gcc version on x86 >= i486; even when the __sync APIs aren't available yet. * 'inline' support isn't required anymore. We fall back to ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery. * When the current platform doesn't have support for atomic operations a spinlock backed implementation is used. This can be forced using --disable-atomics. That even works when semaphores are used to implement spinlocks (a separate array is used there to avoid nesting problems). It contrast to an earlier implementation this even works when atomics are mapped to different addresses in individual processes (think dsm). * s_lock.h falls back to the atomics.h provided APIs iff it doesn't have native support for the current platform. This can be forced by defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler intrinsics based implementations that should make it easier to bring up postgres on different platfomrs. * I tried to improve the documentation of the facilities in src/include/storage/atomics.h. Including documentation of the various barrier semantics. * There's tests in regress.c that get call via a SQL function from the regression tests. * Lots of other details changed, but that's the major pieces. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 06/25/2014 10:14 AM, Andres Freund wrote: > Hi, > > [sorry for the second copy Robert] > > Attached is a new version of the atomic operations patch. Lots has > changed since the last post: Is this at a state where we can performance-test it yet? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2014-06-25 10:39:53 -0700, Josh Berkus wrote: > On 06/25/2014 10:14 AM, Andres Freund wrote: > > Hi, > > > > [sorry for the second copy Robert] > > > > Attached is a new version of the atomic operations patch. Lots has > > changed since the last post: > > Is this at a state where we can performance-test it yet? Well, this patch itself won't have a performance impact at all. It's about proving the infrastructure to use atomic operations in further patches in a compiler and platform independent way. I'll repost a new version of the lwlocks patch (still fighting an issue I introduced while rebasing ntop of Heikki's xlog scalability/lwlock merger) soon. Addressing the review comments Amit has made since. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 25, 2014 at 1:14 PM, Andres Freund <andres@2ndquadrant.com> wrote: > [sorry for the second copy Robert] > > Attached is a new version of the atomic operations patch. Lots has > changed since the last post: > > * gcc, msvc work. acc, xlc, sunpro have blindly written support which > should be relatively easy to fix up. All try to implement TAS, 32 bit > atomic add, 32 bit compare exchange; some do it for 64bit as well. > > * x86 gcc inline assembly means that we support every gcc version on x86 > >= i486; even when the __sync APIs aren't available yet. > > * 'inline' support isn't required anymore. We fall back to > ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery. > > * When the current platform doesn't have support for atomic operations a > spinlock backed implementation is used. This can be forced using > --disable-atomics. > > That even works when semaphores are used to implement spinlocks (a > separate array is used there to avoid nesting problems). It contrast > to an earlier implementation this even works when atomics are mapped > to different addresses in individual processes (think dsm). > > * s_lock.h falls back to the atomics.h provided APIs iff it doesn't have > native support for the current platform. This can be forced by > defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler > intrinsics based implementations that should make it easier to bring > up postgres on different platfomrs. > > * I tried to improve the documentation of the facilities in > src/include/storage/atomics.h. Including documentation of the various > barrier semantics. > > * There's tests in regress.c that get call via a SQL function from the > regression tests. > > * Lots of other details changed, but that's the major pieces. Review: - The changes to spin.c include unnecessary whitespace adjustments. - The new argument to s_init_lock_sema() isn't used. - "on top" is two words, not one ("ontop"). - The changes to pg_config_manual.h add a superfluous blank line. - Are the regression tests in regress.c likely to catch anything? Really? - The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which seems to be testing for __builtin_constant_p. I don't see that being used in the patch anywhere, though; and also, why doesn't the naming match? - s_lock.h adds some fallback code to use atomics to define TAS on platforms where GCC supports atomics but where we do not have a working TAS implementation. However, this supposed fallback code defines HAS_TEST_AND_SET unconditionally, so I think it will get used even if we don't have (or have disabled via configure) atomics. - I completely loathe the file layout you've proposed in src/include/storage. If we're going to have separate #include files for each architecture (and I'd rather we didn't go that route, because it's messy and unlike what we have now), then shouldn't that stuff go in src/include/port or a new directory src/include/port/atomics? Polluting src/include/storage with a whole bunch of files called atomics-whatever.h strikes me as completely ugly, and there's a lot of duplicate code in there. - What's the point of defining pg_read_barrier() to be pg_read_barrier_impl() and pg_write_barrier() to be pg_write_barrier_impl() and so forth? That seems like unnecessary notation. - Should SpinlockSemaInit() be assigning SpinlockSemas() to a local variable instead of re-executing it every time? Granted, the compiler might inline this somehow, but... More generally, I'm really wondering if you're making the compare-and-swap implementation a lot more complicated than it needs to be. How much would we lose if we supported compiler intrinsics on versions of GCC and MSVC that have it and left everything else to future patches? I suspect this patch could be about 20% of its current size and give us everything that we need. I've previously opposed discarding s_lock.h on the grounds that it's extremely well battle-tested, and if we change it, we might introduce subtle bugs that are dependent on GCC versions and such. But now that compiler intrinsics are relatively common, I don't really see any reason for us to provide our own assembler versions of *new* primitives we want to use. As long as we have some kind of wrapper functions or macros, we retain the *option* to add workarounds for compiler bugs or lack of compiler support on platforms, but it seems an awful lot simpler to me to start by assuming that that the compiler will DTRT, and only roll our own if that proves to be necessary. It's not like our hand-rolled implementations couldn't have bugs - which is different from s_lock.h, where we have evidence that the exact coding in that file does or did work on those platforms. Similarly, I would rip out - or separate into a separate patch - the code that tries to use atomic operations to implement TAS if we don't already have an implementation in s_lock.h. That might be worth doing, if there are any common architectures that don't already have TAS, or to simplify porting to new platforms, but it seems like a separate thing from what this patch is mainly about, which is enabling the use of CAS and related operations on platforms that support them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2014-06-25 15:54:37 -0400, Robert Haas wrote: > - The new argument to s_init_lock_sema() isn't used. It's used when atomics fallback to spinlocks which fall back to semaphores. c.f. atomics.c. Since it better be legal to manipulate a atomic variable while holding a spinlock we cannot simply use an arbitrary spinlock as backing for atomics. That'd possibly cause us to wait on ourselves or cause deadlocks. > - Are the regression tests in regress.c likely to catch anything? > Really? They already cought several bugs. If the operations were immediately widely used they probably wouldn't be necessary. > - The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which > seems to be testing for __builtin_constant_p. I don't see that being > used in the patch anywhere, though; and also, why doesn't the naming > match? Uh. that's a rebase screweup. Should entirely be gone. > - s_lock.h adds some fallback code to use atomics to define TAS on > platforms where GCC supports atomics but where we do not have a > working TAS implementation. However, this supposed fallback code > defines HAS_TEST_AND_SET unconditionally, so I think it will get used > even if we don't have (or have disabled via configure) atomics. Hm. Good point. It should protect against that. > - I completely loathe the file layout you've proposed in > src/include/storage. If we're going to have separate #include files > for each architecture (and I'd rather we didn't go that route, because > it's messy and unlike what we have now), then shouldn't that stuff go > in src/include/port or a new directory src/include/port/atomics? > Polluting src/include/storage with a whole bunch of files called > atomics-whatever.h strikes me as completely ugly, and there's a lot of > duplicate code in there. I don't mind moving it somewhere else and it's easy enough to change. > - What's the point of defining pg_read_barrier() to be > pg_read_barrier_impl() and pg_write_barrier() to be > pg_write_barrier_impl() and so forth? That seems like unnecessary > notation. We could forgo it for pg_read_barrier() et al, but for the actual atomics it's useful because it allows to put common checks in the !impl routines. > - Should SpinlockSemaInit() be assigning SpinlockSemas() to a local > variable instead of re-executing it every time? Granted, the compiler > might inline this somehow, but... I sure hope it will, but still a good idea. > More generally, I'm really wondering if you're making the > compare-and-swap implementation a lot more complicated than it needs > to be. Possible. > How much would we lose if we supported compiler intrinsics on > versions of GCC and MSVC that have it and left everything else to > future patches? The discussion so far seemed pretty clear that we can't regress somewhat frequently used platforms. And dropping support for all but msvc and gcc would end up doing that. We're going to have to do the legword for the most common platforms... Note that I didn't use assembler for those, but relied on intrinsics... > I suspect this patch could be about 20% of its current size and give > us everything that we need. I think the only way to do that would be to create a s_lock.h like maze. Which imo is a utterly horrible idea. I'm pretty sure there'll be more assembler implementations for 9.5 and unless we think ahead of how to structure that we'll create something bad. I also think that a year or so down the road we're going to support more operations. E.g. optional support for a 16byte CAS can allow for some awesome things when you want to do lockless stuff on a 64bit platform. I think there's chances for reducing the size by merging i386/amd64, that looked better earlier than it does now (due to the addition of the inline assembler). > I've previously > opposed discarding s_lock.h on the grounds that it's extremely well > battle-tested, and if we change it, we might introduce subtle bugs > that are dependent on GCC versions and such. I think we should entirely get rid of s_lock.h. It's an unmaintainable mess. That's what I'd done in an earlier version of the patch, but you'd argued against it. I think you're right in that it shouldn't be done in the same patch and possibly not even in the same cycle, but I think we better do it sometime not too far away. I e.g. don't see how we can guarantee sane barrier semantics with the current implementation. > But now that compiler > intrinsics are relatively common, I don't really see any reason for us > to provide our own assembler versions of *new* primitives we want to > use. As long as we have some kind of wrapper functions or macros, we > retain the *option* to add workarounds for compiler bugs or lack of > compiler support on platforms, but it seems an awful lot simpler to me > to start by assuming that that the compiler will DTRT, and only roll > our own if that proves to be necessary. It's not like our hand-rolled > implementations couldn't have bugs - which is different from s_lock.h, > where we have evidence that the exact coding in that file does or did > work on those platforms. I added the x86 inline assembler because a fair number of buildfarm animals use ancient gcc's and because I could easily test it. It's also more efficient for gcc < 4.6. I'm not wedded to keeping it. > Similarly, I would rip out - or separate into a separate patch - the > code that tries to use atomic operations to implement TAS if we don't > already have an implementation in s_lock.h. That might be worth > doing, if there are any common architectures that don't already have > TAS, or to simplify porting to new platforms, but it seems like a > separate thing from what this patch is mainly about, which is enabling > the use of CAS and related operations on platforms that support them. Happy to separate that out. It's also very useful to test code btw... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06/25/2014 11:36 PM, Andres Freund wrote: > >> >- I completely loathe the file layout you've proposed in >> >src/include/storage. If we're going to have separate #include files >> >for each architecture (and I'd rather we didn't go that route, because >> >it's messy and unlike what we have now), then shouldn't that stuff go >> >in src/include/port or a new directory src/include/port/atomics? >> >Polluting src/include/storage with a whole bunch of files called >> >atomics-whatever.h strikes me as completely ugly, and there's a lot of >> >duplicate code in there. > I don't mind moving it somewhere else and it's easy enough to change. I think having a separate file for each architecture is nice. I totally agree that they don't belong in src/include/storage, though. s_lock.h has always been misplaced there, but we've let it be for historical reasons, but now that we're adding a dozen new files, it's time to move them out. - Heikki
Heikki Linnakangas wrote: > On 06/25/2014 11:36 PM, Andres Freund wrote: > > > >>>- I completely loathe the file layout you've proposed in > >>>src/include/storage. If we're going to have separate #include files > >>>for each architecture (and I'd rather we didn't go that route, because > >>>it's messy and unlike what we have now), then shouldn't that stuff go > >>>in src/include/port or a new directory src/include/port/atomics? > >>>Polluting src/include/storage with a whole bunch of files called > >>>atomics-whatever.h strikes me as completely ugly, and there's a lot of > >>>duplicate code in there. > >I don't mind moving it somewhere else and it's easy enough to change. > > I think having a separate file for each architecture is nice. I > totally agree that they don't belong in src/include/storage, though. > s_lock.h has always been misplaced there, but we've let it be for > historical reasons, but now that we're adding a dozen new files, > it's time to move them out. We also have a bunch of files in src/backend/storage/lmgr, both current and introduced by this patch, which would be good to relocate. (Really, how did lmgr/ end up in storage/ in the first place?) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Since it better be legal to manipulate a atomic variable while holding a > spinlock we cannot simply use an arbitrary spinlock as backing for > atomics. That'd possibly cause us to wait on ourselves or cause > deadlocks. I think that's going to fall afoul of Tom's previously-articulated "no loops inside spinlocks" rule. Most atomics, by nature, are loop-until-it-works. >> How much would we lose if we supported compiler intrinsics on >> versions of GCC and MSVC that have it and left everything else to >> future patches? > > The discussion so far seemed pretty clear that we can't regress somewhat > frequently used platforms. And dropping support for all but msvc and gcc > would end up doing that. We're going to have to do the legword for the > most common platforms... Note that I didn't use assembler for those, but > relied on intrinsics... We can't *regress* somewhat-frequently used platforms, but that's different from saying we've got to support *new* facilities on those platforms. Essentially the entire buildfarm is running either gcc, some Microsoft compiler, or a compiler that's supports the same atomics intrinsics as gcc i.e. icc or clang. Some of those compilers may be too old to support the atomics intrinsics, and there's one Sun Studio animal, but I don't know that we need to care about those things in this patch... ...unless of course the atomics fallbacks in this patch are sufficiently sucky that anyone who ends up using those is going to be sad. Then the bar is a lot higher. But if that's the case then I wonder if we're really on the right course here. > I added the x86 inline assembler because a fair number of buildfarm > animals use ancient gcc's and because I could easily test it. It's also > more efficient for gcc < 4.6. I'm not wedded to keeping it. Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 implementations aren't that good, that's a pretty good argument for keeping our own implementations around. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I think having a separate file for each architecture is nice. I totally > agree that they don't belong in src/include/storage, though. s_lock.h has > always been misplaced there, but we've let it be for historical reasons, but > now that we're adding a dozen new files, it's time to move them out. I find the current organization pretty confusing, but maybe that could be solved by better documentation of what's supposed to go in each architecture or compiler-dependent file. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-25 20:16:08 -0400, Robert Haas wrote: > On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Since it better be legal to manipulate a atomic variable while holding a > > spinlock we cannot simply use an arbitrary spinlock as backing for > > atomics. That'd possibly cause us to wait on ourselves or cause > > deadlocks. > > I think that's going to fall afoul of Tom's previously-articulated "no > loops inside spinlocks" rule. Most atomics, by nature, are > loop-until-it-works. Well, so is TAS itself :). More seriously, I think we're not going to have much fun if we're making up the rule that you can't do an atomic add/sub while a spinlock is held. That just precludes to many use cases and will make the code much harder to understand. I don't think we're going to end up having many problems if we allow atomic read/add/sub/write in there. > >> How much would we lose if we supported compiler intrinsics on > >> versions of GCC and MSVC that have it and left everything else to > >> future patches? > > > > The discussion so far seemed pretty clear that we can't regress somewhat > > frequently used platforms. And dropping support for all but msvc and gcc > > would end up doing that. We're going to have to do the legword for the > > most common platforms... Note that I didn't use assembler for those, but > > relied on intrinsics... > > We can't *regress* somewhat-frequently used platforms, but that's > different from saying we've got to support *new* facilities on those > platforms. Well, we want to use the new facilities somewhere. And it's not entirely unfair to call it a regression if other os/compiler/arch combinations speed up but yours don't. > Essentially the entire buildfarm is running either gcc, > some Microsoft compiler, or a compiler that's supports the same > atomics intrinsics as gcc i.e. icc or clang. Some of those compilers > may be too old to support the atomics intrinsics, and there's one Sun > Studio animal, but I don't know that we need to care about those > things in this patch... I think we should support those where it's easy and where we'll see the breakage on the buildfarm. Sun studio's intrinsics are simple enough that I don't think my usage of them will be too hard to fix. > ...unless of course the atomics fallbacks in this patch are > sufficiently sucky that anyone who ends up using those is going to be > sad. Then the bar is a lot higher. But if that's the case then I > wonder if we're really on the right course here. I don't you'll notice it much on mostly nonconcurrent uses. But some of those architectures/platforms do support larger NUMA like setups. And for those it'd probably hurt for some workloads. And e.g. solaris is still fairly common for such setups. > > I added the x86 inline assembler because a fair number of buildfarm > > animals use ancient gcc's and because I could easily test it. It's also > > more efficient for gcc < 4.6. I'm not wedded to keeping it. > > Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 > implementations aren't that good, that's a pretty good argument for > keeping our own implementations around. :-( The difference isn't that big. It's return __atomic_compare_exchange_n(&ptr->value, expected, newval, false, __ATOMIC_SEQ_CST,__ATOMIC_SEQ_CST); vs bool ret; uint64 current; current = __sync_val_compare_and_swap(&ptr->value, *expected, newval); ret = current == *expected; *expected = current; return ret; On x86 that's an additional cmp, mov and such. Useless pos because cmpxchg already computes all that... (that's returned by the setz in my patch). MSVC only supports the second form as well. There's a fair number of < 4.2 gccs on the buildfarm as well though. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-25 20:22:31 -0400, Robert Haas wrote: > On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > > I think having a separate file for each architecture is nice. I totally > > agree that they don't belong in src/include/storage, though. s_lock.h has > > always been misplaced there, but we've let it be for historical reasons, but > > now that we're adding a dozen new files, it's time to move them out. > > I find the current organization pretty confusing, but maybe that could > be solved by better documentation of what's supposed to go in each > architecture or compiler-dependent file. The idea is that first a architecture specific file (atomics-arch-*.h) is included. That file can provide a (partial) implementation for the specific architecture. Or it can do pretty much nothing. After that a compiler specific file is included (atomics-generic-*.h). If atomics aren't yet implemented that can provide an intrinsics based implementation if the compiler version has support for it. At the very least a compiler barrier should be provided. After that the spinlock based fallback implementation (atomics-fallback.h) provides atomics and barriers if not yet available. By here we're sure that nothing else will provide them. Then we can provide operations (atomics-generic.h) that build ontop of the provided functions. E.g. implement _sub, _and et al. I'll include some more of that explanation in the header. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-25 20:16:08 -0400, Robert Haas wrote: >> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > Since it better be legal to manipulate a atomic variable while holding a >> > spinlock we cannot simply use an arbitrary spinlock as backing for >> > atomics. That'd possibly cause us to wait on ourselves or cause >> > deadlocks. >> >> I think that's going to fall afoul of Tom's previously-articulated "no >> loops inside spinlocks" rule. Most atomics, by nature, are >> loop-until-it-works. > > Well, so is TAS itself :). > > More seriously, I think we're not going to have much fun if we're making > up the rule that you can't do an atomic add/sub while a spinlock is > held. That just precludes to many use cases and will make the code much > harder to understand. I don't think we're going to end up having many > problems if we allow atomic read/add/sub/write in there. That rule seems reasonable -- why would you ever want to do this? While you couldn't properly deadlock it seems like it could lead to unpredictable and hard to diagnose performance stalls. merlin
On 2014-06-26 07:44:14 -0500, Merlin Moncure wrote: > On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-06-25 20:16:08 -0400, Robert Haas wrote: > >> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > Since it better be legal to manipulate a atomic variable while holding a > >> > spinlock we cannot simply use an arbitrary spinlock as backing for > >> > atomics. That'd possibly cause us to wait on ourselves or cause > >> > deadlocks. > >> > >> I think that's going to fall afoul of Tom's previously-articulated "no > >> loops inside spinlocks" rule. Most atomics, by nature, are > >> loop-until-it-works. > > > > Well, so is TAS itself :). > > > > More seriously, I think we're not going to have much fun if we're making > > up the rule that you can't do an atomic add/sub while a spinlock is > > held. That just precludes to many use cases and will make the code much > > harder to understand. I don't think we're going to end up having many > > problems if we allow atomic read/add/sub/write in there. > > That rule seems reasonable -- why would you ever want to do this? Are you wondering why you'd ever manipulate an atomic op inside a spinlock? There's enough algorithms where a slowpath is done under a spinlock but the fastpath is done without. You can't simply use nonatomic operations for manipulation under the spinlock because the fastpaths might then observe/cause bogus state. Obviously I'm not advocating to do random stuff in spinlocks. W're talking about single lock xadd;s (or the LL/SC) equivalent here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 26, 2014 at 12:20:06PM +0200, Andres Freund wrote: > On 2014-06-25 20:16:08 -0400, Robert Haas wrote: > > On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > > Since it better be legal to manipulate a atomic variable while holding a > > > spinlock we cannot simply use an arbitrary spinlock as backing for > > > atomics. That'd possibly cause us to wait on ourselves or cause > > > deadlocks. > > > > I think that's going to fall afoul of Tom's previously-articulated "no > > loops inside spinlocks" rule. Most atomics, by nature, are > > loop-until-it-works. > > Well, so is TAS itself :). > > More seriously, I think we're not going to have much fun if we're making > up the rule that you can't do an atomic add/sub while a spinlock is > held. That just precludes to many use cases and will make the code much > harder to understand. I don't think we're going to end up having many > problems if we allow atomic read/add/sub/write in there. I agree it's valuable to have a design that permits invoking an atomic operation while holding a spinlock. > > > I added the x86 inline assembler because a fair number of buildfarm > > > animals use ancient gcc's and because I could easily test it. It's also > > > more efficient for gcc < 4.6. I'm not wedded to keeping it. > > > > Hmm. gcc 4.6 is only just over a year old, so if pre-4.6 > > implementations aren't that good, that's a pretty good argument for > > keeping our own implementations around. :-( GCC 4.6 was released 2011-03-25, though that doesn't change the bottom line. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-25 20:16:08 -0400, Robert Haas wrote: >> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > Since it better be legal to manipulate a atomic variable while holding a >> > spinlock we cannot simply use an arbitrary spinlock as backing for >> > atomics. That'd possibly cause us to wait on ourselves or cause >> > deadlocks. >> >> I think that's going to fall afoul of Tom's previously-articulated "no >> loops inside spinlocks" rule. Most atomics, by nature, are >> loop-until-it-works. > > Well, so is TAS itself :). Yeah, which is why we have a rule that you're not suppose to acquire another spinlock while already holding one. You're trying to make the spinlocks used to simulate atomic ops an exception to that rule, but I'm not sure that's OK. An atomic op is a lot more expensive than a regular unlocked load or store even if it doesn't loop, and if it does loop, it's worse, potentially by a large multiple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 26, 2014 at 7:21 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-25 20:22:31 -0400, Robert Haas wrote: >> On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >> > I think having a separate file for each architecture is nice. I totally >> > agree that they don't belong in src/include/storage, though. s_lock.h has >> > always been misplaced there, but we've let it be for historical reasons, but >> > now that we're adding a dozen new files, it's time to move them out. >> >> I find the current organization pretty confusing, but maybe that could >> be solved by better documentation of what's supposed to go in each >> architecture or compiler-dependent file. > > The idea is that first a architecture specific file (atomics-arch-*.h) > is included. That file can provide a (partial) implementation for the > specific architecture. Or it can do pretty much nothing. > > After that a compiler specific file is included > (atomics-generic-*.h). If atomics aren't yet implemented that can > provide an intrinsics based implementation if the compiler version has > support for it. At the very least a compiler barrier should be provided. > > After that the spinlock based fallback implementation > (atomics-fallback.h) provides atomics and barriers if not yet > available. By here we're sure that nothing else will provide them. > > Then we can provide operations (atomics-generic.h) that build ontop of > the provided functions. E.g. implement _sub, _and et al. > > I'll include some more of that explanation in the header. I get the general principle, but I think it would be good to have one place that says something like: Each architecture must provide A and B, may provide both or neither of C and D, and may also any or all of E, F, and G. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2014-06-25 20:16:08 -0400, Robert Haas wrote: >>> I think that's going to fall afoul of Tom's previously-articulated "no >>> loops inside spinlocks" rule. Most atomics, by nature, are >>> loop-until-it-works. >> Well, so is TAS itself :). > Yeah, which is why we have a rule that you're not suppose to acquire > another spinlock while already holding one. You're trying to make the > spinlocks used to simulate atomic ops an exception to that rule, but > I'm not sure that's OK. An atomic op is a lot more expensive than a > regular unlocked load or store even if it doesn't loop, and if it does > loop, it's worse, potentially by a large multiple. Well, the point here is to be sure that there's a reasonably small bound on how long we hold the spinlock. It doesn't have to be zero, just small. Would it be practical to say that the coding rule is that you can only use atomic ops on fields that are protected by the spinlock, ie, nobody else is supposed to be touching those fields while you have the spinlock? If that's the case, then the atomic op should see no contention and thus not take very long. On the other hand, if the atomic op is touching something not protected by the spinlock, that seems to me to be morally equivalent to taking a spinlock while holding another one, which as Robert says is forbidden by our current coding rules, and for very good reasons IMO. However, that may be safe only as long as we have real atomic ops. If we're simulating those using spinlocks then you have to ask questions about how many underlying spinlocks there are and whether artificial contention might ensue (due to the same spinlock being used for multiple things). regards, tom lane
On 2014-06-26 11:47:15 -0700, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> On 2014-06-25 20:16:08 -0400, Robert Haas wrote: > >>> I think that's going to fall afoul of Tom's previously-articulated "no > >>> loops inside spinlocks" rule. Most atomics, by nature, are > >>> loop-until-it-works. > > >> Well, so is TAS itself :). > > > Yeah, which is why we have a rule that you're not suppose to acquire > > another spinlock while already holding one. You're trying to make the > > spinlocks used to simulate atomic ops an exception to that rule, but > > I'm not sure that's OK. An atomic op is a lot more expensive than a > > regular unlocked load or store even if it doesn't loop, and if it does > > loop, it's worse, potentially by a large multiple. > > Well, the point here is to be sure that there's a reasonably small bound > on how long we hold the spinlock. It doesn't have to be zero, just small. > > Would it be practical to say that the coding rule is that you can only > use atomic ops on fields that are protected by the spinlock, ie, nobody > else is supposed to be touching those fields while you have the spinlock? > If that's the case, then the atomic op should see no contention and thus > not take very long. I don't really see usecases where it's not related data that's being touched, but: The point is that the fastpath (not using a spinlock) might touch the atomic variable, even while the slowpath has acquired the spinlock. So the slowpath can't just use non-atomic operations on the atomic variable. Imagine something like releasing a buffer pin while somebody else is doing something that requires holding the buffer header spinlock. > On the other hand, if the atomic op is touching something not protected > by the spinlock, that seems to me to be morally equivalent to taking a > spinlock while holding another one, which as Robert says is forbidden > by our current coding rules, and for very good reasons IMO. > > However, that may be safe only as long as we have real atomic ops. > If we're simulating those using spinlocks then you have to ask questions > about how many underlying spinlocks there are and whether artificial > contention might ensue (due to the same spinlock being used for multiple > things). With the code as submitted it's one spinlock per atomic variable. Which is fine until spinlocks are emulated using semaphores - in that case a separate array of semaphores (quite small in the patch as submitted) is used so there's no chance of deadlocks against a 'real' spinlock or anything like that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 26, 2014 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > Would it be practical to say that the coding rule is that you can only > use atomic ops on fields that are protected by the spinlock, ie, nobody > else is supposed to be touching those fields while you have the spinlock? > If that's the case, then the atomic op should see no contention and thus > not take very long. I wonder if this is true in all cases. The address you are locking might be logically protected but at the same time nearby to other memory that is under contention. In other words, I don't think you can assume an atomic op locks exactly 4 bytes of memory for the operation. > On the other hand, if the atomic op is touching something not protected > by the spinlock, that seems to me to be morally equivalent to taking a > spinlock while holding another one, which as Robert says is forbidden > by our current coding rules, and for very good reasons IMO. Well not quite: you don't have the possibility of deadlock. merlin
On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I don't really see usecases where it's not related data that's being > touched, but: The point is that the fastpath (not using a spinlock) might > touch the atomic variable, even while the slowpath has acquired the > spinlock. So the slowpath can't just use non-atomic operations on the > atomic variable. > Imagine something like releasing a buffer pin while somebody else is > doing something that requires holding the buffer header spinlock. If the atomic variable can be manipulated without the spinlock under *any* circumstances, then how is it a good idea to ever manipulate it with the spinlock? That seems hard to reason about, and unnecessary. Critical sections for spinlocks should be short and contain only the instructions that need protection, and clearly the atomic op is not one of those. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-27 13:15:25 -0400, Robert Haas wrote: > On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > I don't really see usecases where it's not related data that's being > > touched, but: The point is that the fastpath (not using a spinlock) might > > touch the atomic variable, even while the slowpath has acquired the > > spinlock. So the slowpath can't just use non-atomic operations on the > > atomic variable. > > Imagine something like releasing a buffer pin while somebody else is > > doing something that requires holding the buffer header spinlock. > > If the atomic variable can be manipulated without the spinlock under > *any* circumstances, then how is it a good idea to ever manipulate it > with the spinlock? That seems hard to reason about, and unnecessary. > Critical sections for spinlocks should be short and contain only the > instructions that need protection, and clearly the atomic op is not > one of those. Imagine the situation for the buffer header spinlock which is one of the bigger performance issues atm. We could aim to replace all usages of that with clever and complicated logic, but it's hard. The IMO reasonable (and prototyped) way to do it is to make the common paths lockless, but fall back to the spinlock for the more complicated situations. For the buffer header that means that pin/unpin and buffer lookup are lockless, but IO and changing the identity of a buffer still require the spinlock. My attempts to avoid the latter basically required a buffer header specific reimplementation of spinlocks. To make pin/unpin et al safe without acquiring the spinlock the pincount and the flags had to be moved into one variable so that when incrementing the pincount you automatically get the flagbits back. If it's currently valid all is good, otherwise the spinlock needs to be acquired. But for that to work you need to set flags while the spinlock is held. Possibly you can come up with ways to avoid that, but I am pretty damn sure that the code will be less robust by forbidding atomics inside a critical section, not the contrary. It's a good idea to avoid it, but not at all costs. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-27 13:15:25 -0400, Robert Haas wrote: >> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > I don't really see usecases where it's not related data that's being >> > touched, but: The point is that the fastpath (not using a spinlock) might >> > touch the atomic variable, even while the slowpath has acquired the >> > spinlock. So the slowpath can't just use non-atomic operations on the >> > atomic variable. >> > Imagine something like releasing a buffer pin while somebody else is >> > doing something that requires holding the buffer header spinlock. >> >> If the atomic variable can be manipulated without the spinlock under >> *any* circumstances, then how is it a good idea to ever manipulate it >> with the spinlock? That seems hard to reason about, and unnecessary. >> Critical sections for spinlocks should be short and contain only the >> instructions that need protection, and clearly the atomic op is not >> one of those. > > Imagine the situation for the buffer header spinlock which is one of the > bigger performance issues atm. We could aim to replace all usages of > that with clever and complicated logic, but it's hard. > > The IMO reasonable (and prototyped) way to do it is to make the common > paths lockless, but fall back to the spinlock for the more complicated > situations. For the buffer header that means that pin/unpin and buffer > lookup are lockless, but IO and changing the identity of a buffer still > require the spinlock. My attempts to avoid the latter basically required > a buffer header specific reimplementation of spinlocks. > > To make pin/unpin et al safe without acquiring the spinlock the pincount > and the flags had to be moved into one variable so that when > incrementing the pincount you automatically get the flagbits back. If > it's currently valid all is good, otherwise the spinlock needs to be > acquired. But for that to work you need to set flags while the spinlock > is held. > > Possibly you can come up with ways to avoid that, but I am pretty damn > sure that the code will be less robust by forbidding atomics inside a > critical section, not the contrary. It's a good idea to avoid it, but > not at all costs. You might be right, but I'm not convinced. Does the lwlock scalability patch work this way, too? I was hoping to look at that RSN, so if there are roughly the same issues there it might shed some light on this point also. What I'm basically afraid of is that this will work fine in many cases but have really ugly failure modes. That's pretty much what happens with spinlocks already - the overhead is insignificant at low levels of contention but as the spinlock starts to become contended the CPUs all fight over the cache line and performance goes to pot. ISTM that making the spinlock critical section significantly longer by putting atomic ops in there could appear to win in some cases while being very bad in others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> Attached is a new version of the atomic operations patch. Lots has
> changed since the last post:
>
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which
> should be relatively easy to fix up. All try to implement TAS, 32 bit
> atomic add, 32 bit compare exchange; some do it for 64bit as well.
>
> Attached is a new version of the atomic operations patch. Lots has
> changed since the last post:
>
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which
> should be relatively easy to fix up. All try to implement TAS, 32 bit
> atomic add, 32 bit compare exchange; some do it for 64bit as well.
I have reviewed msvc part of patch and below are my findings:
1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);
Is there a reason why using InterlockedCompareExchange() is better
than using InterlockedCompareExchangeAcquire() variant?
*Acquire or *Release variants can provide acquire memory ordering
semantics for processors which support the same else will be defined
as InterlockedCompareExchange.
2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}
Can we implement this API without changing expected value?
I mean if the InterlockedCompareExchange() is success, then it will
store the newval in ptr->value, else *ret* will be false.
I am not sure if there is anything implementation specific in changing
*expected*.
I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it. This will have another
advantage that our pg_* API will also have similar signature as native API's.
3. Is there a overhead of using Add, instead of increment/decrement
version of Interlocked?
I could not find any evidence which can clearly indicate, if one is better
than other except some recommendation in below link which suggests to
*maintain reference count*, use Increment/Decrement
Refer *The Locking Cookbook* in below link:
However, when I tried to check the instructions each function generates,
then I don't find anything which suggests that *Add variant can be slow.
Refer Instructions for both functions:
cPublicProc _InterlockedIncrement,1
cPublicFpo 1,0
mov ecx,Addend ; get pointer to addend variable
mov eax,1 ; set increment value
Lock1:
lock xadd [ecx],eax ; interlocked increment
inc eax ; adjust return value
stdRET _InterlockedIncrement ;
stdENDP _InterlockedIncrement
cPublicProc _InterlockedExchangeAdd, 2
cPublicFpo 2,0
mov ecx, [esp + 4] ; get addend address
mov eax, [esp + 8] ; get increment value
Lock4:
lock xadd [ecx], eax ; exchange add
stdRET _InterlockedExchangeAdd
stdENDP _InterlockedExchangeAdd
For details, refer link:
I don't think there is any need of change in current implementation,
however I just wanted to share my analysis, so that if any one else
can see any point in preferring one or the other way of implementation.
4.
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif
There is a Caution notice in microsoft site indicating
_ReadWriteBarrier/MemoryBarrier are deprected.
Please refer below link:
When I checked previous versions of MSVC, I noticed that these are
supported on x86, IPF, x64 till VS2008 and supported only on x86, x64
for VS2012 onwards.
Link for VS2010 support:
5.
+ * atomics-generic-msvc.h
..
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
Shouldn't copyright notice be 2014?
6.
pg_atomic_compare_exchange_u32()
It is better to have comments above this and all other related functions.
On 2014-06-28 11:58:41 +0530, Amit Kapila wrote: > On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > > > Attached is a new version of the atomic operations patch. Lots has > > changed since the last post: > > > > * gcc, msvc work. acc, xlc, sunpro have blindly written support which > > should be relatively easy to fix up. All try to implement TAS, 32 bit > > atomic add, 32 bit compare exchange; some do it for 64bit as well. > > I have reviewed msvc part of patch and below are my findings: Thanks for looking at it - I pretty much wrote that part blindly. > 1. > +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, > + uint32 *expected, uint32 newval) > +{ > + bool ret; > + uint32 current; > + current = InterlockedCompareExchange(&ptr->value, newval, *expected); > > Is there a reason why using InterlockedCompareExchange() is better > than using InterlockedCompareExchangeAcquire() variant? Two things: a) compare_exchange is a read/write operator and so far I've defined it to have full barrier semantics (documented in atomics.h).I'd be happy to discuss which semantics we want for which operation. b) It's only supported from vista onwards. Afaik we still support XP. c) It doesn't make any difference on x86 ;) > *Acquire or *Release variants can provide acquire memory ordering > semantics for processors which support the same else will be defined > as InterlockedCompareExchange. > 2. > +pg_atomic_compare_exchange_u32_impl() > { > .. > + *expected = current; > } > > Can we implement this API without changing expected value? > I mean if the InterlockedCompareExchange() is success, then it will > store the newval in ptr->value, else *ret* will be false. > I am not sure if there is anything implementation specific in changing > *expected*. > I think this value is required for lwlock patch, but I am wondering why > can't the same be achieved if we just return the *current* value and > then let lwlock patch do the handling based on it. This will have another > advantage that our pg_* API will also have similar signature as native > API's. Many usages of compare/exchange require that you'll get the old value back in an atomic fashion. Unfortunately the Interlocked* API doesn't provide a nice way to do that. Note that this definition of compare/exchange both maps nicely to C11's atomics and the actual x86 cmpxchg instruction. I've generally tried to mimick C11's API as far as it's possible. There's some limitations because we can't do generic types and such, but other than that. > 3. Is there a overhead of using Add, instead of increment/decrement > version of Interlocked? There's a theoretical difference for IA64 which has a special increment/decrement operation which can only change the value by a rather limited number of values. I don't think it's worth to care. > 4. > #pragma intrinsic(_ReadWriteBarrier) > #define pg_compiler_barrier_impl() _ReadWriteBarrier() > > #ifndef pg_memory_barrier_impl > #define pg_memory_barrier_impl() MemoryBarrier() > #endif > > There is a Caution notice in microsoft site indicating > _ReadWriteBarrier/MemoryBarrier are deprected. It seemed to be the most widely available API, and it's what barrier.h already used. Do you have a different suggestion? > 6. > pg_atomic_compare_exchange_u32() > > It is better to have comments above this and all other related functions. Check atomics.h, there's comments above it: /** pg_atomic_compare_exchange_u32 - CAS operation** Atomically compare the current value of ptr with *expected and storenewval* iff ptr and *expected have the same value. The current value of *ptr will* always be stored in *expected.**Return whether the values have been exchanged.** Full barrier semantics.*/ Thanks, Andres Freund
On 2014-06-27 22:47:02 -0400, Robert Haas wrote: > On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-06-27 13:15:25 -0400, Robert Haas wrote: > >> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > I don't really see usecases where it's not related data that's being > >> > touched, but: The point is that the fastpath (not using a spinlock) might > >> > touch the atomic variable, even while the slowpath has acquired the > >> > spinlock. So the slowpath can't just use non-atomic operations on the > >> > atomic variable. > >> > Imagine something like releasing a buffer pin while somebody else is > >> > doing something that requires holding the buffer header spinlock. > >> > >> If the atomic variable can be manipulated without the spinlock under > >> *any* circumstances, then how is it a good idea to ever manipulate it > >> with the spinlock? That seems hard to reason about, and unnecessary. > >> Critical sections for spinlocks should be short and contain only the > >> instructions that need protection, and clearly the atomic op is not > >> one of those. > > > > Imagine the situation for the buffer header spinlock which is one of the > > bigger performance issues atm. We could aim to replace all usages of > > that with clever and complicated logic, but it's hard. > > > > The IMO reasonable (and prototyped) way to do it is to make the common > > paths lockless, but fall back to the spinlock for the more complicated > > situations. For the buffer header that means that pin/unpin and buffer > > lookup are lockless, but IO and changing the identity of a buffer still > > require the spinlock. My attempts to avoid the latter basically required > > a buffer header specific reimplementation of spinlocks. > > > > To make pin/unpin et al safe without acquiring the spinlock the pincount > > and the flags had to be moved into one variable so that when > > incrementing the pincount you automatically get the flagbits back. If > > it's currently valid all is good, otherwise the spinlock needs to be > > acquired. But for that to work you need to set flags while the spinlock > > is held. > > > > Possibly you can come up with ways to avoid that, but I am pretty damn > > sure that the code will be less robust by forbidding atomics inside a > > critical section, not the contrary. It's a good idea to avoid it, but > > not at all costs. > > You might be right, but I'm not convinced. Why? Anything I can do to convince you of this? Note that other users of atomics (e.g. the linux kernel) widely use atomics inside spinlock protected regions. > Does the lwlock scalability patch work this way, too? No. I've moved one add/sub inside a critical section in the current version while debugging, but it's not required at all. I generally think it makes sense to document the suggestion of moving atomics outside, but not make it a requirement. > What I'm basically afraid of is that this will work fine in many cases > but have really ugly failure modes. That's pretty much what happens > with spinlocks already - the overhead is insignificant at low levels > of contention but as the spinlock starts to become contended the CPUs > all fight over the cache line and performance goes to pot. ISTM that > making the spinlock critical section significantly longer by putting > atomic ops in there could appear to win in some cases while being very > bad in others. Well, I'm not saying it's something I suggest doing all the time. But if using an atomic op in the slow path allows you to remove the spinlock from 99% of the cases I don't see it having a significant impact. In most scenarios (where atomics aren't emulated, i.e. platforms we expect to used in heavily concurrent cases) the spinlock and the atomic will be on the same cacheline making stalls much less likely. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06/27/2014 08:15 PM, Robert Haas wrote: > On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I don't really see usecases where it's not related data that's being >> touched, but: The point is that the fastpath (not using a spinlock) might >> touch the atomic variable, even while the slowpath has acquired the >> spinlock. So the slowpath can't just use non-atomic operations on the >> atomic variable. >> Imagine something like releasing a buffer pin while somebody else is >> doing something that requires holding the buffer header spinlock. > > If the atomic variable can be manipulated without the spinlock under > *any* circumstances, then how is it a good idea to ever manipulate it > with the spinlock? With the WALInsertLock scaling stuff in 9.4, there are now two variables protected by a spinlock: the current WAL insert location, and the prev pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need to grab the spinlock to update both of them atomically. But to just read the WAL insert pointer, you could do an atomic read of CurrBytePos if the architecture supports it - now you have to grab the spinlock. Admittedly that's just an atomic read, not an atomic compare and exchange or fetch-and-add. Or if the architecture has an atomic 128-bit compare & exchange op you could replace the spinlock with that. But it's not hard to imagine similar situations where you sometimes need to lock a larger data structure to modify it atomically, but sometimes you just need to modify part of it and an atomic op would suffice. I thought Andres' LWLock patch also did something like that. If the lock is not contended, you can acquire it with an atomic compare & exchange to increment the exclusive/shared counter. But to manipulate the wait queue, you need the spinlock. - Heikki
On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
> > 1.
> > +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
> > + uint32 *expected, uint32 newval)
> > +{
> > + bool ret;
> > + uint32 current;
> > + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
> >
> > Is there a reason why using InterlockedCompareExchange() is better
> > than using InterlockedCompareExchangeAcquire() variant?
>
> Two things:
> a) compare_exchange is a read/write operator and so far I've defined it
> to have full barrier semantics (documented in atomics.h). I'd be
> happy to discuss which semantics we want for which operation.
> c) It doesn't make any difference on x86 ;)
> > 2.
> > +pg_atomic_compare_exchange_u32_impl()
> > {
> > ..
> > + *expected = current;
> > }
> >
> > Can we implement this API without changing expected value?
..
> > I think this value is required for lwlock patch, but I am wondering why
> > can't the same be achieved if we just return the *current* value and
> > then let lwlock patch do the handling based on it. This will have another
> > advantage that our pg_* API will also have similar signature as native
> > API's.
>
> Many usages of compare/exchange require that you'll get the old value
> back in an atomic fashion. Unfortunately the Interlocked* API doesn't
> provide a nice way to do that.
> > 6.
> > pg_atomic_compare_exchange_u32()
> >
> > It is better to have comments above this and all other related functions.
>
> Check atomics.h, there's comments above it:
> /*
> * pg_atomic_compare_exchange_u32 - CAS operation
> *
> * Atomically compare the current value of ptr with *expected and store newval
> * iff ptr and *expected have the same value. The current value of *ptr will
> * always be stored in *expected.
> *
> * Return whether the values have been exchanged.
> *
> * Full barrier semantics.
> */
> On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
> > 1.
> > +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
> > + uint32 *expected, uint32 newval)
> > +{
> > + bool ret;
> > + uint32 current;
> > + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
> >
> > Is there a reason why using InterlockedCompareExchange() is better
> > than using InterlockedCompareExchangeAcquire() variant?
>
> Two things:
> a) compare_exchange is a read/write operator and so far I've defined it
> to have full barrier semantics (documented in atomics.h). I'd be
> happy to discuss which semantics we want for which operation.
I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just rely on volatile. I think it might not be sane in all cases,
example with Visual Studio 2003, volatile to volatile references are
ordered; the compiler will not re-order volatile variable access. However,
these operations could be re-ordered by the processor.
For detailed example, refer below link:
Also if we see C11 specs both load and store uses memory order as
memory_order_seq_cst by default which is same as for compare and
exchange
I have referred below link:
> b) It's only supported from vista onwards. Afaik we still support XP.
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif
The MemoryBarrier() macro used also supports only on vista onwards,
so I think for reasons similar to using MemoryBarrier() can apply for
this as well.
What about processors like Itanium which care about acquire/release
memory barrier semantics?
> > 2.
> > +pg_atomic_compare_exchange_u32_impl()
> > {
> > ..
> > + *expected = current;
> > }
> >
> > Can we implement this API without changing expected value?
..
> > I think this value is required for lwlock patch, but I am wondering why
> > can't the same be achieved if we just return the *current* value and
> > then let lwlock patch do the handling based on it. This will have another
> > advantage that our pg_* API will also have similar signature as native
> > API's.
>
> Many usages of compare/exchange require that you'll get the old value
> back in an atomic fashion. Unfortunately the Interlocked* API doesn't
> provide a nice way to do that.
Yes, but I think the same cannot be accomplished even by using
expected.
>Note that this definition of
> compare/exchange both maps nicely to C11's atomics and the actual x86
> cmpxchg instruction.
>
> I've generally tried to mimick C11's API as far as it's
> possible. There's some limitations because we can't do generic types and
> such, but other than that.
> > 3. Is there a overhead of using Add, instead of increment/decrement
> > version of Interlocked?
>
> There's a theoretical difference for IA64 which has a special
> increment/decrement operation which can only change the value by a
> rather limited number of values. I don't think it's worth to care.
Okay.
> > 4.
..
> > There is a Caution notice in microsoft site indicating
> > _ReadWriteBarrier/MemoryBarrier are deprected.
>
> It seemed to be the most widely available API, and it's what barrier.h
> already used.
> Do you have a different suggestion?
I am trying to think based on suggestion given by Microsoft, but
> compare/exchange both maps nicely to C11's atomics and the actual x86
> cmpxchg instruction.
>
> I've generally tried to mimick C11's API as far as it's
> possible. There's some limitations because we can't do generic types and
> such, but other than that.
If I am reading C11's spec for this API correctly, then it says as below:
"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."
So it essentialy means that it loads actual value in expected only if
operation is not successful.
> > 3. Is there a overhead of using Add, instead of increment/decrement
> > version of Interlocked?
>
> There's a theoretical difference for IA64 which has a special
> increment/decrement operation which can only change the value by a
> rather limited number of values. I don't think it's worth to care.
Okay.
> > 4.
..
> > There is a Caution notice in microsoft site indicating
> > _ReadWriteBarrier/MemoryBarrier are deprected.
>
> It seemed to be the most widely available API, and it's what barrier.h
> already used.
> Do you have a different suggestion?
I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.
> > 6.
> > pg_atomic_compare_exchange_u32()
> >
> > It is better to have comments above this and all other related functions.
>
> Check atomics.h, there's comments above it:
> /*
> * pg_atomic_compare_exchange_u32 - CAS operation
> *
> * Atomically compare the current value of ptr with *expected and store newval
> * iff ptr and *expected have the same value. The current value of *ptr will
> * always be stored in *expected.
> *
> * Return whether the values have been exchanged.
> *
> * Full barrier semantics.
> */
Okay, generally code has such comments on top of function
definition rather than with declaration.
On 2014-06-29 11:53:28 +0530, Amit Kapila wrote: > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> > > Two things: > > a) compare_exchange is a read/write operator and so far I've defined it > > to have full barrier semantics (documented in atomics.h). I'd be > > happy to discuss which semantics we want for which operation. > > I think it is better to have some discussion about it. Apart from this, > today I noticed atomic read/write doesn't use any barrier semantics > and just rely on volatile. Yes, intentionally so. It's often important to get/set the current value without the cost of emitting a memory barrier. It just has to be a recent value and it actually has to come from memory. And that's actually enforced by the current definition - I think? > > b) It's only supported from vista onwards. Afaik we still support XP. > #ifndef pg_memory_barrier_impl > #define pg_memory_barrier_impl() MemoryBarrier() > #endif > > The MemoryBarrier() macro used also supports only on vista onwards, > so I think for reasons similar to using MemoryBarrier() can apply for > this as well. I think that's odd because barrier.h has been using MemoryBarrier() without a version test for a long time now. I guess everyone uses a new enough visual studio. Those intrinsics aren't actually OS but just compiler dependent. Otherwise we could just define it as: FORCEINLINE VOID MemoryBarrier ( VOID ) { LONG Barrier; __asm { xchg Barrier, eax } } > > c) It doesn't make any difference on x86 ;) > > What about processors like Itanium which care about acquire/release > memory barrier semantics? Well, it still will be correct? I don't think it makes much sense to focus overly much on itanium here with the price of making anything more complicated for others. > > > I think this value is required for lwlock patch, but I am wondering why > > > can't the same be achieved if we just return the *current* value and > > > then let lwlock patch do the handling based on it. This will have > another > > > advantage that our pg_* API will also have similar signature as native > > > API's. > > > > Many usages of compare/exchange require that you'll get the old value > > back in an atomic fashion. Unfortunately the Interlocked* API doesn't > > provide a nice way to do that. > > Yes, but I think the same cannot be accomplished even by using > expected. More complicatedly so, yes? I don't think we want those comparisons on practically every callsite. > >Note that this definition of > > compare/exchange both maps nicely to C11's atomics and the actual x86 > > cmpxchg instruction. > > > > I've generally tried to mimick C11's API as far as it's > > possible. There's some limitations because we can't do generic types and > > such, but other than that. > > If I am reading C11's spec for this API correctly, then it says as below: > "Atomically compares the value pointed to by obj with the value pointed > to by expected, and if those are equal, replaces the former with desired > (performs read-modify-write operation). Otherwise, loads the actual value > pointed to by obj into *expected (performs load operation)." > > So it essentialy means that it loads actual value in expected only if > operation is not successful. Yes. But in the case it's successfull it will already contain the right value. > > > 4. > .. > > > There is a Caution notice in microsoft site indicating > > > _ReadWriteBarrier/MemoryBarrier are deprected. > > > > It seemed to be the most widely available API, and it's what barrier.h > > already used. > > Do you have a different suggestion? > > I am trying to think based on suggestion given by Microsoft, but > not completely clear how to accomplish the same at this moment. Well, they refer to C11 stuff. But I think it'll be a while before it makes sense to use a fallback based on that. > > > 6. > > > pg_atomic_compare_exchange_u32() > > > > > > It is better to have comments above this and all other related > functions. > > > > Check atomics.h, there's comments above it: > > /* > > * pg_atomic_compare_exchange_u32 - CAS operation > > * > > * Atomically compare the current value of ptr with *expected and store > newval > > * iff ptr and *expected have the same value. The current value of *ptr > will > > * always be stored in *expected. > > * > > * Return whether the values have been exchanged. > > * > > * Full barrier semantics. > > */ > > Okay, generally code has such comments on top of function > definition rather than with declaration. I don't want to have it on the _impl functions because they're duplicated for the individual platforms and will just become out of date... Greetings, Andres Freund
On 2014-06-28 19:36:39 +0300, Heikki Linnakangas wrote: > On 06/27/2014 08:15 PM, Robert Haas wrote: > >On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >>I don't really see usecases where it's not related data that's being > >>touched, but: The point is that the fastpath (not using a spinlock) might > >>touch the atomic variable, even while the slowpath has acquired the > >>spinlock. So the slowpath can't just use non-atomic operations on the > >>atomic variable. > >>Imagine something like releasing a buffer pin while somebody else is > >>doing something that requires holding the buffer header spinlock. > > > >If the atomic variable can be manipulated without the spinlock under > >*any* circumstances, then how is it a good idea to ever manipulate it > >with the spinlock? > > With the WALInsertLock scaling stuff in 9.4, there are now two variables > protected by a spinlock: the current WAL insert location, and the prev > pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need > to grab the spinlock to update both of them atomically. But to just read the > WAL insert pointer, you could do an atomic read of CurrBytePos if the > architecture supports it - now you have to grab the spinlock. > Admittedly that's just an atomic read, not an atomic compare and exchange or > fetch-and-add. There's several architectures where you can do atomic 8byte reads, but only busing cmpxchg or similar, *not* using a plain read... So I think it's actually a fair example. > Or if the architecture has an atomic 128-bit compare & > exchange op you could replace the spinlock with that. But it's not hard to > imagine similar situations where you sometimes need to lock a larger data > structure to modify it atomically, but sometimes you just need to modify > part of it and an atomic op would suffice. Yes. > I thought Andres' LWLock patch also did something like that. If the lock is > not contended, you can acquire it with an atomic compare & exchange to > increment the exclusive/shared counter. But to manipulate the wait queue, > you need the spinlock. Right. It just happens that the algorithm requires none of the atomic ops have to be under the spinlock... But I generally think that we'll see more slow/fastpath like situations cropping up where we can't always avoid the atomic op while holding the lock. How about adding a paragraph that explains that it's better to avoid atomics usage of spinlocks because of the prolonged runtime, especially due to the emulation and cache misses? But also saying it's ok if the algorithm needs it and is a sufficient benefit? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
> > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
> > I think it is better to have some discussion about it. Apart from this,
> > today I noticed atomic read/write doesn't use any barrier semantics
> > and just rely on volatile.
>
> Yes, intentionally so. It's often important to get/set the current value
> without the cost of emitting a memory barrier. It just has to be a
> recent value and it actually has to come from memory.
> > > b) It's only supported from vista onwards. Afaik we still support XP.
> > #ifndef pg_memory_barrier_impl
> > #define pg_memory_barrier_impl() MemoryBarrier()
> > #endif
> >
> > The MemoryBarrier() macro used also supports only on vista onwards,
> > so I think for reasons similar to using MemoryBarrier() can apply for
> > this as well.
>
> I think that's odd because barrier.h has been using MemoryBarrier()
> without a version test for a long time now. I guess everyone uses a new
> enough visual studio.
> > > > 6.
> > > > pg_atomic_compare_exchange_u32()
> > > >
> > > > It is better to have comments above this and all other related
> > functions.
> > Okay, generally code has such comments on top of function
> > definition rather than with declaration.
>
> I don't want to have it on the _impl functions because they're
> duplicated for the individual platforms and will just become out of
> date...
> On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
> > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
> > I think it is better to have some discussion about it. Apart from this,
> > today I noticed atomic read/write doesn't use any barrier semantics
> > and just rely on volatile.
>
> Yes, intentionally so. It's often important to get/set the current value
> without the cost of emitting a memory barrier. It just has to be a
> recent value and it actually has to come from memory.
I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.
> And that's actually
> enforced by the current definition - I think?
> enforced by the current definition - I think?
Yeah, this is the only point which I am trying to ensure, thats why I sent
you few links in previous mail where I had got some suspicion that
just doing get/set with volatile might lead to correctness issue in some
cases.
Some another Note, I found today while reading more on it which suggests
that my previous observation is right:
"Within a thread of execution, accesses (reads and writes) to all volatile
objects are guaranteed to not be reordered relative to each other, but this
order is not guaranteed to be observed by another thread, since volatile
access does not establish inter-thread synchronization."
> > > b) It's only supported from vista onwards. Afaik we still support XP.
> > #ifndef pg_memory_barrier_impl
> > #define pg_memory_barrier_impl() MemoryBarrier()
> > #endif
> >
> > The MemoryBarrier() macro used also supports only on vista onwards,
> > so I think for reasons similar to using MemoryBarrier() can apply for
> > this as well.
>
> I think that's odd because barrier.h has been using MemoryBarrier()
> without a version test for a long time now. I guess everyone uses a new
> enough visual studio.
Yeap or might be the people who even are not using new enough version
doesn't have a use case that can hit a problem due to MemoryBarrier()
> Those intrinsics aren't actually OS but just
> compiler dependent.
>
> Otherwise we could just define it as:
>
> FORCEINLINE
> VOID
> MemoryBarrier (
> VOID
> )
> {
> LONG Barrier;
> __asm {
> xchg Barrier, eax
> }
> }
Yes, I think it will be better, how about defining this way just for
> compiler dependent.
>
> Otherwise we could just define it as:
>
> FORCEINLINE
> VOID
> MemoryBarrier (
> VOID
> )
> {
> LONG Barrier;
> __asm {
> xchg Barrier, eax
> }
> }
Yes, I think it will be better, how about defining this way just for
the cases where MemoryBarrier macro is not supported.
> > > c) It doesn't make any difference on x86 ;)
> >
> > What about processors like Itanium which care about acquire/release
> > memory barrier semantics?
>
> Well, it still will be correct? I don't think it makes much sense to
> focus overly much on itanium here with the price of making anything more
> complicated for others.
Completely agreed that we should not add or change anything which
> > > c) It doesn't make any difference on x86 ;)
> >
> > What about processors like Itanium which care about acquire/release
> > memory barrier semantics?
>
> Well, it still will be correct? I don't think it makes much sense to
> focus overly much on itanium here with the price of making anything more
> complicated for others.
Completely agreed that we should not add or change anything which
makes patch complicated or can effect the performance, however
the reason for focusing is just to ensure that we should not break
any existing use case for Itanium w.r.t performance or otherwise.
Another way is that we can give ignore or just give lower priority for
verfiying if the patch has anything wrong for Itanium processors.
> > If I am reading C11's spec for this API correctly, then it says as below:
> > "Atomically compares the value pointed to by obj with the value pointed
> > to by expected, and if those are equal, replaces the former with desired
> > (performs read-modify-write operation). Otherwise, loads the actual value
> > pointed to by obj into *expected (performs load operation)."
> >
> > So it essentialy means that it loads actual value in expected only if
> > operation is not successful.
>
> Yes. But in the case it's successfull it will already contain the right
> value.
The point was just that we are not completely following C11 atomic
> > "Atomically compares the value pointed to by obj with the value pointed
> > to by expected, and if those are equal, replaces the former with desired
> > (performs read-modify-write operation). Otherwise, loads the actual value
> > pointed to by obj into *expected (performs load operation)."
> >
> > So it essentialy means that it loads actual value in expected only if
> > operation is not successful.
>
> Yes. But in the case it's successfull it will already contain the right
> value.
The point was just that we are not completely following C11 atomic
specification, so it might be okay to tweak a bit and make things
simpler and save LOAD instruction. However as there is no correctness
issue here and you think that current implementation is good and
can serve more cases, we can keep as it is and move forward.
> > > > 4.
> > ..
> > > > There is a Caution notice in microsoft site indicating
> > > > _ReadWriteBarrier/MemoryBarrier are deprected.
> > >
> > > It seemed to be the most widely available API, and it's what barrier.h
> > > already used.
> > > Do you have a different suggestion?
> >
> > I am trying to think based on suggestion given by Microsoft, but
> > not completely clear how to accomplish the same at this moment.
>
> Well, they refer to C11 stuff. But I think it'll be a while before it
> makes sense to use a fallback based on that.
In this case, I have a question for you.
> > > > 4.
> > ..
> > > > There is a Caution notice in microsoft site indicating
> > > > _ReadWriteBarrier/MemoryBarrier are deprected.
> > >
> > > It seemed to be the most widely available API, and it's what barrier.h
> > > already used.
> > > Do you have a different suggestion?
> >
> > I am trying to think based on suggestion given by Microsoft, but
> > not completely clear how to accomplish the same at this moment.
>
> Well, they refer to C11 stuff. But I think it'll be a while before it
> makes sense to use a fallback based on that.
In this case, I have a question for you.
Un-patched usage in barrier.h is as follows:
..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()
#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()
#endif
#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
# include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
# include "storage/atomics-generic-msvc.h"
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()
#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()
#endif
If I understand correctly the current define mechanism in barrier.h,
it will have different definition for Itanium processors even for windows.
However the patch defines as below:
# include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
# include "storage/atomics-generic-msvc.h"
What I can understand from above is that defines in
storage/atomics-generic-msvc.h, will override any previous defines
for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
will be considered for Windows always.
> > > > 6.
> > > > pg_atomic_compare_exchange_u32()
> > > >
> > > > It is better to have comments above this and all other related
> > functions.
> > Okay, generally code has such comments on top of function
> > definition rather than with declaration.
>
> I don't want to have it on the _impl functions because they're
> duplicated for the individual platforms and will just become out of
> date...
Sure, I was also not asking for _impl functions. What I was asking
in this point was to have comments on top of definition of
pg_atomic_compare_exchange_u32() in atomics.h
In particular, on top of below and similar functions, rather than
at the place where they are declared.
STATIC_IF_INLINE bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
{
CHECK_POINTER_ALIGNMENT(ptr, 4);
CHECK_POINTER_ALIGNMENT(expected, 4);
return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
{
CHECK_POINTER_ALIGNMENT(ptr, 4);
CHECK_POINTER_ALIGNMENT(expected, 4);
return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}
On 2014-06-30 11:04:53 +0530, Amit Kapila wrote: > On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-06-29 11:53:28 +0530, Amit Kapila wrote: > > > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> > > > I think it is better to have some discussion about it. Apart from this, > > > today I noticed atomic read/write doesn't use any barrier semantics > > > and just rely on volatile. > > > > Yes, intentionally so. It's often important to get/set the current value > > without the cost of emitting a memory barrier. It just has to be a > > recent value and it actually has to come from memory. > > I agree with you that we don't want to incur the cost of memory barrier > for get/set, however it should not be at the cost of correctness. I really can't follow here. A volatile read/store forces it to go to memory without barriers. The ABIs of all platforms we work with guarantee that 4bytes stores/reads are atomic - we've been relying on that for a long while. > > And that's actually > > enforced by the current definition - I think? > > Yeah, this is the only point which I am trying to ensure, thats why I sent > you few links in previous mail where I had got some suspicion that > just doing get/set with volatile might lead to correctness issue in some > cases. But this isn't something this patch started doing - we've been doing that for a long while? > Some another Note, I found today while reading more on it which suggests > that my previous observation is right: > > "Within a thread of execution, accesses (reads and writes) to all volatile > objects are guaranteed to not be reordered relative to each other, but this > order is not guaranteed to be observed by another thread, since volatile > access does not establish inter-thread synchronization." > http://en.cppreference.com/w/c/atomic/memory_order That's just saying there's no ordering guarantees with volatile stores/reads. I don't see a contradiction? > > > > b) It's only supported from vista onwards. Afaik we still support XP. > > > #ifndef pg_memory_barrier_impl > > > #define pg_memory_barrier_impl() MemoryBarrier() > > > #endif > > > > > > The MemoryBarrier() macro used also supports only on vista onwards, > > > so I think for reasons similar to using MemoryBarrier() can apply for > > > this as well. > > > > I think that's odd because barrier.h has been using MemoryBarrier() > > without a version test for a long time now. I guess everyone uses a new > > enough visual studio. > > Yeap or might be the people who even are not using new enough version > doesn't have a use case that can hit a problem due to MemoryBarrier() Well, with barrier.h as it stands they'd get a compiler error if it indeed is unsupported? But I think there's something else going on - msft might just be removing XP from it's documentation. Postgres supports building with with visual studio 2005 and MemoryBarrier() seems to be supported by it. I think we'd otherwise seen problems since 9.1 where barrier.h was introduced. > In this case, I have a question for you. > > Un-patched usage in barrier.h is as follows: > .. > #elif defined(__ia64__) || defined(__ia64) > #define pg_compiler_barrier() _Asm_sched_fence() > #define pg_memory_barrier() _Asm_mf() > > #elif defined(WIN32_ONLY_COMPILER) > /* Should work on both MSVC and Borland. */ > #include <intrin.h> > #pragma intrinsic(_ReadWriteBarrier) > #define pg_compiler_barrier() _ReadWriteBarrier() > #define pg_memory_barrier() MemoryBarrier() > #endif > > If I understand correctly the current define mechanism in barrier.h, > it will have different definition for Itanium processors even for windows. Either noone has ever tested postgres on itanium windows (quite possible), because afaik _Asm_sched_fence() doesn't work on windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros arefor HP's acc on HPUX. > However the patch defines as below: > > #if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS) > # include "storage/atomics-generic-gcc.h" > #elif defined(WIN32_ONLY_COMPILER) > # include "storage/atomics-generic-msvc.h" > > What I can understand from above is that defines in > storage/atomics-generic-msvc.h, will override any previous defines > for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier() > will be considered for Windows always. Well, the memory barrier is surrounded by #ifndef pg_memory_barrier_impl. The compiler barrier can't reasonably be defined earlier since it's a compiler not an architecture thing. > > > > > 6. > > > > > pg_atomic_compare_exchange_u32() > > > > > > > > > > It is better to have comments above this and all other related > > > functions. > > > Okay, generally code has such comments on top of function > > > definition rather than with declaration. > > > > I don't want to have it on the _impl functions because they're > > duplicated for the individual platforms and will just become out of > > date... > > Sure, I was also not asking for _impl functions. What I was asking > in this point was to have comments on top of definition of > pg_atomic_compare_exchange_u32() in atomics.h > In particular, on top of below and similar functions, rather than > at the place where they are declared. Hm, we can do that. Don't think it'll be clearer (because you need to go to the header anyway), but I don't feel strongly. I'd much rather get rid of the separated definition/declaration, but we'd need to rely on PG_USE_INLINE for it... Greetings, Andres Freund
On Sat, Jun 28, 2014 at 4:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > To make pin/unpin et al safe without acquiring the spinlock the pincount >> > and the flags had to be moved into one variable so that when >> > incrementing the pincount you automatically get the flagbits back. If >> > it's currently valid all is good, otherwise the spinlock needs to be >> > acquired. But for that to work you need to set flags while the spinlock >> > is held. >> > >> > Possibly you can come up with ways to avoid that, but I am pretty damn >> > sure that the code will be less robust by forbidding atomics inside a >> > critical section, not the contrary. It's a good idea to avoid it, but >> > not at all costs. >> >> You might be right, but I'm not convinced. > > Why? Anything I can do to convince you of this? Note that other users of > atomics (e.g. the linux kernel) widely use atomics inside spinlock > protected regions. The Linux kernel isn't a great example, because they can ensure that they aren't preempted while holding the spinlock. We can't, and that's a principle part of my concern here. More broadly, it doesn't seem consistent. I think that other projects also sometimes write code that acquires a spinlock while holding another spinlock, and we don't do that; in fact, we've elevated that to a principle, regardless of whether it performs well in practice. In some cases, the CPU instruction that we issue to acquire that spinlock might be the exact same instruction we'd issue to manipulate an atomic variable. I don't see how it can be right to say that a spinlock-protected critical section is allowed to manipulate an atomic variable with cmpxchg, but not allowed to acquire another spinlock with cmpxchg. Either acquiring the second spinlock is OK, in which case our current coding rule is wrong, or acquiring the atomic variable is not OK, either. I don't see how we can have that both ways. >> What I'm basically afraid of is that this will work fine in many cases >> but have really ugly failure modes. That's pretty much what happens >> with spinlocks already - the overhead is insignificant at low levels >> of contention but as the spinlock starts to become contended the CPUs >> all fight over the cache line and performance goes to pot. ISTM that >> making the spinlock critical section significantly longer by putting >> atomic ops in there could appear to win in some cases while being very >> bad in others. > > Well, I'm not saying it's something I suggest doing all the time. But if > using an atomic op in the slow path allows you to remove the spinlock > from 99% of the cases I don't see it having a significant impact. > In most scenarios (where atomics aren't emulated, i.e. platforms we > expect to used in heavily concurrent cases) the spinlock and the atomic > will be on the same cacheline making stalls much less likely. And this again is my point: why can't we make the same argument about two spinlocks situated on the same cache line? I don't have a bit of trouble believing that doing the same thing with a couple of spinlocks could sometimes work out well, too, but Tom is adamantly opposed to that. I don't want to just make an end run around that issue without understanding whether he has a valid point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-30 12:15:06 -0400, Robert Haas wrote: > More broadly, it doesn't seem consistent. I think that other projects > also sometimes write code that acquires a spinlock while holding > another spinlock, and we don't do that; in fact, we've elevated that > to a principle, regardless of whether it performs well in practice. It's actually more than a principle: It's now required for correctness, because otherwise the semaphore based spinlock implementation will deadlock otherwise. > In some cases, the CPU instruction that we issue to acquire that > spinlock might be the exact same instruction we'd issue to manipulate > an atomic variable. I don't see how it can be right to say that a > spinlock-protected critical section is allowed to manipulate an atomic > variable with cmpxchg, but not allowed to acquire another spinlock > with cmpxchg. Either acquiring the second spinlock is OK, in which > case our current coding rule is wrong, or acquiring the atomic > variable is not OK, either. I don't see how we can have that both > ways. The no nested spinlock thing used to be a guideline. One nobody had big problems with because it didn't prohibit solutions to problems. Since guidelines are more useful when simple it's "no nested spinlocks!". Also those two cmpxchg's aren't the same: The CAS for spinlock acquiration will only succeed if the lock isn't acquired. If the lock holder sleeps you'll have to wait for it to wake up. In contrast to that CASs for lockfree (or lock-reduced) algorithms won't be blocked if the last manipulation was done by a backend that's now sleeping. It'll possibly loop once, get the new value, and retry the CAS. Yes, it can still take couple of iterations under heavy concurrency, but you don't have the problem that the lockholder goes to sleep. Now, you can argue that the spinlock based fallbacks make that difference moot, but I think that's just the price for an emulation fringe platforms will have to pay. They aren't platforms used under heavy concurrency. > >> What I'm basically afraid of is that this will work fine in many cases > >> but have really ugly failure modes. That's pretty much what happens > >> with spinlocks already - the overhead is insignificant at low levels > >> of contention but as the spinlock starts to become contended the CPUs > >> all fight over the cache line and performance goes to pot. ISTM that > >> making the spinlock critical section significantly longer by putting > >> atomic ops in there could appear to win in some cases while being very > >> bad in others. > > > > Well, I'm not saying it's something I suggest doing all the time. But if > > using an atomic op in the slow path allows you to remove the spinlock > > from 99% of the cases I don't see it having a significant impact. > > In most scenarios (where atomics aren't emulated, i.e. platforms we > > expect to used in heavily concurrent cases) the spinlock and the atomic > > will be on the same cacheline making stalls much less likely. > > And this again is my point: why can't we make the same argument about > two spinlocks situated on the same cache line? Because it's not relevant? Where and why do we want to acquire two spinlocks that are on the same cacheline? As far as I know there's never been an actual need for nested spinlocks. So the guideline can be as restrictive as is it is because the price is low and there's no benefit in making it more complex. I think this line of discussion is pretty backwards. The reason I (and seemingly Heikki) want to use atomics is that it can reduce contention significantly. That contention is today too a good part based on spinlocks. Your argument is basically that increasing spinlock contention by doing things that can take more than a fixed cycles can increase contention. But the changes we've talked about only make sense if they significantly reduce spinlock contention in the first place - a potential for a couple iterations while holding better not be relevant for proposed patches. I don't think a blanket rule makes sense here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > ... this again is my point: why can't we make the same argument about > two spinlocks situated on the same cache line? I don't have a bit of > trouble believing that doing the same thing with a couple of spinlocks > could sometimes work out well, too, but Tom is adamantly opposed to > that. I think you might be overstating my position here. What I'm concerned about is that we be sure that spinlocks are held for a sufficiently short time that it's very unlikely that we get pre-empted while holding one. I don't have any particular bright line about how short a time that is, but more than "a few instructions" worries me. As you say, the Linux kernel is a bad example to follow because it hasn't got a risk of losing its timeslice while holding a spinlock. The existing coding rules discourage looping (though I might be okay with a small constant loop count), and subroutine calls (mainly because somebody might add $random_amount_of_work to the subroutine if they don't realize it can be called while holding a spinlock). Both of these rules are meant to reduce the risk that a short interval balloons into a long one due to unrelated coding changes. The existing coding rules also discourage spinlocking within a spinlock, and the reason for that is that there's no very clear upper bound to the time required to obtain a spinlock, so that there would also be no clear upper bound to the time you're holding the original one (thus possibly leading to cascading performance problems). So ISTM the question we ought to be asking is whether atomic operations have bounded execution time, or more generally what the distribution of execution times is likely to be. I'd be OK with an answer that includes "sometimes it can be long" so long as "sometimes" is "pretty damn seldom". Spinlocks have a nonzero risk of taking a long time already, since we can't entirely prevent the possibility of losing our timeslice while holding one. The issue here is just to be sure that that happens seldom enough that it doesn't cause performance problems. If we fail to do that we might negate all the desired performance improvements from adopting atomic ops in the first place. regards, tom lane
On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The existing coding rules also discourage spinlocking within a spinlock, > and the reason for that is that there's no very clear upper bound to the > time required to obtain a spinlock, so that there would also be no clear > upper bound to the time you're holding the original one (thus possibly > leading to cascading performance problems). Well, as Andres points out, commit daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad requirement than a suggestion. If it's sometimes OK to acquire a spinlock from within a spinlock, then that commit was badly misguided; but the design of that patch was pretty much driven by your insistence that that was never, ever OK. If that was wrong, then we should reconsider that whole commit more broadly; but if it was right, then the atomics patch shouldn't drill a hole through it to install an exception just for emulating atomics. I'm personally not convinced that we're approaching this topic in the right way. I'm not convinced that it's at all reasonable to try to emulate atomics on platforms that don't have them. I would punt the problem into the next layer and force things like lwlock.c to have fallback implementations at that level that don't require atomics, and remove those fallback implementations if and when we move the goalposts so that all supported platforms must have working atomics implementations. People who write code that uses atomics are not likely to think about how those algorithms will actually perform when those atomics are merely emulated, and I suspect that means that in practice platforms that have only emulated atomics are going to regress significantly vs. the status quo today. If this patch goes in the way it's written right now, then I think it basically amounts to throwing platforms that don't have working atomics emulation under the bus, and my vote is that if we're going to do that, we should do it explicitly: sorry, we now only support platforms with working atomics. You don't have any, so you can't run PostgreSQL. Have a nice day. If we *don't* want to make that decision, then I'm not convinced that the approach this patch takes is in any way viable, completely aside from this particular question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-30 13:15:23 -0400, Robert Haas wrote: > On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The existing coding rules also discourage spinlocking within a spinlock, > > and the reason for that is that there's no very clear upper bound to the > > time required to obtain a spinlock, so that there would also be no clear > > upper bound to the time you're holding the original one (thus possibly > > leading to cascading performance problems). > > Well, as Andres points out, commit > daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad > requirement than a suggestion. If it's sometimes OK to acquire a > spinlock from within a spinlock, then that commit was badly misguided; > but the design of that patch was pretty much driven by your insistence > that that was never, ever OK. If that was wrong, then we should > reconsider that whole commit more broadly; but if it was right, then > the atomics patch shouldn't drill a hole through it to install an > exception just for emulating atomics. Well, since nobody has a problem with the rule of not having nested spinlocks and since the problems aren't the same I'm failing to see why we should reconsider this. My recollection was that we primarily discussed whether it'd be a good idea to add code to Assert() when spinlocks are nested to which Tom responded that it's not necessary because critical sections should be short enough to immmediately see that that's not a problem. Then we argued a bit back and forth around that. > I'm personally not convinced that we're approaching this topic in the > right way. I'm not convinced that it's at all reasonable to try to > emulate atomics on platforms that don't have them. I would punt the > problem into the next layer and force things like lwlock.c to have > fallback implementations at that level that don't require atomics, and > remove those fallback implementations if and when we move the > goalposts so that all supported platforms must have working atomics > implementations. If we're requiring a atomics-less implementation for lwlock.c, bufmgr. et al. I'll stop working on those features (and by extension on atomics). It's an utter waste of resources and maintainability trying to maintain parallel high level locking algorithms in several pieces of the codebase. The nonatomic versions will stagnate and won't actually work under load. I'd rather spend my time on something useful. The spinlock based atomics emulation is *small*. It's easy to reason about correctness there. If we're going that way, we've made a couple of absolute fringe platforms hold postgres, as actually used in reality, hostage. I think that's insane. > People who write code that uses atomics are not > likely to think about how those algorithms will actually perform when > those atomics are merely emulated, and I suspect that means that in > practice platforms that have only emulated atomics are going to > regress significantly vs. the status quo today. I think you're overstating the likely performance penalty for nonparallel platforms/workloads here quite a bit. The platforms without changes of gettings atomics implemented aren't ones where that's a likely thing. Yes, you'll see a regression if you run a readonly pgbench over a 4 node NUMA system - but it's not large. You won't see much of improved performance in comparison to 9.4, but I think that's primarily it. Also it's not like this is something new - the semaphore based fallback *sucks* but we still have it. *Many* features in new major versions regress performance for fringe platforms. That's normal. > If this patch goes in > the way it's written right now, then I think it basically amounts to > throwing platforms that don't have working atomics emulation under the > bus Why? Nobody is going to run 9.5 under high performance workloads on such platforms. They'll be so IO and RAM starved that the spinlocks won't be the bottleneck. >, and my vote is that if we're going to do that, we should do it > explicitly: sorry, we now only support platforms with working atomics. > You don't have any, so you can't run PostgreSQL. Have a nice day. I'd be fine with that. I don't think we're gaining much by those platforms. *But* I don't really see why it's required at this point. The fallback works and unless you're using semaphore based spinlocks the performance isn't bad. The lwlock code doesn't really get slower/faster in comparison to before when the atomics implementation is backed by semaphores. It's pretty much the same number of syscalls using the atomics/current implementation. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > I'm personally not convinced that we're approaching this topic in the > right way. I'm not convinced that it's at all reasonable to try to > emulate atomics on platforms that don't have them. I would punt the > problem into the next layer and force things like lwlock.c to have > fallback implementations at that level that don't require atomics, and > remove those fallback implementations if and when we move the > goalposts so that all supported platforms must have working atomics > implementations. People who write code that uses atomics are not > likely to think about how those algorithms will actually perform when > those atomics are merely emulated, and I suspect that means that in > practice platforms that have only emulated atomics are going to > regress significantly vs. the status quo today. I think this is a valid objection, and I for one am not prepared to say that we no longer care about platforms that don't have atomic ops (especially not if it's not a *very small* set of atomic ops). Also, just because a platform claims to have atomic ops doesn't mean that those ops perform well. If there's a kernel trap involved, they don't, at least not for our purposes. We're only going to be bothering with installing atomic-op code in places that are contention bottlenecks for us already, so we are not going to be happy with the results for any atomic-op implementation that's not industrial strength. This is one reason why I'm extremely suspicious of depending on gcc's intrinsics for this; that will not make the issue go away, only make it beyond our power to control. regards, tom lane
On 2014-06-30 12:54:10 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > ... this again is my point: why can't we make the same argument about > > two spinlocks situated on the same cache line? I don't have a bit of > > trouble believing that doing the same thing with a couple of spinlocks > > could sometimes work out well, too, but Tom is adamantly opposed to > > that. > > I think you might be overstating my position here. What I'm concerned > about is that we be sure that spinlocks are held for a sufficiently short > time that it's very unlikely that we get pre-empted while holding one. > I don't have any particular bright line about how short a time that is, > but more than "a few instructions" worries me. As you say, the Linux > kernel is a bad example to follow because it hasn't got a risk of losing > its timeslice while holding a spinlock. Well, they actually have preemptable and irq-interruptile versions for some of these locks, but whatever :). > So ISTM the question we ought to be asking is whether atomic operations > have bounded execution time, or more generally what the distribution of > execution times is likely to be. I'd be OK with an answer that includes > "sometimes it can be long" so long as "sometimes" is "pretty damn > seldom". I think it ranges from 'never' to 'sometimes, but pretty darn seldom'. Depending on the platform and emulation. And, leaving the emulation and badly written code aside, there's no issue with another backend sleeping while the atomic variable needs to be modified. > Spinlocks have a nonzero risk of taking a long time already, since we > can't entirely prevent the possibility of losing our timeslice while > holding one. The issue here is just to be sure that that happens seldom > enough that it doesn't cause performance problems. If we fail to do that > we might negate all the desired performance improvements from adopting > atomic ops in the first place. I think individual patches will have to stand up to a test like this. Modifying a atomic variable inside a spinlock is only going to be worth it if it allows to avoid spinlocks alltogether in 95%+ of the time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-30 19:44:47 +0200, Andres Freund wrote: > On 2014-06-30 13:15:23 -0400, Robert Haas wrote: > > People who write code that uses atomics are not > > likely to think about how those algorithms will actually perform when > > those atomics are merely emulated, and I suspect that means that in > > practice platforms that have only emulated atomics are going to > > regress significantly vs. the status quo today. > > I think you're overstating the likely performance penalty for > nonparallel platforms/workloads here quite a bit. The platforms without > changes of gettings atomics implemented aren't ones where that's a > likely thing. Yes, you'll see a regression if you run a readonly pgbench > over a 4 node NUMA system - but it's not large. You won't see much of > improved performance in comparison to 9.4, but I think that's primarily > it. To quantify this, on my 2 socket xeon E5520 workstation - which is too small to heavily show the contention problems in pgbench -S - the numbers are: pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeably variability) master: 152354.294117 lwlocks-atomics: 170528.784958 lwlocks-atomics-spin: 159416.202578 pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeably variability) master: 16273.702191 lwlocks-atomics: 16768.712433 lwlocks-atomics-spin: 16744.913083 So, there really isn't a problem with the emulation. It's not actually that surprising - the absolute number of atomic ops is prety much the same. Where we earlier took a spinlock to increment shared we now still take one. I expect that repeating this on the 4 socket machine will show a large gap between lwlocks-atomics and the other two. The other two will be about the same, with lwlocks-atomics-spin remaining a bit better than master. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-30 13:45:52 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I'm personally not convinced that we're approaching this topic in the > > right way. I'm not convinced that it's at all reasonable to try to > > emulate atomics on platforms that don't have them. I would punt the > > problem into the next layer and force things like lwlock.c to have > > fallback implementations at that level that don't require atomics, and > > remove those fallback implementations if and when we move the > > goalposts so that all supported platforms must have working atomics > > implementations. People who write code that uses atomics are not > > likely to think about how those algorithms will actually perform when > > those atomics are merely emulated, and I suspect that means that in > > practice platforms that have only emulated atomics are going to > > regress significantly vs. the status quo today. > > I think this is a valid objection, and I for one am not prepared to > say that we no longer care about platforms that don't have atomic ops > (especially not if it's not a *very small* set of atomic ops). It's still TAS(), cmpxchg(), xadd. With TAS/xadd possibly implemented via cmpxchg (which quite possibly *is* the native implementation for $arch). > Also, just because a platform claims to have atomic ops doesn't mean that > those ops perform well. If there's a kernel trap involved, they don't, > at least not for our purposes. Yea. I think if we start to use 8byte atomics at some later point we're going to have to prohibit e.g. older arms from using them, even if the compiler claims they're supported. But that's just one #define. > This is one > reason why I'm extremely suspicious of depending on gcc's intrinsics for > this; that will not make the issue go away, only make it beyond our > power to control. The patch as submitted makes it easy to provide a platform specific implementation of the important routines if it's likely that it's better than the intrinsics provided one. I'm not too worried about the intrinsics though - the number of projects relying on them these days is quite large. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 30, 2014 at 07:44:47PM +0200, Andres Freund wrote: > On 2014-06-30 13:15:23 -0400, Robert Haas wrote: > > I'm personally not convinced that we're approaching this topic in the > > right way. I'm not convinced that it's at all reasonable to try to > > emulate atomics on platforms that don't have them. I would punt the > > problem into the next layer and force things like lwlock.c to have > > fallback implementations at that level that don't require atomics, and > > remove those fallback implementations if and when we move the > > goalposts so that all supported platforms must have working atomics > > implementations. > > If we're requiring a atomics-less implementation for lwlock.c, > bufmgr. et al. I'll stop working on those features (and by extension on > atomics). It's an utter waste of resources and maintainability trying to > maintain parallel high level locking algorithms in several pieces of the > codebase. The nonatomic versions will stagnate and won't actually work > under load. I'd rather spend my time on something useful. The spinlock > based atomics emulation is *small*. It's easy to reason about > correctness there. > > If we're going that way, we've made a couple of absolute fringe > platforms hold postgres, as actually used in reality, hostage. I think > that's insane. I agree completely. > > If this patch goes in > > the way it's written right now, then I think it basically amounts to > > throwing platforms that don't have working atomics emulation under the > > bus > >, and my vote is that if we're going to do that, we should do it > > explicitly: sorry, we now only support platforms with working atomics. > > You don't have any, so you can't run PostgreSQL. Have a nice day. I might share that view if a platform faced a huge and easily-seen performance regression, say a 40% regression to 5-client performance. I don't expect the shortfall to reach that ballpark for any atomics-exploiting algorithm we'll adopt, and your (Andres's) latest measurements corroborate that. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 30, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
> > On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Yes, intentionally so. It's often important to get/set the current value
> > > without the cost of emitting a memory barrier. It just has to be a
> > > recent value and it actually has to come from memory.
> >
> > I agree with you that we don't want to incur the cost of memory barrier
> > for get/set, however it should not be at the cost of correctness.
>
> I really can't follow here. A volatile read/store forces it to go to
> memory without barriers. The ABIs of all platforms we work with
> guarantee that 4bytes stores/reads are atomic - we've been relying on
> that for a long while.
I think for such usage, we need to rely on barriers wherever there is a
if somewhere we need to use pg_atomic_compare_exchange_u32(),
> On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
> > On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Yes, intentionally so. It's often important to get/set the current value
> > > without the cost of emitting a memory barrier. It just has to be a
> > > recent value and it actually has to come from memory.
> >
> > I agree with you that we don't want to incur the cost of memory barrier
> > for get/set, however it should not be at the cost of correctness.
>
> I really can't follow here. A volatile read/store forces it to go to
> memory without barriers. The ABIs of all platforms we work with
> guarantee that 4bytes stores/reads are atomic - we've been relying on
> that for a long while.
I think for such usage, we need to rely on barriers wherever there is a
need for synchronisation, recent example I have noticed is in your patch
where we have to use pg_write_barrier() during wakeup. However if we
go by atomic ops definition, then no such dependency is required, likeif somewhere we need to use pg_atomic_compare_exchange_u32(),
after that we don't need to ensure about barriers, because this atomic
API adheres to Full barrier semantics.
I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.
I think we can still rely on volatile read/store for ordering guarantee
within same thread of execution and where ever we need synchronisation
of usage among different processes, we can directly use atomic ops
(get/set) without the need to use barriers.
> > In this case, I have a question for you.
> >
> > Un-patched usage in barrier.h is as follows:
> > ..
> >
> > If I understand correctly the current define mechanism in barrier.h,
> > it will have different definition for Itanium processors even for windows.
>
> Either noone has ever tested postgres on itanium windows (quite
> possible), because afaik _Asm_sched_fence() doesn't work on
> windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> arefor HP's acc on HPUX.
Hmm.. Do you think in such a case, it would have gone in below
> > > > > > 6.
> > Sure, I was also not asking for _impl functions. What I was asking
> > in this point was to have comments on top of definition of
> > pg_atomic_compare_exchange_u32() in atomics.h
> > In particular, on top of below and similar functions, rather than
> > at the place where they are declared.
>
> Hm, we can do that. Don't think it'll be clearer (because you need to go
> to the header anyway), but I don't feel strongly.
I think this depends on individual's perspective, so do the way
> > > > > b) It's only supported from vista onwards. Afaik we still support XP.
>
> Well, with barrier.h as it stands they'd get a compiler error if it
> indeed is unsupported? But I think there's something else going on -
> msft might just be removing XP from it's documentation.
>
> Well, with barrier.h as it stands they'd get a compiler error if it
> indeed is unsupported? But I think there's something else going on -
> msft might just be removing XP from it's documentation.
Okay, but why would they remove for MemoryBarrier() and not
for InterlockedCompareExchange(). I think it is bit difficult to predict
the reason and if we want to use anything which is not as msft docs,
we shall do that carefully and have some way to ensure that it will
work fine.
> > In this case, I have a question for you.
> >
> > Un-patched usage in barrier.h is as follows:
> > ..
> >
> > If I understand correctly the current define mechanism in barrier.h,
> > it will have different definition for Itanium processors even for windows.
>
> Either noone has ever tested postgres on itanium windows (quite
> possible), because afaik _Asm_sched_fence() doesn't work on
> windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> arefor HP's acc on HPUX.
Hmm.. Do you think in such a case, it would have gone in below
define:
#elif defined(__INTEL_COMPILER)
/*
* icc defines __GNUC__, but doesn't support gcc's inline asm syntax
*/
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif
-#define pg_compiler_barrier() __memory_barrier()
> > However the patch defines as below:
/*
* icc defines __GNUC__, but doesn't support gcc's inline asm syntax
*/
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif
Currently this will be considered as compiler barrier for
__INTEL_COMPILER, but after patch, I don't see this define. I think
after patch, it will be compiler_impl specific, but why such a change?
> > However the patch defines as below:
..
> > What I can understand from above is that defines in
> > storage/atomics-generic-msvc.h, will override any previous defines
> > for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
> > will be considered for Windows always.
>
> Well, the memory barrier is surrounded by #ifndef
> pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
> earlier since it's a compiler not an architecture thing.
I think as per your new compiler specific implementation stuff, this holds
> > What I can understand from above is that defines in
> > storage/atomics-generic-msvc.h, will override any previous defines
> > for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
> > will be considered for Windows always.
>
> Well, the memory barrier is surrounded by #ifndef
> pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
> earlier since it's a compiler not an architecture thing.
I think as per your new compiler specific implementation stuff, this holds
true.
> > > > > > 6.
> > Sure, I was also not asking for _impl functions. What I was asking
> > in this point was to have comments on top of definition of
> > pg_atomic_compare_exchange_u32() in atomics.h
> > In particular, on top of below and similar functions, rather than
> > at the place where they are declared.
>
> Hm, we can do that. Don't think it'll be clearer (because you need to go
> to the header anyway), but I don't feel strongly.
I think this depends on individual's perspective, so do the way
you feel better, the intention of this point was lets not deviate from
existing coding ways.
> I'd much rather get rid of the separated definition/declaration, but
> we'd need to rely on PG_USE_INLINE for it...
> I'd much rather get rid of the separated definition/declaration, but
> we'd need to rely on PG_USE_INLINE for it...
Okay.
On 2014-07-01 10:44:19 +0530, Amit Kapila wrote: > I think for such usage, we need to rely on barriers wherever there is a > need for synchronisation, recent example I have noticed is in your patch > where we have to use pg_write_barrier() during wakeup. However if we > go by atomic ops definition, then no such dependency is required, like > if somewhere we need to use pg_atomic_compare_exchange_u32(), > after that we don't need to ensure about barriers, because this atomic > API adheres to Full barrier semantics. > I think defining barrier support for some atomic ops and not for others > will lead to inconsistency with specs and we need to additionally > define rules for such atomic ops usage. Meh. It's quite common that read/load do not have barrier semantics. The majority of read/write users won't need barriers and it'll be expensive to provide them with them. > > > > > > b) It's only supported from vista onwards. Afaik we still support > XP. > > > > Well, with barrier.h as it stands they'd get a compiler error if it > > indeed is unsupported? But I think there's something else going on - > > msft might just be removing XP from it's documentation. > > Okay, but why would they remove for MemoryBarrier() and not > for InterlockedCompareExchange(). I think it is bit difficult to predict > the reason and if we want to use anything which is not as msft docs, > we shall do that carefully and have some way to ensure that it will > work fine. Well, we have quite good evidence for it working by knowing that postgres has in the past worked on xp? If you have a better idea, send in a patch for today's barrier.h and I'll adopt that. > > > In this case, I have a question for you. > > > > > > Un-patched usage in barrier.h is as follows: > > > .. > > > > > > If I understand correctly the current define mechanism in barrier.h, > > > it will have different definition for Itanium processors even for windows. > > > > Either noone has ever tested postgres on itanium windows (quite > > possible), because afaik _Asm_sched_fence() doesn't work on > > windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros > > arefor HP's acc on HPUX. > > Hmm.. Do you think in such a case, it would have gone in below > define: > #elif defined(__INTEL_COMPILER) > /* > * icc defines __GNUC__, but doesn't support gcc's inline asm syntax > */ > #if defined(__ia64__) || defined(__ia64) > #define pg_memory_barrier() __mf() > #elif defined(__i386__) || defined(__x86_64__) > #define pg_memory_barrier() _mm_mfence() > #endif Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's acc, I don't see why? But they should go into atomics-generic-acc.h. > -#define pg_compiler_barrier() __memory_barrier() > > Currently this will be considered as compiler barrier for > __INTEL_COMPILER, but after patch, I don't see this define. I think > after patch, it will be compiler_impl specific, but why such a change? #if defined(__INTEL_COMPILER) #define pg_compiler_barrier_impl() __memory_barrier() #else #define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory") #endif There earlier was a typo defining pg_compiler_barrier instead of pg_compiler_barrier_impl for intel, maybe that's what you were referring to? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2014-06-30 20:28:52 +0200, Andres Freund wrote: > To quantify this, on my 2 socket xeon E5520 workstation - which is too > small to heavily show the contention problems in pgbench -S - the > numbers are: > > pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeably variability) > master: 152354.294117 > lwlocks-atomics: 170528.784958 > lwlocks-atomics-spin: 159416.202578 > > pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeably variability) > master: 16273.702191 > lwlocks-atomics: 16768.712433 > lwlocks-atomics-spin: 16744.913083 > > So, there really isn't a problem with the emulation. It's not actually > that surprising - the absolute number of atomic ops is prety much the > same. Where we earlier took a spinlock to increment shared we now still > take one. Tom, you've voiced concern that using atomics will regress performance for platforms that don't use atomics in a nearby thread. Can these numbers at least ease those a bit? I've compared the atomics vs emulated atomics on 32bit x86, 64bit x86 and POWER7. That's all I've access to at the moment. In all cases the performance with lwlocks ontop emulated atomics was the same as the old spinlock based algorithm or even a bit better. I'm sure that it's possible to design atomics based algorithms where that's not the case, but I don't think we need to solve those before we meet them. I think for most contended places which possibly can be improved two things matter: The number of instructions that need to work atomically (i.e. TAS/CAS/xchg for spinlocks, xadd/cmpxchg for atomics) and the duration locks are held. When converting to atomics, even if emulated, the hot paths shouldn't need more locked instructions than before and often enough the duration during which spinlocks are held will be smaller. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-26 00:42:37 +0300, Heikki Linnakangas wrote: > On 06/25/2014 11:36 PM, Andres Freund wrote: > > > >>>- I completely loathe the file layout you've proposed in > >>>src/include/storage. If we're going to have separate #include files > >>>for each architecture (and I'd rather we didn't go that route, because > >>>it's messy and unlike what we have now), then shouldn't that stuff go > >>>in src/include/port or a new directory src/include/port/atomics? > >>>Polluting src/include/storage with a whole bunch of files called > >>>atomics-whatever.h strikes me as completely ugly, and there's a lot of > >>>duplicate code in there. > >I don't mind moving it somewhere else and it's easy enough to change. > > I think having a separate file for each architecture is nice. I totally > agree that they don't belong in src/include/storage, though. s_lock.h has > always been misplaced there, but we've let it be for historical reasons, but > now that we're adding a dozen new files, it's time to move them out. So, src/include/port/atomics.h, src/include/port/atomics/generic-$compiler.h, src/include/port/atomics/arch-$arch.h, src/include/port/atomics/fallback.h, src/include/port/atomics/generic.h, src/backend/port/atomics.c ? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:
>
> > I think defining barrier support for some atomic ops and not for others
> > will lead to inconsistency with specs and we need to additionally
> > define rules for such atomic ops usage.
>
> Meh. It's quite common that read/load do not have barrier semantics. The
> majority of read/write users won't need barriers and it'll be expensive
> to provide them with them.
So in such a case they shouldn't use atomic API and rather access
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
..
+ pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
> On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:
>
> > I think defining barrier support for some atomic ops and not for others
> > will lead to inconsistency with specs and we need to additionally
> > define rules for such atomic ops usage.
>
> Meh. It's quite common that read/load do not have barrier semantics. The
> majority of read/write users won't need barriers and it'll be expensive
> to provide them with them.
So in such a case they shouldn't use atomic API and rather access
directly, I think part of the purpose to use atomic ops is also that they
should provide synchronisation for read/write among processes, earlier
we don't have any such facility, so wherever required we have no option
but to use barriers to achieve the same.
One more point here, in patch 64-bit atomic read/write are
implemented using *_exchange_* API, won't this automatically ensure
that 64-bit versions have barrier support?
+{
..
+ pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
..
> > > > In this case, I have a question for you.
>
> > > Either noone has ever tested postgres on itanium windows (quite
> > > possible), because afaik _Asm_sched_fence() doesn't work on
> > > windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> > > arefor HP's acc on HPUX.
> >
> > Hmm.. Do you think in such a case, it would have gone in below
> > define:
>
> > #elif defined(__INTEL_COMPILER)
> > /*
> > * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
> > */
> > #if defined(__ia64__) || defined(__ia64)
> > #define pg_memory_barrier() __mf()
> > #elif defined(__i386__) || defined(__x86_64__)
> > #define pg_memory_barrier() _mm_mfence()
> > #endif
>
> Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
> acc, I don't see why?
Sorry, I don't understand what you mean by above?
+}
> > > > > > > b) It's only supported from vista onwards. Afaik we still support
> > XP.
>
> Well, we have quite good evidence for it working by knowing that
> postgres has in the past worked on xp? If you have a better idea, send
> in a patch for today's barrier.h and I'll adopt that.
> > > > > > > b) It's only supported from vista onwards. Afaik we still support
> > XP.
>
> Well, we have quite good evidence for it working by knowing that
> postgres has in the past worked on xp? If you have a better idea, send
> in a patch for today's barrier.h and I'll adopt that.
Here the only point was to check if it is better to use version of
InterLocked... API which is a recommended for the case of Itanium-
based processor and will behave same for others. If you have any
inclination/suggestion for the same, then I can try for it.
> > > > In this case, I have a question for you.
>
> > > Either noone has ever tested postgres on itanium windows (quite
> > > possible), because afaik _Asm_sched_fence() doesn't work on
> > > windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> > > arefor HP's acc on HPUX.
> >
> > Hmm.. Do you think in such a case, it would have gone in below
> > define:
>
> > #elif defined(__INTEL_COMPILER)
> > /*
> > * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
> > */
> > #if defined(__ia64__) || defined(__ia64)
> > #define pg_memory_barrier() __mf()
> > #elif defined(__i386__) || defined(__x86_64__)
> > #define pg_memory_barrier() _mm_mfence()
> > #endif
>
> Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
> acc, I don't see why?
Sorry, I don't understand what you mean by above?
The point I wanted to clarify here was that is it possible that above
defines lead to different definitions for Itanium on windows?
> But they should go into atomics-generic-acc.h.
Okay, but in your current patch (0001-Basic-atomic-ops-implementation),
> But they should go into atomics-generic-acc.h.
Okay, but in your current patch (0001-Basic-atomic-ops-implementation),
they seem to be in different files.
> > -#define pg_compiler_barrier() __memory_barrier()
> >
> > Currently this will be considered as compiler barrier for
> > __INTEL_COMPILER, but after patch, I don't see this define. I think
> > after patch, it will be compiler_impl specific, but why such a change?
>
> #if defined(__INTEL_COMPILER)
> #define pg_compiler_barrier_impl() __memory_barrier()
> #else
> #define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")
> #endif
>
> There earlier was a typo defining pg_compiler_barrier instead of
> pg_compiler_barrier_impl for intel, maybe that's what you were referring to?
I have tried to search for this in patch file 0001-Basic-atomic-ops-implementation
> > -#define pg_compiler_barrier() __memory_barrier()
> >
> > Currently this will be considered as compiler barrier for
> > __INTEL_COMPILER, but after patch, I don't see this define. I think
> > after patch, it will be compiler_impl specific, but why such a change?
>
> #if defined(__INTEL_COMPILER)
> #define pg_compiler_barrier_impl() __memory_barrier()
> #else
> #define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")
> #endif
>
> There earlier was a typo defining pg_compiler_barrier instead of
> pg_compiler_barrier_impl for intel, maybe that's what you were referring to?
I have tried to search for this in patch file 0001-Basic-atomic-ops-implementation
and found that it has been removed from barrier.h but not added anywhere else.
On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Further review of patch:
> On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, "the results of the operation are available before the
results of any operation that appears after it in code" which means it applies
for both load and stores. Definition of read barrier just ensures loads.
So will be right to declare like above in comments?
b. I think it adheres to Acquire semantics for platforms where that semantics
are supported else it defaults to full memory ordering.
Example :
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
Is it right to declare that generic version of function adheres to
Acquire semantics.
2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API (InterlockedCompareExchange)
expecting volatile.
b. Also shouldn't this implementation be moved to atomics-generic-msvc.h
3.
/*
* pg_atomic_unlocked_test_flag - TAS()
*
* No barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
a. There is no setting in this, so using TAS in function comments
seems to be wrong.
b. BTW, I am not able see this function in C11 specs.
4.
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}
#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return (bool) pg_atomic_read_u32_impl(ptr);
}
Is there any reason to keep these two implementations different?
5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif
Don't we need to compare it with 1 instead of 0?
6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
* Can't do this in the semaphore based implementation - always return
* true. Do this in the header so compilers can optimize the test away.
*/
return true;
}
Why we can't implement this in semaphore based implementation?
7.
/*
* pg_atomic_clear_flag - release lock set by TAS()
*
* Release/write barrier semantics.
*/
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
a. Are Release and write barriers equivalent?
With release semantics, the results of the operation are available after the
results of any operation that appears before it in code.
A write barrier acts as a compiler barrier, and orders stores.
I think both definitions are not same.
b. IIUC then current code doesn't have release semantics for unlock/clear.
We are planing to introduce it in this patch, also this doesn't follow the
specs which says that clear should have memory_order_seq_cst ordering
semantics.
As per its current usage in patch (S_UNLOCK), it seems reasonable
to have *release* semantics for this API, however I think if we use it for
generic purpose then it might not be right to define it with Release semantics.
8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}
Are we sure that having memory barrier before clearing flag will
achieve *Release* semantics; as per my understanding the definition
of Release sematics is as below and above doesn't seem to follow the same.
"With release semantics, the results of the operation are available after
the results of any operation that appears before it in code."
9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}
What is the reason to implement exchange as compare-and-exchange, at least for
Windows I know corresponding function (InterlockedExchange) exists.
I could not see any other implementation of same.
I think this at least deserves comment explaining why we have done
the implementation this way.
10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
The function name and input value seems to be inconsistent.
The function name contains *_u32*, however the input value is *int32*.
11.
Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are
implemented using compare_exchange routines, don't we have any native API's
which we can use for implementation.
12.
I am not able to see API's pg_atomic_add_fetch_u32(), pg_atomic_sub_fetch_u32()
in C11 atomic ops specs, do you need it for something specific?
13.
+ * The non-intrinsics version is only available in vista upwards, so use the
+ * intrisc version.
spelling of intrinsic is wrong.
14.
* pg_memory_barrier - prevent the CPU from reordering memory access
*
* A memory barrier must act as a compiler barrier,
Is it ensured in all cases that pg_memory_barrier implies compiler barrier
as well?
Example for msvc case, the specs doesn't seem to guarntee it.
I understand that this rule (or assumption) is carried forward from existing
implementation, however I am not aware if it is true for all cases, so thought
of mentioning here even though we don't want to do anything considering it an
existing behaviour.
Links:
On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> >> wrote: > > Further review of patch: > 1. > /* > * pg_atomic_test_and_set_flag - TAS() > * > * Acquire/read barrier semantics. > */ > STATIC_IF_INLINE_DECLARE bool > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); > > a. I think Acquire and read barrier semantics aren't equal. > With acquire semantics, "the results of the operation are available before > the > results of any operation that appears after it in code" which means it > applies > for both load and stores. Definition of read barrier just ensures loads. > > So will be right to declare like above in comments? Yeah. Barriers can be by the operations they keep from being reordered (read, write, or both) and by the direction in which they prevent reordering (acquire, release, or both). So in theory there are 9 combinations: read barrier (both directions) read barrier (acquire) read barrier (release) write barrier (both directions) write barrier (acquire) write barrier (both directions) full barrier (both directions) full barrier (acquire) full barrier (release) To make things more confusing, a read barrier plus a write barrier need not equal a full barrier. Read barriers prevent reads from being reordered with other reads, and write barriers keep writes from being reordered with other writes, but neither prevents a read from being reordered relative to a write. A full barrier, however, must prevent all reordering: read/read, read/write, write/read, and write/write. Clarity here is essential. barrier.h only proposed to implement the following: read barrier (both directions), write barrier (both directions), full barrier (both directions) The reason I did that is because it seemed to be more or less what Linux was doing, and it's generally suitable for lock-less algorithms. The acquire and release semantics come into play specifically when you're using barriers to implement locking primitives, which is isn't what I was trying to enable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 9, 2014 at 8:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >
> > Further review of patch:
> > 1.
> > /*
> > * pg_atomic_test_and_set_flag - TAS()
> > *
> > * Acquire/read barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
> > With acquire semantics, "the results of the operation are available before
> > the
> > results of any operation that appears after it in code" which means it
> > applies
> > for both load and stores. Definition of read barrier just ensures loads.
> >
> > So will be right to declare like above in comments?
>
> Yeah. Barriers can be by the operations they keep from being
> reordered (read, write, or both) and by the direction in which they
> prevent reordering (acquire, release, or both). So in theory there
> are 9 combinations:
>
> read barrier (both directions)
> read barrier (acquire)
> read barrier (release)
> write barrier (both directions)
> write barrier (acquire)
> write barrier (both directions)
> full barrier (both directions)
> full barrier (acquire)
> full barrier (release)
With these definitions, I think it will fall into *full barrier (acquire)*
category as it needs to ensure that no reordering should happen
> On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >
> > Further review of patch:
> > 1.
> > /*
> > * pg_atomic_test_and_set_flag - TAS()
> > *
> > * Acquire/read barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
> > With acquire semantics, "the results of the operation are available before
> > the
> > results of any operation that appears after it in code" which means it
> > applies
> > for both load and stores. Definition of read barrier just ensures loads.
> >
> > So will be right to declare like above in comments?
>
> Yeah. Barriers can be by the operations they keep from being
> reordered (read, write, or both) and by the direction in which they
> prevent reordering (acquire, release, or both). So in theory there
> are 9 combinations:
>
> read barrier (both directions)
> read barrier (acquire)
> read barrier (release)
> write barrier (both directions)
> write barrier (acquire)
> write barrier (both directions)
> full barrier (both directions)
> full barrier (acquire)
> full barrier (release)
With these definitions, I think it will fall into *full barrier (acquire)*
category as it needs to ensure that no reordering should happen
for both load and store. As an example, we can refer its
implementation for msvc which ensures that it follows full memory
barrier semantics.
> To make things more confusing, a read barrier plus a write barrier
> need not equal a full barrier. Read barriers prevent reads from being
> reordered with other reads, and write barriers keep writes from being
> reordered with other writes, but neither prevents a read from being
> reordered relative to a write. A full barrier, however, must prevent
> all reordering: read/read, read/write, write/read, and write/write.
>
> Clarity here is essential.
>
> barrier.h only proposed to implement the following:
>
> read barrier (both directions), write barrier (both directions), full
> barrier (both directions)
As per my understanding of the general theory around barriers,
> The reason I did that is because it seemed to be more or less what
> Linux was doing, and it's generally suitable for lock-less algorithms.
> The acquire and release semantics come into play specifically when
> you're using barriers to implement locking primitives, which is isn't
> what I was trying to enable.
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> To make things more confusing, a read barrier plus a write barrier
> need not equal a full barrier. Read barriers prevent reads from being
> reordered with other reads, and write barriers keep writes from being
> reordered with other writes, but neither prevents a read from being
> reordered relative to a write. A full barrier, however, must prevent
> all reordering: read/read, read/write, write/read, and write/write.
>
> Clarity here is essential.
>
> barrier.h only proposed to implement the following:
>
> read barrier (both directions), write barrier (both directions), full
> barrier (both directions)
As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.
I think the current definitions of barriers is as per needs of usage in
PostgreSQL and not by referring standard (am I missing something
here), however as you explained the definitions are quite clear.
> The reason I did that is because it seemed to be more or less what
> Linux was doing, and it's generally suitable for lock-less algorithms.
> The acquire and release semantics come into play specifically when
> you're using barriers to implement locking primitives, which is isn't
> what I was trying to enable.
Now by implementing atomic-ops, I think we are moving more towards
lock-less algorithms, so I am not sure if we just want to adhere to
existing definitions as you listed above (what current barrier.h
implements) and implement accordingly or we can extend the existing
definitions.
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote: > As per my understanding of the general theory around barriers, > read and write are defined to avoid reordering due to compiler and > full memory barriers are defined to avoid reordering due to processors. > There are some processors that support instructions for memory > barriers with acquire, release and fence semantics. Not exactly, both compilers and processors can rearrange loads and stores. Because the architecture most developers work on (x86) has such a strong memory model (it's goes to lot of effort to hide reordering) people are under the impression that it's only the compiler you need to worry about, but that's not true for any other architechture. Memory barriers/fences are on the whole discouraged if you can use acquire/release semantics because the latter are faster and much easier to optimise for. I strongly recommend the "Atomic Weapons" talk on the C11 memory model to help understand how they work. As a bonus it includes correct implementations for the major architectures. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Sat, Jul 12, 2014 at 1:26 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
> > As per my understanding of the general theory around barriers,
> > read and write are defined to avoid reordering due to compiler and
> > full memory barriers are defined to avoid reordering due to processors.
> > There are some processors that support instructions for memory
> > barriers with acquire, release and fence semantics.
>
> Not exactly, both compilers and processors can rearrange loads and
> stores. Because the architecture most developers work on (x86) has
> such a strong memory model (it's goes to lot of effort to hide
> reordering) people are under the impression that it's only the compiler
> you need to worry about, but that's not true for any other
> architechture.
I am not saying that we need only barriers to avoid reordering due to
> I strongly recommend the "Atomic Weapons" talk on the C11 memory model
> to help understand how they work. As a bonus it includes correct
> implementations for the major architectures.
Yes that will be a good if we can can implement as per C11 memory model and
> On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
> > As per my understanding of the general theory around barriers,
> > read and write are defined to avoid reordering due to compiler and
> > full memory barriers are defined to avoid reordering due to processors.
> > There are some processors that support instructions for memory
> > barriers with acquire, release and fence semantics.
>
> Not exactly, both compilers and processors can rearrange loads and
> stores. Because the architecture most developers work on (x86) has
> such a strong memory model (it's goes to lot of effort to hide
> reordering) people are under the impression that it's only the compiler
> you need to worry about, but that's not true for any other
> architechture.
I am not saying that we need only barriers to avoid reordering due to
compiler, rather my point was that we need to avoid reordering due to
both compiler and processor, however ways to achieve it require different
API's
> Memory barriers/fences are on the whole discouraged if you can use
> acquire/release semantics because the latter are faster and much easier
> to optimise for.
Do all processors support instructions for memory barriers with acquire,
> Memory barriers/fences are on the whole discouraged if you can use
> acquire/release semantics because the latter are faster and much easier
> to optimise for.
Do all processors support instructions for memory barriers with acquire,
release semantics?
> I strongly recommend the "Atomic Weapons" talk on the C11 memory model
> to help understand how they work. As a bonus it includes correct
> implementations for the major architectures.
Yes that will be a good if we can can implement as per C11 memory model and
thats what I am referring while reviewing this patch, however even if that is not
possible or difficult to achieve in all cases, it is worth to have a discussion for
memory model used in current API's implemented in patch.
Hi Amit, On 2014-07-08 11:51:14 +0530, Amit Kapila wrote: > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: > > On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > Further review of patch: > 1. > /* > * pg_atomic_test_and_set_flag - TAS() > * > * Acquire/read barrier semantics. > */ > STATIC_IF_INLINE_DECLARE bool > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr); > > a. I think Acquire and read barrier semantics aren't equal. Right. It was mean as Acquire (thus including read barrier) semantics. I guess I'll better update README.barrier to have definitions of all barriers. > b. I think it adheres to Acquire semantics for platforms where > that semantics > are supported else it defaults to full memory ordering. > Example : > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) > > Is it right to declare that generic version of function adheres to > Acquire semantics. Implementing stronger semantics than required should always be fine. There's simply no sane way to work with the variety of platforms we support in any other way. > > 2. > bool > pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr) > { > return TAS((slock_t *) &ptr->sema); > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0)) Where's that from? I can't recall adding a line of code like that? > a. In above usage (ptr->sema), sema itself is not declared as volatile, > so would it be right to use it in this way for API > (InterlockedCompareExchange) > expecting volatile. Upgrading a pointer to volatile is defined, so I don't see the problem? > 3. > /* > * pg_atomic_unlocked_test_flag - TAS() > * > * No barrier semantics. > */ > STATIC_IF_INLINE_DECLARE bool > pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr); > > a. There is no setting in this, so using TAS in function comments > seems to be wrong. Good point. > b. BTW, I am not able see this function in C11 specs. So? It's needed for a sensible implementation of spinlocks ontop of atomics. > 4. > #if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && > defined(PG_HAVE_ATOMIC_EXCHANGE_U32) > .. > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG > static inline bool > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) > { > return pg_atomic_read_u32_impl(ptr) == 1; > } > > > #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && > defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) > .. > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG > static inline bool > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) > { > return (bool) pg_atomic_read_u32_impl(ptr); > } > > Is there any reason to keep these two implementations different? No, will change. > 5. > atomics-generic-gcc.h > #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG > static inline bool > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) > { > return ptr->value == 0; > } > #endif > > Don't we need to compare it with 1 instead of 0? Why? It returns true if the lock is free, makes sense imo? Will add comments to atomics.h > 6. > atomics-fallback.h > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr) > { > /* > * Can't do this in the semaphore based implementation - always return > * true. Do this in the header so compilers can optimize the test away. > */ > return true; > } > > Why we can't implement this in semaphore based implementation? Because semaphores don't provide the API for it? > 7. > /* > * pg_atomic_clear_flag - release lock set by TAS() > * > * Release/write barrier semantics. > */ > STATIC_IF_INLINE_DECLARE void > pg_atomic_clear_flag(volatile pg_atomic_flag *ptr); > > a. Are Release and write barriers equivalent? Same answer as above. Meant as "Release (thus including write barrier) semantics" > b. IIUC then current code doesn't have release semantics for unlock/clear. If you're referring to s_lock.h with 'current code', yes it should have:* On platforms with weak memory ordering, theTAS(), TAS_SPIN(), and* S_UNLOCK() macros must further include hardware-level memory fence* instructions to preventsimilar re-ordering at the hardware level.* TAS() and TAS_SPIN() must guarantee that loads and stores issued after* the macro are not executed until the lock has been obtained. Conversely,* S_UNLOCK() must guarantee that loadsand stores issued before the macro* have been executed before the lock is released. > We are planing to introduce it in this patch, also this doesn't follow the > specs which says that clear should have memory_order_seq_cst ordering > semantics. Making it guarantee full memory barrier semantics is a major performance loss on x86-64, so I don't want to do that. Also there is atomic_flag_test_and_set_explicit()... > As per its current usage in patch (S_UNLOCK), it seems reasonable > to have *release* semantics for this API, however I think if we use it for > generic purpose then it might not be right to define it with Release > semantics. I can't really see a user where it's not what you want. And if there is - well, it can make the semantics stronger if it needs that. > 8. > #define PG_HAS_ATOMIC_CLEAR_FLAG > static inline void > pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr) > { > /* XXX: release semantics suffice? */ > pg_memory_barrier_impl(); > pg_atomic_write_u32_impl(ptr, 0); > } > > Are we sure that having memory barrier before clearing flag will > achieve *Release* semantics; as per my understanding the definition > of Release sematics is as below and above doesn't seem to follow the > same. Yes, a memory barrier guarantees a release semantics. It guarantees more, but that's not a problem. > 9. > static inline uint32 > pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_) > { > uint32 old; > while (true) > { > old = pg_atomic_read_u32_impl(ptr); > if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_)) > break; > } > return old; > } > > What is the reason to implement exchange as compare-and-exchange, at least > for > Windows I know corresponding function (InterlockedExchange) exists. > I could not see any other implementation of same. > I think this at least deserves comment explaining why we have done > the implementation this way. Because Robert and Tom insist that we shouldn't add more operations employing hardware features. I think that's ok for now, we can always add more capabilities later. > 10. > STATIC_IF_INLINE_DECLARE uint32 > pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_); > STATIC_IF_INLINE_DECLARE uint32 > pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_); > > The function name and input value seems to be inconsistent. > The function name contains *_u32*, however the input value is *int32*. A u32 is implemented by adding/subtracting a signed int. There's some platforms why that's the only API provided by intrinsics and it's good enough for pretty much everything. > 11. > Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are > implemented using compare_exchange routines, don't we have any native API's > which we can use for implementation. Same reason as for 9). > 12. > I am not able to see API's pg_atomic_add_fetch_u32(), > pg_atomic_sub_fetch_u32() > in C11 atomic ops specs, do you need it for something specific? Yes, it's already used in the lwlocks code and it's provided by several other atomics libraries. I don't really see a reason not to provide it, especially as some platforms could implement it more efficiently. > 14. > * pg_memory_barrier - prevent the CPU from reordering memory access > * > * A memory barrier must act as a compiler barrier, > > Is it ensured in all cases that pg_memory_barrier implies compiler barrier > as well? Yes. > Example for msvc case, the specs doesn't seem to guarntee it. I think it actually indirectly guarantees it, but I'm on a plane and can't check :) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-07-10 08:46:55 +0530, Amit Kapila wrote: > As per my understanding of the general theory around barriers, > read and write are defined to avoid reordering due to compiler and > full memory barriers are defined to avoid reordering due to > processors. No, that's not the case. There's several processors where write/read barriers are an actual thing on the hardware level. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 14, 2014 at 12:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > Further review of patch:
> > 1.
> > /*
> > * pg_atomic_test_and_set_flag - TAS()
> > *
> > * Acquire/read barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
>
> Right. It was mean as Acquire (thus including read barrier) semantics.
> > b. I think it adheres to Acquire semantics for platforms where
> > that semantics
> > are supported else it defaults to full memory ordering.
> > Example :
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> >
> > Is it right to declare that generic version of function adheres to
> > Acquire semantics.
>
> Implementing stronger semantics than required should always be
> fine. There's simply no sane way to work with the variety of platforms
> we support in any other way.
> >
> > 2.
> > bool
> > pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return TAS((slock_t *) &ptr->sema);
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
>
> Where's that from?
>
> > 8.
> > #define PG_HAS_ATOMIC_CLEAR_FLAG
> > static inline void
> > pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /* XXX: release semantics suffice? */
> > pg_memory_barrier_impl();
> > pg_atomic_write_u32_impl(ptr, 0);
> > }
> >
> > Are we sure that having memory barrier before clearing flag will
> > achieve *Release* semantics; as per my understanding the definition
> > of Release sematics is as below and above doesn't seem to follow the
> > same.
>
> Yes, a memory barrier guarantees a release semantics. It guarantees
> more, but that's not a problem.
You are right.
> > 9.
> > static inline uint32
> > pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> > {
> > uint32 old;
> > while (true)
> > {
> > old = pg_atomic_read_u32_impl(ptr);
> > if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
> > break;
> > }
> > return old;
> > }
> >
> > What is the reason to implement exchange as compare-and-exchange, at least
> > for
> > Windows I know corresponding function (InterlockedExchange) exists.
> > I could not see any other implementation of same.
> > I think this at least deserves comment explaining why we have done
> > the implementation this way.
>
> Because Robert and Tom insist that we shouldn't add more operations
> employing hardware features. I think that's ok for now, we can always
> add more capabilities later.
No issues, may be a comment indicating some form of reason would be good.
> > 10.
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
> >
> > The function name and input value seems to be inconsistent.
> > The function name contains *_u32*, however the input value is *int32*.
>
> A u32 is implemented by adding/subtracting a signed int. There's some
> platforms why that's the only API provided by intrinsics and it's good
> enough for pretty much everything.
>
> > 12.
> > I am not able to see API's pg_atomic_add_fetch_u32(),
> > pg_atomic_sub_fetch_u32()
> > in C11 atomic ops specs, do you need it for something specific?
>
> Yes, it's already used in the lwlocks code and it's provided by several
> other atomics libraries. I don't really see a reason not to provide it,
> especially as some platforms could implement it more efficiently.
> > 14.
> > * pg_memory_barrier - prevent the CPU from reordering memory access
> > *
> > * A memory barrier must act as a compiler barrier,
> >
> > Is it ensured in all cases that pg_memory_barrier implies compiler barrier
> > as well?
>
> Yes.
>
> > Example for msvc case, the specs doesn't seem to guarntee it.
>
> I think it actually indirectly guarantees it, but I'm on a plane and
> can't check :)
No problem let me know when you are comfortable, this is just for
> On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > Further review of patch:
> > 1.
> > /*
> > * pg_atomic_test_and_set_flag - TAS()
> > *
> > * Acquire/read barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
>
> Right. It was mean as Acquire (thus including read barrier) semantics.
Okay, I got confused by reading comments, may be saying the same
explicitly in comments is better.
> I guess I'll better update README.barrier to have definitions of all
> barriers.
I think updating about the reasons behind using specific
barriers for atomic operations defined by PG can be useful
for people who want to understand or use the atomic op API's.
> > that semantics
> > are supported else it defaults to full memory ordering.
> > Example :
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> >
> > Is it right to declare that generic version of function adheres to
> > Acquire semantics.
>
> Implementing stronger semantics than required should always be
> fine. There's simply no sane way to work with the variety of platforms
> we support in any other way.
Agreed, may be we can have a note somewhere either in code or
readme to mention about it, if you think that can be helpful.
> >
> > 2.
> > bool
> > pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return TAS((slock_t *) &ptr->sema);
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
>
> Where's that from?
Its there in s_lock.h. Refer below code:
#ifdef WIN32_ONLY_COMPILER
typedef LONG slock_t;
#define HAS_TEST_AND_SET
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> I can't recall adding a line of code like that?
> > a. In above usage (ptr->sema), sema itself is not declared as volatile,
> > so would it be right to use it in this way for API
> > (InterlockedCompareExchange)
> > expecting volatile.
>
> Upgrading a pointer to volatile is defined, so I don't see the problem?
> > 3.
> > /*
> > * pg_atomic_unlocked_test_flag - TAS()
> > *
> > * No barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
> >
> > a. There is no setting in this, so using TAS in function comments
> > seems to be wrong.
>
> Good point.
>
> > b. BTW, I am not able see this function in C11 specs.
>
> So? It's needed for a sensible implementation of spinlocks ontop of
> atomics.
> > 5.
> > atomics-generic-gcc.h
> > #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > static inline bool
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return ptr->value == 0;
> > }
> > #endif
> >
> > Don't we need to compare it with 1 instead of 0?
>
> Why? It returns true if the lock is free, makes sense imo?
> > 6.
> > atomics-fallback.h
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /*
> > * Can't do this in the semaphore based implementation - always return
> > * true. Do this in the header so compilers can optimize the test away.
> > */
> > return true;
> > }
> >
> > Why we can't implement this in semaphore based implementation?
>
> Because semaphores don't provide the API for it?
> > 7.
> > /*
> > * pg_atomic_clear_flag - release lock set by TAS()
> > *
> > * Release/write barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE void
> > pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
> >
> > a. Are Release and write barriers equivalent?
>
> Same answer as above. Meant as "Release (thus including write barrier) semantics"
>
> > b. IIUC then current code doesn't have release semantics for unlock/clear.
>
> If you're referring to s_lock.h with 'current code', yes it should have:
It is not added by your patch but used in your patch.
I am not sure if that is what excatly you want atomic API to define
for WIN32, but I think it is picking this definition. I have one question
here, if the value of sema is other than 1 or 0, then it won't set the flag
and not sure what it will return (ex. if the value is negative), so is this
implementation completely sane?
> > a. In above usage (ptr->sema), sema itself is not declared as volatile,
> > so would it be right to use it in this way for API
> > (InterlockedCompareExchange)
> > expecting volatile.
>
> Upgrading a pointer to volatile is defined, so I don't see the problem?
Thats okay. However all other structure definitions use it as
volatile, example for atomics-arch-amd64.h it is defined as follows:
#define PG_HAVE_ATOMIC_FLAG_SUPPORT
typedef struct pg_atomic_flag
{
volatile char value;
} pg_atomic_flag;
So is there any harm if we keep this definition in sync with others?
> > 3.
> > /*
> > * pg_atomic_unlocked_test_flag - TAS()
> > *
> > * No barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
> >
> > a. There is no setting in this, so using TAS in function comments
> > seems to be wrong.
>
> Good point.
>
> > b. BTW, I am not able see this function in C11 specs.
>
> So? It's needed for a sensible implementation of spinlocks ontop of
> atomics.
Okay got the point, do you think it makes sense to have a common
agreement/consensus w.r.t which API's are necessary as a first
cut. For me as a reviewer, if the API is implemented as per specs or
what it is desired to then we can have it, however I am not sure if there
is general consensus or at least some people agreed to set of API's
which are required for first cut, have I missed some conclusion/discussion
about it.
>> > 5.
> > atomics-generic-gcc.h
> > #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > static inline bool
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return ptr->value == 0;
> > }
> > #endif
> >
> > Don't we need to compare it with 1 instead of 0?
>
> Why? It returns true if the lock is free, makes sense imo?
Other implementation in atomics-generic.h seems to implement it other
way
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}
> Will add comments to atomics.h
I think that will be helpful.
> > 6.
> > atomics-fallback.h
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /*
> > * Can't do this in the semaphore based implementation - always return
> > * true. Do this in the header so compilers can optimize the test away.
> > */
> > return true;
> > }
> >
> > Why we can't implement this in semaphore based implementation?
>
> Because semaphores don't provide the API for it?
Okay, but it is not clear from comments why this test should always
return true, basically I am not able to think why incase of missing API
returning true is correct behaviour for this API?
> > 7.
> > /*
> > * pg_atomic_clear_flag - release lock set by TAS()
> > *
> > * Release/write barrier semantics.
> > */
> > STATIC_IF_INLINE_DECLARE void
> > pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
> >
> > a. Are Release and write barriers equivalent?
>
> Same answer as above. Meant as "Release (thus including write barrier) semantics"
>
> > b. IIUC then current code doesn't have release semantics for unlock/clear.
>
> If you're referring to s_lock.h with 'current code', yes it should have:
I was referring to next point (point number 8) which you have clarified below.
> > 8.
> > #define PG_HAS_ATOMIC_CLEAR_FLAG
> > static inline void
> > pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /* XXX: release semantics suffice? */
> > pg_memory_barrier_impl();
> > pg_atomic_write_u32_impl(ptr, 0);
> > }
> >
> > Are we sure that having memory barrier before clearing flag will
> > achieve *Release* semantics; as per my understanding the definition
> > of Release sematics is as below and above doesn't seem to follow the
> > same.
>
> Yes, a memory barrier guarantees a release semantics. It guarantees
> more, but that's not a problem.
You are right.
> > 9.
> > static inline uint32
> > pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> > {
> > uint32 old;
> > while (true)
> > {
> > old = pg_atomic_read_u32_impl(ptr);
> > if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
> > break;
> > }
> > return old;
> > }
> >
> > What is the reason to implement exchange as compare-and-exchange, at least
> > for
> > Windows I know corresponding function (InterlockedExchange) exists.
> > I could not see any other implementation of same.
> > I think this at least deserves comment explaining why we have done
> > the implementation this way.
>
> Because Robert and Tom insist that we shouldn't add more operations
> employing hardware features. I think that's ok for now, we can always
> add more capabilities later.
No issues, may be a comment indicating some form of reason would be good.
> > 10.
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
> >
> > The function name and input value seems to be inconsistent.
> > The function name contains *_u32*, however the input value is *int32*.
>
> A u32 is implemented by adding/subtracting a signed int. There's some
> platforms why that's the only API provided by intrinsics and it's good
> enough for pretty much everything.
As such no issues, but just wondering is there any reason to prefer
*_u32 instead of simple _32 or *32.
>
> > 12.
> > I am not able to see API's pg_atomic_add_fetch_u32(),
> > pg_atomic_sub_fetch_u32()
> > in C11 atomic ops specs, do you need it for something specific?
>
> Yes, it's already used in the lwlocks code and it's provided by several
> other atomics libraries. I don't really see a reason not to provide it,
> especially as some platforms could implement it more efficiently.
Okay, I find below implementation in your patch. Won't it add the value
twice?
+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_ADD_U32)
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
> > 14.
> > * pg_memory_barrier - prevent the CPU from reordering memory access
> > *
> > * A memory barrier must act as a compiler barrier,
> >
> > Is it ensured in all cases that pg_memory_barrier implies compiler barrier
> > as well?
>
> Yes.
>
> > Example for msvc case, the specs doesn't seem to guarntee it.
>
> I think it actually indirectly guarantees it, but I'm on a plane and
> can't check :)
No problem let me know when you are comfortable, this is just for
Hi Andres, Are you planning to continue working on this? Summarizing the discussion so far: * Robert listed a bunch of little cleanup tasks (http://www.postgresql.org/message-id/CA+TgmoZShVVqjaKUL6P3kDvZVPibtgkzoti3M+Fvvjg5V+XJ0A@mail.gmail.com). Amit posted yet more detailed commends. * We talked about changing the file layout. I think everyone is happy with your proposal here: http://www.postgresql.org/message-id/20140702131729.GP21169@awork2.anarazel.de, with an overview description of what goes where (http://www.postgresql.org/message-id/CA+TgmoYuxfsm2dSy48=JgUTRh1mozrvmjLHgqrFkU7_WPV-xLA@mail.gmail.com) * Talked about nested spinlocks. The consensus seems to be to (continue to) forbid nested spinlocks, but allow atomic operations while holding a spinlock. When atomics are emulated with spinlocks, it's OK to acquire the emulation spinlock while holding another spinlock. * Talked about whether emulating atomics with spinlocks is a good idea. You posted performance results showing that at least with the patch to use atomics to implement LWLocks, the emulation performs more or less the same as the current spinlock-based implementation. I believe everyone was more or less satisfied with that. So ISTM we have consensus that the approach to spinlock emulation and nesting stuff is OK. To finish the patch, you'll just need to address the file layout and the laundry list of other little details that have been raised. - Heikki
Hi, On 2014-08-20 12:43:05 +0300, Heikki Linnakangas wrote: > Are you planning to continue working on this? Yes, I am! I've recently been on holiday and now I'm busy with catching up with everything that has happened since. I hope to have a next version ready early next week. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, I've finally managed to incorporate (all the?) feedback I got for 0.5. Imo the current version looks pretty good. Most notable changes: * Lots of comment improvements * code moved out of storage/ into port/ * separated the s_lock.h changes into its own commit * merged i386/amd64 into one file * fixed lots of the little details Amit noticed * fixed lots of XXX/FIXMEs * rebased to today's master * tested various gcc/msvc versions * extended the regression tests * ... The patches: 0001: The actual atomics API 0002: Implement s_lock.h support ontop the atomics API. Existing implementations continue to be used unless FORCE_ATOMICS_BASED_SPINLOCKS is defined 0003-0005: Not proposed for review here. Just included because code actually using the atomics make testing them easier. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Add-a-basic-atomic-ops-API-abstracting-away-platform.patch
- 0002-Add-implementation-of-spinlocks-using-the-atomics.h-.patch
- 0003-XXX-Hack-ShmemAlloc-to-align-all-allocations-to-cach.patch
- 0004-Convert-the-PGPROC-lwWaitLink-list-into-a-dlist-inst.patch
- 0005-Wait-free-LW_SHARED-lwlock-acquiration.patch
23.09.2014, 00:01, Andres Freund kirjoitti: > I've finally managed to incorporate (all the?) feedback I got for > 0.5. Imo the current version looks pretty good. > > Most notable changes: > * Lots of comment improvements > * code moved out of storage/ into port/ > * separated the s_lock.h changes into its own commit > * merged i386/amd64 into one file > * fixed lots of the little details Amit noticed > * fixed lots of XXX/FIXMEs > * rebased to today's master > * tested various gcc/msvc versions > * extended the regression tests > * ... > > The patches: > 0001: The actual atomics API I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal dingo) with this patch but regression tests failed due to: /export/home/os/postgresql/src/test/regress/regress.so: symbol pg_write_barrier_impl: referenced symbol not found which turns out to be caused by a leftover PG_ prefix in ifdefs for HAVE_GCC__ATOMIC_INT64_CAS. Removing the PG_ prefix fixed the build and regression tests. Attached a patch to strip the invalid prefix. > 0002: Implement s_lock.h support ontop the atomics API. Existing > implementations continue to be used unless > FORCE_ATOMICS_BASED_SPINLOCKS is defined Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS - both builds passed regression tests. > 0003-0005: Not proposed for review here. Just included because code > actually using the atomics make testing them easier. I'll look at these patches later. / Oskari
Attachment
On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote: > 23.09.2014, 00:01, Andres Freund kirjoitti: > >I've finally managed to incorporate (all the?) feedback I got for > >0.5. Imo the current version looks pretty good. > > > >Most notable changes: > >* Lots of comment improvements > >* code moved out of storage/ into port/ > >* separated the s_lock.h changes into its own commit > >* merged i386/amd64 into one file > >* fixed lots of the little details Amit noticed > >* fixed lots of XXX/FIXMEs > >* rebased to today's master > >* tested various gcc/msvc versions > >* extended the regression tests > >* ... > > > >The patches: > >0001: The actual atomics API > > I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal > dingo) with this patch but regression tests failed due to: > > /export/home/os/postgresql/src/test/regress/regress.so: symbol > pg_write_barrier_impl: referenced symbol not found > > which turns out to be caused by a leftover PG_ prefix in ifdefs for > HAVE_GCC__ATOMIC_INT64_CAS. Removing the PG_ prefix fixed the build and > regression tests. Attached a patch to strip the invalid prefix. Gah. Stupid last minute changes... Thanks for diagnosing. Will integrate. > >0002: Implement s_lock.h support ontop the atomics API. Existing > > implementations continue to be used unless > > FORCE_ATOMICS_BASED_SPINLOCKS is defined > > Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS - > both builds passed regression tests. Cool. > >0003-0005: Not proposed for review here. Just included because code > > actually using the atomics make testing them easier. > > I'll look at these patches later. Cool. Although I'm not proposing them to be integrated as-is... Greetings, Andres Freund
Hi, On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote: > 23.09.2014, 00:01, Andres Freund kirjoitti: > >I've finally managed to incorporate (all the?) feedback I got for > >0.5. Imo the current version looks pretty good. > > > >Most notable changes: > >* Lots of comment improvements > >* code moved out of storage/ into port/ > >* separated the s_lock.h changes into its own commit > >* merged i386/amd64 into one file > >* fixed lots of the little details Amit noticed > >* fixed lots of XXX/FIXMEs > >* rebased to today's master > >* tested various gcc/msvc versions > >* extended the regression tests > >* ... > > > >The patches: > >0001: The actual atomics API > > I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal > dingo) with this patch but regression tests failed due to: Btw, if you could try sun studio it'd be great. I wrote the support for it blindly, and I'd be surprised if I got it right on the first try. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
23.09.2014, 15:18, Andres Freund kirjoitti: > On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote: >> 23.09.2014, 00:01, Andres Freund kirjoitti: >>> The patches: >>> 0001: The actual atomics API >> >> I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal >> dingo) with this patch but regression tests failed due to: > > Btw, if you could try sun studio it'd be great. I wrote the support for > it blindly, and I'd be surprised if I got it right on the first try. I just installed Solaris Studio 12.3 and tried compiling this: "../../../../src/include/port/atomics/generic-sunpro.h", line 54: return value type mismatch "../../../../src/include/port/atomics/generic-sunpro.h", line 77: return value type mismatch "../../../../src/include/port/atomics/generic-sunpro.h", line 79: #if-less #endif "../../../../src/include/port/atomics/generic-sunpro.h", line 81: #if-less #endif atomic_add_64 and atomic_add_32 don't return anything (the atomic_add_*_nv variants return the new value) and there were a few extra #endifs. Regression tests pass after applying the attached patch which defines PG_HAS_ATOMIC_ADD_FETCH_U32. There doesn't seem to be a fallback for defining pg_atomic_fetch_add based on pg_atomic_add_fetch so pg_atomic_add_fetch now gets implemented using pg_atomic_compare_exchange. Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS with these patches: "../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL does not have native spinlock support on this platform.... atomics/generic.h would implement atomic flags using operations exposed by atomics/generic-sunpro.h, but atomics/fallback.h is included before it and it defines functions for flag operations which s_lock.h doesn't want to use. / Oskari
Attachment
On 2014-09-24 00:27:25 +0300, Oskari Saarenmaa wrote: > 23.09.2014, 15:18, Andres Freund kirjoitti: > >On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote: > >>23.09.2014, 00:01, Andres Freund kirjoitti: > >>>The patches: > >>>0001: The actual atomics API > >> > >>I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal > >>dingo) with this patch but regression tests failed due to: > > > >Btw, if you could try sun studio it'd be great. I wrote the support for > >it blindly, and I'd be surprised if I got it right on the first try. > > I just installed Solaris Studio 12.3 and tried compiling this: Cool. > "../../../../src/include/port/atomics/generic-sunpro.h", line 54: return > value type mismatch > "../../../../src/include/port/atomics/generic-sunpro.h", line 77: return > value type mismatch > "../../../../src/include/port/atomics/generic-sunpro.h", line 79: #if-less > #endif > "../../../../src/include/port/atomics/generic-sunpro.h", line 81: #if-less > #endif > > atomic_add_64 and atomic_add_32 don't return anything (the atomic_add_*_nv > variants return the new value) and there were a few extra #endifs. > Regression tests pass after applying the attached patch which defines > PG_HAS_ATOMIC_ADD_FETCH_U32. Thanks for the fixes! Hm. I think then it's better to simply not implement addition and rely on cmpxchg. > Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS > with these patches: > > "../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL > does not have native spinlock support on this platform.... > > atomics/generic.h would implement atomic flags using operations exposed by > atomics/generic-sunpro.h, but atomics/fallback.h is included before it and > it defines functions for flag operations which s_lock.h doesn't want to use. Right. It should check not just for flag support, but also for 32bit atomics. Easily fixable. I'll send a new version with your fixes incorporated tomorrow. Thanks for looking into this! Very helpful. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/23/2014 12:01 AM, Andres Freund wrote: > Hi, > > I've finally managed to incorporate (all the?) feedback I got for > 0.5. Imo the current version looks pretty good. Thanks! I agree it looks good. Some random comments after a quick read-through: There are some spurious whitespace changes in spin.c. > + * These files can provide the full set of atomics or can do pretty much > + * nothing if all the compilers commonly used on these platforms provide > + * useable generics. It will often make sense to define memory barrier > + * semantics in those, since e.g. the compiler intrinsics for x86 memory > + * barriers can't know we don't need read/write barriers anything more than a > + * compiler barrier. I didn't understand the latter sentence. > + * Don't add an inline assembly of the actual atomic operations if all the > + * common implementations of your platform provide intrinsics. Intrinsics are > + * much easier to understand and potentially support more architectures. What about uncommon implementations, then? I think we need to provide inline assembly if any supported implementation on the platform does not provide intrinsics, otherwise we fall back to emulation. Or is that exactly what you're envisioning we do? I believe such a situation hasn't come up on the currently supported platforms, so we don't necessary have to have a solution for that, but the comment doesn't feel quite right either. > +#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \ > + Assert(TYPEALIGN((uintptr_t)(ptr), bndr)) Would be better to call this AssertAlignment, to make it clear that this is an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps move this to c.h where other assertions are - this seems generally useful. > + * Spinloop delay - Spurious dash. > +/* > + * pg_fetch_add_until_u32 - saturated addition to variable > + * > + * Returns the the value of ptr after the arithmetic operation. > + * > + * Full barrier semantics. > + */ > +STATIC_IF_INLINE uint32 > +pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_, > + uint32 until) > +{ > + CHECK_POINTER_ALIGNMENT(ptr, 4); > + return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until); > +} > + This was a surprise to me, I don't recall discussion of an "fetch-add-until" operation, and hadn't actually ever heard of it before. None of the subsequent patches seem to use it either. Can we just leave this out? s/the the/the/ > +#ifndef PG_HAS_ATOMIC_WRITE_U64 > +#define PG_HAS_ATOMIC_WRITE_U64 > +static inline void > +pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val) > +{ > + /* > + * 64 bit writes aren't safe on all platforms. In the generic > + * implementation implement them as an atomic exchange. That might store a > + * 0, but only if the prev. value also was a 0. > + */ > + pg_atomic_exchange_u64_impl(ptr, val); > +} > +#endif Why is 0 special? > + * fail or suceed, but always return the old value s/suceed/succeed/. Add a full stop to end. > + /* > + * Can't run the test under the semaphore emulation, it doesn't handle > + * checking edge cases well. > + */ > +#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION > + test_atomic_flag(); > +#endif Huh? Is the semaphore emulation broken, then? - Heikki
On 2014-09-24 14:59:19 +0300, Heikki Linnakangas wrote: > On 09/23/2014 12:01 AM, Andres Freund wrote: > >Hi, > > > >I've finally managed to incorporate (all the?) feedback I got for > >0.5. Imo the current version looks pretty good. > > Thanks! I agree it looks good. Some random comments after a quick > read-through: > > There are some spurious whitespace changes in spin.c. Huh. Unsure how they crept in. Will fix. > >+ * These files can provide the full set of atomics or can do pretty much > >+ * nothing if all the compilers commonly used on these platforms provide > >+ * useable generics. It will often make sense to define memory barrier > >+ * semantics in those, since e.g. the compiler intrinsics for x86 memory > >+ * barriers can't know we don't need read/write barriers anything more than a > >+ * compiler barrier. > > I didn't understand the latter sentence. Hm. The background here is that postgres doesn't need memory barriers affect sse et al registers - which is why read/write barriers currently can be defined to be empty on TSO architectures like x86. But e.g. gcc's barrier intrinsics don't assume that, so they do a mfence or lock xaddl for read/write barriers. Which isn't cheap. Will try to find a way to rephrase. > >+ * Don't add an inline assembly of the actual atomic operations if all the > >+ * common implementations of your platform provide intrinsics. Intrinsics are > >+ * much easier to understand and potentially support more architectures. > > What about uncommon implementations, then? I think we need to provide inline > assembly if any supported implementation on the platform does not provide > intrinsics, otherwise we fall back to emulation. Or is that exactly what > you're envisioning we do? Yes, that's what I'm envisioning. E.g. old versions of solaris studio don't provide compiler intrinsics and also don't provide reliable ways to barrier semantics. It doesn't seem worthwile to add a inline assembly version for that special case. > I believe such a situation hasn't come up on the currently supported > platforms, so we don't necessary have to have a solution for that, but the > comment doesn't feel quite right either. It's the reason I added a inline assembly version of atomics for x86 gcc... > >+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \ > >+ Assert(TYPEALIGN((uintptr_t)(ptr), bndr)) > > Would be better to call this AssertAlignment, to make it clear that this is > an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps move > this to c.h where other assertions are - this seems generally useful. Sounds good. > >+ * Spinloop delay - > > Spurious dash. Probably should rather add a bit more explanation... > >+/* > >+ * pg_fetch_add_until_u32 - saturated addition to variable > >+ * > >+ * Returns the the value of ptr after the arithmetic operation. > >+ * > >+ * Full barrier semantics. > >+ */ > >+STATIC_IF_INLINE uint32 > >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_, > >+ uint32 until) > >+{ > >+ CHECK_POINTER_ALIGNMENT(ptr, 4); > >+ return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until); > >+} > >+ > > This was a surprise to me, I don't recall discussion of an "fetch-add-until" > operation, and hadn't actually ever heard of it before. It was included from the first version on, and I'd mentioned it a couple times. > None of the subsequent patches seem to use it either. Can we just > leave this out? I have a further patch (lockless buffer header manipulation) that uses it to manipulate the usagecount. I can add it back later if you feel strongly about it. > s/the the/the/ > > >+#ifndef PG_HAS_ATOMIC_WRITE_U64 > >+#define PG_HAS_ATOMIC_WRITE_U64 > >+static inline void > >+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val) > >+{ > >+ /* > >+ * 64 bit writes aren't safe on all platforms. In the generic > >+ * implementation implement them as an atomic exchange. That might store a > >+ * 0, but only if the prev. value also was a 0. > >+ */ > >+ pg_atomic_exchange_u64_impl(ptr, val); > >+} > >+#endif > > Why is 0 special? That bit should actually be in the read_u64 implementation, not write... If you do a compare and exchange you have to provide an 'expected' value. So, if you do a compare and exchange and the previous value is 0 the value will be written superflously - but the actual value won't change. > >+ /* > >+ * Can't run the test under the semaphore emulation, it doesn't handle > >+ * checking edge cases well. > >+ */ > >+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION > >+ test_atomic_flag(); > >+#endif > > Huh? Is the semaphore emulation broken, then? No, it's not. The semaphore emulation doesn't implement pg_atomic_unlocked_test_flag() (as there's no sane way to do that with semaphores that I know of). Instead it always returns true. Which disturbs the test. I guess I'll expand that comment... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/24/2014 03:37 PM, Andres Freund wrote: >>> > >+/* >>> > >+ * pg_fetch_add_until_u32 - saturated addition to variable >>> > >+ * >>> > >+ * Returns the the value of ptr after the arithmetic operation. >>> > >+ * >>> > >+ * Full barrier semantics. >>> > >+ */ >>> > >+STATIC_IF_INLINE uint32 >>> > >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_, >>> > >+ uint32 until) >>> > >+{ >>> > >+ CHECK_POINTER_ALIGNMENT(ptr, 4); >>> > >+ return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until); >>> > >+} >>> > >+ >> > >> >This was a surprise to me, I don't recall discussion of an "fetch-add-until" >> >operation, and hadn't actually ever heard of it before. > It was included from the first version on, and I'd mentioned it a couple > times. There doesn't seem to be any hardware implementations of that in the patch. Is there any architecture that has an instruction or compiler intrinsic for that? - Heikki
On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote: > On 09/24/2014 03:37 PM, Andres Freund wrote: > >>>> >+/* > >>>> >+ * pg_fetch_add_until_u32 - saturated addition to variable > >>>> >+ * > >>>> >+ * Returns the the value of ptr after the arithmetic operation. > >>>> >+ * > >>>> >+ * Full barrier semantics. > >>>> >+ */ > >>>> >+STATIC_IF_INLINE uint32 > >>>> >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_, > >>>> >+ uint32 until) > >>>> >+{ > >>>> >+ CHECK_POINTER_ALIGNMENT(ptr, 4); > >>>> >+ return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until); > >>>> >+} > >>>> >+ > >>> > >>>This was a surprise to me, I don't recall discussion of an "fetch-add-until" > >>>operation, and hadn't actually ever heard of it before. > >It was included from the first version on, and I'd mentioned it a couple > >times. > > There doesn't seem to be any hardware implementations of that in the patch. > Is there any architecture that has an instruction or compiler intrinsic for > that? You can implement it rather efficiently on ll/sc architectures. But I don't really think it matters. I prefer add_until (I've seen it named saturated add before as well) to live in the atomics code, rather than reimplement it in atomics employing code. I guess you see that differently? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote: >> There doesn't seem to be any hardware implementations of that in the patch. >> Is there any architecture that has an instruction or compiler intrinsic for >> that? > You can implement it rather efficiently on ll/sc architectures. But I > don't really think it matters. I prefer add_until (I've seen it named > saturated add before as well) to live in the atomics code, rather than > reimplement it in atomics employing code. I guess you see that > differently? I think the question is more like "what in the world happened to confining ourselves to a small set of atomics". I doubt either that this exists natively anywhere, or that it's so useful that we should expect platforms to have efficient implementations. regards, tom lane
On 2014-09-24 12:44:09 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote: > >> There doesn't seem to be any hardware implementations of that in the patch. > >> Is there any architecture that has an instruction or compiler intrinsic for > >> that? > > > You can implement it rather efficiently on ll/sc architectures. But I > > don't really think it matters. I prefer add_until (I've seen it named > > saturated add before as well) to live in the atomics code, rather than > > reimplement it in atomics employing code. I guess you see that > > differently? > > I think the question is more like "what in the world happened to confining > ourselves to a small set of atomics". I fail to see why the existance of a wrapper around compare-exchange (which is one of the primitives we'd agreed upon) runs counter to the agreement that we'll only rely on a limited number of atomics on the hardware level? > I doubt either that this exists > natively anywhere, or ethat it's so useful that we should expect platforms > to have efficient implementations. It's useful for my work to get rid of most LockBufHdr() calls (to manipulate usagecount locklessly). That's why I added it. We can delay it till that patch is ready, but I don't really see the benefit. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/24/2014 07:57 PM, Andres Freund wrote: > On 2014-09-24 12:44:09 -0400, Tom Lane wrote: >> Andres Freund <andres@2ndquadrant.com> writes: >>> On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote: >>>> There doesn't seem to be any hardware implementations of that in the patch. >>>> Is there any architecture that has an instruction or compiler intrinsic for >>>> that? >> >>> You can implement it rather efficiently on ll/sc architectures. But I >>> don't really think it matters. I prefer add_until (I've seen it named >>> saturated add before as well) to live in the atomics code, rather than >>> reimplement it in atomics employing code. I guess you see that >>> differently? >> >> I think the question is more like "what in the world happened to confining >> ourselves to a small set of atomics". > > I fail to see why the existance of a wrapper around compare-exchange > (which is one of the primitives we'd agreed upon) runs counter to > the agreement that we'll only rely on a limited number of atomics on the > hardware level? It might be a useful function, but if there's no hardware implementation for it, it doesn't belong in atomics.h. We don't want to turn it into a general library of useful little functions. >> I doubt either that this exists >> natively anywhere, or ethat it's so useful that we should expect platforms >> to have efficient implementations. Googling around, ARM seems to have a QADD instruction that does that. But AFAICS it's not an atomic instruction that you could use for synchronization purposes, just a regular instruction. > It's useful for my work to get rid of most LockBufHdr() calls (to > manipulate usagecount locklessly). That's why I added it. We can delay > it till that patch is ready, but I don't really see the benefit. Yeah, please leave it out for now, we can argue about it later. Even if we want it in the future, it would be just dead, untested code now. - Heikki
On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote: > On 09/24/2014 07:57 PM, Andres Freund wrote: > >On 2014-09-24 12:44:09 -0400, Tom Lane wrote: > >>Andres Freund <andres@2ndquadrant.com> writes: > >>>On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote: > >>>>There doesn't seem to be any hardware implementations of that in the patch. > >>>>Is there any architecture that has an instruction or compiler intrinsic for > >>>>that? > >> > >>>You can implement it rather efficiently on ll/sc architectures. But I > >>>don't really think it matters. I prefer add_until (I've seen it named > >>>saturated add before as well) to live in the atomics code, rather than > >>>reimplement it in atomics employing code. I guess you see that > >>>differently? > >> > >>I think the question is more like "what in the world happened to confining > >>ourselves to a small set of atomics". > > > >I fail to see why the existance of a wrapper around compare-exchange > >(which is one of the primitives we'd agreed upon) runs counter to > >the agreement that we'll only rely on a limited number of atomics on the > >hardware level? > > It might be a useful function, but if there's no hardware implementation for > it, it doesn't belong in atomics.h. We don't want to turn it into a general > library of useful little functions. Uh. It belongs there because it *atomically* does a saturated add? That's precisely what atomics.h is about, so I don't see why it'd belong anywhere else. > >>I doubt either that this exists > >>natively anywhere, or ethat it's so useful that we should expect platforms > >>to have efficient implementations. > > Googling around, ARM seems to have a QADD instruction that does that. But > AFAICS it's not an atomic instruction that you could use for synchronization > purposes, just a regular instruction. On ARM - and many other LL/SC architectures - you can implement it atomically using LL/SC. In pseudocode: while (true) { load_linked(mem, reg); if (reg + add < limit) reg = reg + add; else reg = limit; if (store_conditional(mem,reg)) break } That's how pretty much *all* atomic instructions work on such architectures. > >It's useful for my work to get rid of most LockBufHdr() calls (to > >manipulate usagecount locklessly). That's why I added it. We can delay > >it till that patch is ready, but I don't really see the benefit. > > Yeah, please leave it out for now, we can argue about it later. Even if we > want it in the future, it would be just dead, untested code now. Ok, will remove it for now. I won't repost a version with it removed, as removing a function as the only doesn't seem to warrant reposting it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 09/24/2014 07:57 PM, Andres Freund wrote: >> On 2014-09-24 12:44:09 -0400, Tom Lane wrote: >>> I think the question is more like "what in the world happened to confining >>> ourselves to a small set of atomics". >> I fail to see why the existance of a wrapper around compare-exchange >> (which is one of the primitives we'd agreed upon) runs counter to >> the agreement that we'll only rely on a limited number of atomics on the >> hardware level? > It might be a useful function, but if there's no hardware implementation > for it, it doesn't belong in atomics.h. We don't want to turn it into a > general library of useful little functions. Note that the spinlock code separates s_lock.h (hardware implementations) from spin.h (a hardware-independent abstraction layer). Perhaps there's room for a similar separation here. I tend to agree with Heikki that wrappers around compare-exchange ought not be conflated with compare-exchange itself, even if there might theoretically be architectures where the wrapper function could be implemented directly. regards, tom lane
On 2014-09-24 14:28:18 -0400, Tom Lane wrote: > Note that the spinlock code separates s_lock.h (hardware implementations) > from spin.h (a hardware-independent abstraction layer). Perhaps there's > room for a similar separation here. Luckily that separation exists ;). The hardware dependant bits are in different files than the platform independent functionality on top of them (atomics/generic.h). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-24 20:27:39 +0200, Andres Freund wrote: > On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote: > I won't repost a version with it removed, as removing a function as the > only doesn't seem to warrant reposting it. I've fixed (thanks Alvaro!) some minor additional issues besides the removal and addressing earlier comments from Heikki: * removal of two remaining non-ascii copyright signs. The only non ascii name character that remains is Alvaro's name in the commit message. * additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE * there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split into different configure tests and thus has a different name. A later patch entirely removes that reference, which is why I'd missed that... Heikki has marked the patch as 'ready for commiter' in the commitfest (conditional on his remarks being addressed) and I agree. There seems to be little benefit in waiting further. There *definitely* will be some platform dependant issues, but that won't change by waiting longer. I plan to commit this quite soon unless somebody protests really quickly. Sorting out the issues on platforms I don't have access to based on buildfarm feedback will take a while given how infrequent some of the respective animals run... But that's just life. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Andres, * Andres Freund (andres@anarazel.de) wrote: > Heikki has marked the patch as 'ready for commiter' in the commitfest > (conditional on his remarks being addressed) and I agree. There seems to > be little benefit in waiting further. There *definitely* will be some > platform dependant issues, but that won't change by waiting longer. Agreed. I won't claim to have done as good a review as others, but I've been following along as best I've been able to. > I plan to commit this quite soon unless somebody protests really > quickly. +1 from me for at least moving forward to see what happens. > Sorting out the issues on platforms I don't have access to based on > buildfarm feedback will take a while given how infrequent some of the > respective animals run... But that's just life. The buildfarm is another tool (as well as Coverity) that provides a lot of insight into issues we may not realize ahead of time, but using them (currently) means we have to eventually go beyond looking at the code individually and actually commit it. I'm very excited to see this progress and trust you, Heikki, Robert, and the others involved in this work to have done an excellent job and to be ready and prepared to address any issues which come from it, or to pull it out if it turns out to be necessary (which I seriously doubt, but still). Thanks! Stephen
On Thu, Sep 25, 2014 at 3:23 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Andres Freund (andres@anarazel.de) wrote: >> Heikki has marked the patch as 'ready for commiter' in the commitfest >> (conditional on his remarks being addressed) and I agree. There seems to >> be little benefit in waiting further. There *definitely* will be some >> platform dependant issues, but that won't change by waiting longer. > > Agreed. I won't claim to have done as good a review as others, but I've > been following along as best I've been able to. FWIW, I agree that you've already done plenty to ensure that this works well on common platforms. We cannot reasonably expect you to be just as sure that there are not problems on obscure platforms ahead of commit. It isn't unheard of for certain committers (that don't like Windows, say), to commit things despite having a suspicion that their patch will turn some fraction of the buildfarm red. To a certain extent, that's what we have it for. -- Peter Geoghegan
On Thu, Sep 25, 2014 at 6:03 PM, Andres Freund <andres@anarazel.de> wrote: > On 2014-09-24 20:27:39 +0200, Andres Freund wrote: >> On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote: >> I won't repost a version with it removed, as removing a function as the >> only doesn't seem to warrant reposting it. > > I've fixed (thanks Alvaro!) some minor additional issues besides the > removal and addressing earlier comments from Heikki: > * removal of two remaining non-ascii copyright signs. The only non ascii > name character that remains is Alvaro's name in the commit message. > * additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE > * there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split > into different configure tests and thus has a different name. A later > patch entirely removes that reference, which is why I'd missed that... > > Heikki has marked the patch as 'ready for commiter' in the commitfest > (conditional on his remarks being addressed) and I agree. There seems to > be little benefit in waiting further. There *definitely* will be some > platform dependant issues, but that won't change by waiting longer. > > I plan to commit this quite soon unless somebody protests really > quickly. > > Sorting out the issues on platforms I don't have access to based on > buildfarm feedback will take a while given how infrequent some of the > respective animals run... But that's just life. I feel like this could really use a developer README: what primitives do we have, which ones are likely to be efficient or inefficient on which platforms, how do atomics interact with barriers, etc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-25 21:02:28 -0400, Robert Haas wrote: > I feel like this could really use a developer README: what primitives > do we have, which ones are likely to be efficient or inefficient on > which platforms, how do atomics interact with barriers, etc. atomics.h has most of that. Including documenting which barrier semantics the individual operations have. One change I can see as being a good idea is to move/transform the definition of acquire/release barriers from s_lock.h to README.barrier. It doesn't document which operations are efficient on which platform, but I wouldn't know what to say about that topic? The current set of operations should be fast on pretty much all platforms. What information would you like to see? An example of how to use atomics? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 25, 2014 at 07:14:34PM +0200, Andres Freund wrote: > * gcc, msvc work. acc, xlc, sunpro have blindly written support which > should be relatively easy to fix up. I tried this on three xlc configurations. (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working required the attached patch. None of my xlc configurations have an <atomic.h> header, and a web search turned up no evidence of one in connection with xlc platforms. Did you learn of a configuration needing atomic.h under xlc? The rest of the changes are hopefully self-explanatory in light of the documentation cited in generic-xlc.h. (Building on AIX has regressed in other ways unrelated to atomics; I will write more about that in due course.) (2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le, http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This compiler has a Clang-derived C frontend. It defines __GNUC__ and offers GCC-style __sync_* atomics. Therefore, PostgreSQL selects generic-gcc.h. test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types segfaults at runtime. I have reported this to the vendor. Adding "pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good workaround. I could add a comment about that to src/test/regress/sql/lock.sql for affected folks to see in regression.diffs. To do better, we could make PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet another option is to force use of generic-xlc.h on this compiler. (3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le, modifying atomics.h to force use of generic-xlc.h. While not a supported PostgreSQL configuration, I felt this would make an interesting data point. It worked fine after applying the patch developed for the AIX configuration. Thanks, nm
Attachment
On 2015-07-04 18:40:41 -0400, Noah Misch wrote: > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working > required the attached patch. None of my xlc configurations have an <atomic.h> > header, and a web search turned up no evidence of one in connection with xlc > platforms. Did you learn of a configuration needing atomic.h under xlc? The > rest of the changes are hopefully self-explanatory in light of the > documentation cited in generic-xlc.h. (Building on AIX has regressed in other > ways unrelated to atomics; I will write more about that in due course.) I wrote this entirely blindly, as evidenced here by the changes you needed. At the time somebody had promised to soon put up an aix animal, but that apparently didn't work out. Will you apply? Having the ability to test change seems to put you in a much better spot then me. > (2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le, > http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This > compiler has a Clang-derived C frontend. It defines __GNUC__ and offers > GCC-style __sync_* atomics. Phew. I don't see much reason to try to support this. Why would that be interesting? > Therefore, PostgreSQL selects generic-gcc.h. > test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types > segfaults at runtime. I have reported this to the vendor. Adding > "pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good > workaround. I could add a comment about that to src/test/regress/sql/lock.sql > for affected folks to see in regression.diffs. To do better, we could make > PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet > another option is to force use of generic-xlc.h on this compiler. It seems fair enough to simply add another test and include generic-xlc.h in that case. If it's indeed xlc, why not? > (3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le, > modifying atomics.h to force use of generic-xlc.h. While not a supported > PostgreSQL configuration, I felt this would make an interesting data point. > It worked fine after applying the patch developed for the AIX configuration. Cool. Greetings, Andres Freund
On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote: > On 2015-07-04 18:40:41 -0400, Noah Misch wrote: > > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working > > required the attached patch. > Will you apply? Having the ability to test change seems to put you in a > much better spot then me. I will. > > (2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le, > > http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This > > compiler has a Clang-derived C frontend. It defines __GNUC__ and offers > > GCC-style __sync_* atomics. > > Phew. I don't see much reason to try to support this. Why would that be > interesting? > > > Therefore, PostgreSQL selects generic-gcc.h. > > test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types > > segfaults at runtime. I have reported this to the vendor. Adding > > "pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good > > workaround. I could add a comment about that to src/test/regress/sql/lock.sql > > for affected folks to see in regression.diffs. To do better, we could make > > PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet > > another option is to force use of generic-xlc.h on this compiler. > > It seems fair enough to simply add another test and include > generic-xlc.h in that case. If it's indeed xlc, why not? Works for me. I'll do that as attached.
Attachment
On Sat, Jul 04, 2015 at 08:07:49PM -0400, Noah Misch wrote: > On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote: > > On 2015-07-04 18:40:41 -0400, Noah Misch wrote: > > > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working > > > required the attached patch. > > > Will you apply? Having the ability to test change seems to put you in a > > much better spot then me. > > I will. That commit (0d32d2e) permitted things to compile and usually pass tests, but I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen sixteen duplicate-catalog-OID failures. The compiler has always been xlC, and the branch has been HEAD or 9.5. Examples: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-15%2004%3A12%3A49 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-08%2001%3A20%3A16 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2015-12-11%2005%3A25%3A54 These suggested OidGenLock wasn't doing its job. I've seen similar symptoms around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le. The problem is generic-xlc.h pg_atomic_compare_exchange_u32_impl() issuing __isync() before __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our own s_lock.h, its references, and other projects' usage: https://github.com/boostorg/smart_ptr/blob/boost-1.60.0/include/boost/smart_ptr/detail/sp_counted_base_vacpp_ppc.hpp#L55 http://redmine.gromacs.org/projects/gromacs/repository/entry/src/external/thread_mpi/include/thread_mpi/atomic/xlc_ppc.h?rev=v5.1.2#L112 https://www-01.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/asm_example.html This patch's test case would have failed about half the time under today's generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug 99% of the time would run far longer, hence this compromise. Archaeology enthusiasts may enjoy the bug's changing presentation over time. LWLockAcquire() exposed it after commit ab5194e redesigned lwlock.c in terms of atomics. Assert builds were unaffected for commits in [ab5194e, bbfd7ed); atomics.h asserts fattened code enough to mask the bug. However, removing the AssertPointerAlignment() from pg_atomic_fetch_or_u32() would have sufficed to surface it in assert builds. All builds demonstrated the bug once bbfd7ed introduced xlC pg_attribute_noreturn(), which elicits a simpler instruction sequence around the ExceptionalCondition() call embedded in each Assert. nm
Attachment
On 2016-02-15 12:11:29 -0500, Noah Misch wrote: > These suggested OidGenLock wasn't doing its job. I've seen similar symptoms > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, > 5765-J08)" for ppc64le. The problem is generic-xlc.h > pg_atomic_compare_exchange_u32_impl() issuing __isync() before > __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our > own s_lock.h, its references, and other projects' usage: Ugh. You're right! It's about not moving code before the stwcx... Do you want to add the new test (no objection, curious), or is that more for testing? Greetings, Andres
On Mon, Feb 15, 2016 at 06:33:42PM +0100, Andres Freund wrote: > On 2016-02-15 12:11:29 -0500, Noah Misch wrote: > > These suggested OidGenLock wasn't doing its job. I've seen similar symptoms > > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, > > 5765-J08)" for ppc64le. The problem is generic-xlc.h > > pg_atomic_compare_exchange_u32_impl() issuing __isync() before > > __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our > > own s_lock.h, its references, and other projects' usage: > > Ugh. You're right! It's about not moving code before the stwcx... > > Do you want to add the new test (no objection, curious), or is that more > for testing? The patch's test would join PostgreSQL indefinitely.
Noah Misch <noah@leadboat.com> writes: > That commit (0d32d2e) permitted things to compile and usually pass tests, but > I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen > sixteen duplicate-catalog-OID failures. I'd been wondering about those ... > These suggested OidGenLock wasn't doing its job. I've seen similar symptoms > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, > 5765-J08)" for ppc64le. The problem is generic-xlc.h > pg_atomic_compare_exchange_u32_impl() issuing __isync() before > __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our > own s_lock.h, its references, and other projects' usage: Nice catch! > This patch's test case would have failed about half the time under today's > generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug > 99% of the time would run far longer, hence this compromise. Sounds like a reasonable compromise to me, although I wonder about the value of it if we stick it into pgbench's TAP tests. How many of the slower buildfarm members are running the TAP tests? Certainly mine are not. regards, tom lane
On 5 July 2015 at 06:54, Andres Freund <andres@anarazel.de> wrote:
I wrote this entirely blindly, as evidenced here by the changes you
needed. At the time somebody had promised to soon put up an aix animal,
but that apparently didn't work out.
They're not desperately keen to get their products out there for testing.
--
On Mon, Feb 15, 2016 at 02:02:41PM -0500, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > That commit (0d32d2e) permitted things to compile and usually pass tests, but > > I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen > > sixteen duplicate-catalog-OID failures. > > I'd been wondering about those ... > > > These suggested OidGenLock wasn't doing its job. I've seen similar symptoms > > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, > > 5765-J08)" for ppc64le. The problem is generic-xlc.h > > pg_atomic_compare_exchange_u32_impl() issuing __isync() before > > __compare_and_swap(). __isync() shall follow __compare_and_swap(); see our > > own s_lock.h, its references, and other projects' usage: > > Nice catch! > > > This patch's test case would have failed about half the time under today's > > generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug > > 99% of the time would run far longer, hence this compromise. > > Sounds like a reasonable compromise to me, although I wonder about the > value of it if we stick it into pgbench's TAP tests. How many of the > slower buildfarm members are running the TAP tests? Certainly mine are > not. I missed a second synchronization bug in generic-xlc.h, but the pgbench test suite caught it promptly after commit 008608b. Buildfarm members mandrill and hornet failed[1] or deadlocked within two runs. The bug is that the pg_atomic_compare_exchange_*() specifications grant "full barrier semantics", but generic-xlc.h provided only the semantics of an acquire barrier. Commit 008608b exposed that older problem; its LWLockWaitListUnlock() relies on pg_atomic_compare_exchange_u32() (via pg_atomic_fetch_and_u32()) for release barrier semantics. The fix is to issue "sync" before each compare-and-swap, in addition to the "isync" after it. This is consistent with the corresponding GCC atomics. The pgbench test suite survived dozens of runs so patched. [1] http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-04-11%2003%3A14%3A13
Attachment
On 2016-04-23 21:54:07 -0400, Noah Misch wrote: > I missed a second synchronization bug in generic-xlc.h, but the pgbench test > suite caught it promptly after commit 008608b. Nice catch. > The bug is that the pg_atomic_compare_exchange_*() specifications > grant "full barrier semantics", but generic-xlc.h provided only the > semantics of an acquire barrier. I find the docs at http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html to be be awfully silent about that matter. I guess you just looked at the assembler code? It's nice that one can figure out stuff like that from an architecture manual, but it's sad that the docs for the intrinsics is silent about that matter. I do wonder if I shouldn't have left xlc fall by the wayside, and make it use the fallback implementation. Given it's popularity (or rather lack thereof) that'd probably have been a better choice. But that ship has sailed. Except that I didn't verify the rs6000_pre_atomic_barrier() and __fetch_and_add() internals about emitted sync/isync, the patch looks good. We've so far not referred to "sequential consistency", but given it's rise in popularity, I don't think it hurts. Stricly speaking generic-xlc.c shouldn't contain inline-asm, but given xlc is only there for power... Greetings, Andres Freund
On Mon, Apr 25, 2016 at 11:52:04AM -0700, Andres Freund wrote: > On 2016-04-23 21:54:07 -0400, Noah Misch wrote: > > The bug is that the pg_atomic_compare_exchange_*() specifications > > grant "full barrier semantics", but generic-xlc.h provided only the > > semantics of an acquire barrier. > > I find the docs at > http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html > to be be awfully silent about that matter. I guess you just looked at > the assembler code? It's nice that one can figure out stuff like that > from an architecture manual, but it's sad that the docs for the > intrinsics is silent about that matter. Right. The intrinsics provide little value as abstractions if one checks the generated code to deduce how to use them. I was tempted to replace the intrinsics calls with inline asm. At least these functions are unlikely to change over time. > Except that I didn't verify the rs6000_pre_atomic_barrier() and > __fetch_and_add() internals about emitted sync/isync, the patch looks > good. We've so far not referred to "sequential consistency", but given > it's rise in popularity, I don't think it hurts. Thanks; committed.